"Whenever you can, share. You never know who all will be able to see far away standing upon your shoulders!"
I write mainly on topics related to science and technology.
Sometimes, I create tools and animation.
Jan. 12, 2021
Author - manisar
If you work in, or have an interest in, Information Technology or Mathematics, then Information Theory is something you must be acquainted with. If you haven't heard of it, check the section Information Theory - Quick Introduction on this page. What I am providing below is an intuitive build-up which is generally absent from most explanations of information theory - including the one that came from its founder.
If you are in a hurry, you may jump to the section Defining Information or even to Let's Seal the Deal.
While it has established so many relationships, directly or indirectly, the most basic and most famous equation information theory has brought forth is the one for entropy, which actually stems from the definition of the information content, or the information associated with any event in a probability distribution. The (informational) entropy of a system is given by:
$$H = - \sum_i{p_i \log_2({p_i})}$$
The individual terms on the RHS are a product of two factors - the probability \(p\) of an event and the \(\log_2\) of this probability \(p\). This second factor is what the mystery is all about. It is called the information content, or simply the information, related to the outcomes of an event. What, on earth, does this mean? I mean, apart from the fact that this definition of information brings the subsequent entropy formula on par with what we have for entropy in physics, what does it actually signify? It turns out that this definition of information does have a very deep metaphysical meaning, and with some interesting insight it comes alive with great practical meaning and interpretation. Let's see how.
In very simple terms, the information of any object is the smallest effort needed to identify the object among similar objects. By similar is meant objects of the same frequency or probability.
E.g. 299 792 458 m/s is a big number (followed by a unit), but if that is the only value the speed of light can have (in vacuum), we don't need any symbol to identify what value the speed of light currently has (in any experiment in vacuum). Just by saying 'speed of light', we'll mean one and only one value. So, the information of this entity can be said to be 0.
Similarly, if any event-outcome or system-state has a probability or frequency of 0.25, this means that we can identify such outcomes or states among all the similar outcomes or states by using no more than 4 symbols. Let's call these symbols units. E.g. we can use ★ to represent the first such outcome, ★★ to represent the second... ★★★★ to identify the fourth. We won't need more than 4 ★s for identifying any such outcome. So, the information of such an outcome or state can be said to be 4 (it's not unfair to see the effort as the number of units needed to identify the outcome or state).
By our analysis so far, we can say $\boxed { i = \frac{1}{p} }$, where p is the probability of the event (this definition will mean $p=1$ $\implies$ $i=1$ instead of $0$, but we can live with that for now).
But the effort can be reduced by letting our unit have more than one value. E.g. if we use ★ and ■, our effort is reduced by half - any outcome of 0.25 probability can now be identified among its siblings by just two of ★ and ■.
Suppose we needed 64 univalued units to identify an outcome. Now, if we let each of our units have one of two values, we'll need only 6 such units. If we let each unit have 4 values, we'll need only 3 units. The new effort is the $\log$ of the original effort with a base equal to the number of values our unit can take. But the original effort was $\frac{1}{p}$. This means that if we use bits (units that can have one of two values), we can define information to be: $$ \boxed { i = \log_2{\frac{1}{p}} } $$
Obviously, this definition will be modified based upon the unit, but given a unit, this is what is meant by the information of an outcome or state.
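To make this concrete, here is a minimal Python sketch of the counting above; the function name `units_needed` and the example probability \(p = 1/64\) are mine, chosen purely for illustration. It reproduces the 64-tally-marks vs. 6-bits vs. 3-quaternary-units example.

```python
import math

def units_needed(p, values_per_unit):
    """Units needed to single out an outcome of probability p when each
    unit can take `values_per_unit` distinct values."""
    if values_per_unit == 1:
        return 1 / p                          # tally marks: we may need up to 1/p of them
    return math.log(1 / p, values_per_unit)   # multivalued units: the log of that effort

p = 1 / 64
print(units_needed(p, 1))  # 64.0 -> 64 tally marks
print(units_needed(p, 2))  # ~6   -> 6 bits
print(units_needed(p, 4))  # ~3   -> 3 quaternary units
```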
Note that the absolute information of an outcome is reduced by using multivalued units because, in such a scheme, each unit carries some information individually, and by information of the outcome we now mean the information apart from the information contained in the units themselves. If we combine these two, we get the original information that can be used to identify the outcome uniquely. And now, instead of saying just information, we more rightly say that an outcome has $i$ bits of information (or $i$ units in general).
When we want to describe (or communicate) a specific outcome of an event (or the state of a system) to a listener (receiver), we do so by sending some "information" that goes from the sender to the receiver. And we can do it in two ways:
1. by sending the actual description of the outcome, or
2. by sending a reference (a code) that points to the outcome in some context shared between the sender and the receiver.
For example, if you need to tell me over telegram which country won the most medals in the 2018 Winter Olympics, you can either tell me that it was "Norway", or, if we both have an identical list of countries, you can tell me that it was country number "130".
With the latter, the disadvantage is that we need to have this list of countries in advance, but this is greatly outweighed by the fact that once we have the list, every telegram will cost much less.
In this (latter) approach, we always make use of some kind of coding scheme, which is another way of saying that we look at the outcome(s) in a context.
A rather unintuitive realization is that even when you think you are not using a coding scheme, you mostly are!
In the example above, if you mention "Norway" instead of "130", you are actually encoding a representation of the "sound-of-the-14th-letter followed by the sound-of-the-15th-letter and so on" - these positions being in an identical list of sounds we both have - the Latin alphabet.
Now, the next natural question to ask is: Ok, by sending a code (instead of the actual description of the outcome) we have saved some money, but can we further reduce the cost of our telegram? The answer is yes. If it's Norway that wins quite often, using 1 or 2 for Norway is a better choice than using 130; 130 can be used for a country that rarely wins. So, we see that by bringing in a coding scheme, or a context, we have implicitly brought in probabilities. We can, of course, choose not to use the probabilities in our encoding, but they are there.
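As a rough sketch of the saving, the snippet below compares the expected telegram length under the two assignments. The winning probabilities and the alternative code numbers here are made up purely for illustration.

```python
# Toy comparison of expected codeword length (in digits) when frequent winners
# get short codes versus fixed alphabetical positions. All numbers are made up.
win_prob = {"Norway": 0.5, "Germany": 0.3, "Canada": 0.15, "Fiji": 0.05}

alphabetical_code = {"Norway": "130", "Germany": "74", "Canada": "32", "Fiji": "62"}
frequency_code    = {"Norway": "1",   "Germany": "2",  "Canada": "3",  "Fiji": "130"}

def expected_length(code):
    return sum(p * len(code[country]) for country, p in win_prob.items())

print(expected_length(alphabetical_code))  # 2.5 digits per telegram
print(expected_length(frequency_code))     # 1.1 digits per telegram
```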
Whether or not we choose to use the probabilities in our encoding, it is important to understand that in communicating via a coding scheme, the "information" that is sent - and that results in the identification of an outcome (of an event) at the receiver's end - can be, and mostly is, completely different from what the outcome actually looks like. "130" is different from "Norway".
The definition of information, whatever it may be, should be consistent with the observations above: the information of an event outcome or system-state is conveyed through a coding scheme (a context), and it need not resemble the outcome itself.
The next step is to see if we can have a more defined relationship between information, coding scheme and probability. If yes, we may be able to use it in a proper way for reducing the amount of information that needs to be sent.
Let's look at incorporating the coding scheme first. For defining information, we have a starting point - we saw above that if we encode an outcome of higher probability (Norway) in such a way that a smaller codeword (i.e. a smaller piece of information) can represent it, we are better off.
Can we say that it is the length of the codeword that is its information?
But in defining the length of the codeword that represents the outcome as its information, we have not made use of the probability of the outcome at all - which, as we saw, has a direct reciprocal effect on the information that needs to be sent. Let's denote information with \(i\) and probability with \(p\).
Ok, so what if we say \(i = 1/p\)?
There are at least two problems now.
First, if the probability of an outcome = \(1\), we absolutely don't need any information to be sent, and so \(i\) should be zero, which is not happening with the formula used above. Let's say we use \(i = (1/p) - 1\).
Now, for \(p = 0.25\), we'll get \(i = 1/0.25-1 = 3\). Ok maybe, but can we do better?
And look at this issue as well. For argument's sake, say the probabilities of Norway, Canada, Russia and Germany winning are all \(0.25\) each. We can encode all four of these outcomes using only 2 bits. If we go by \(i=1/p \text{, or } (1/p) - 1 \), \(i_{Norway}\) itself is \(4\) or \(3\). So at this point we do have a relationship that uses the probability of the outcome, but the "information" we defined this way seems to have no connection with the length of the codeword; it depends only on the probability now (we have swung a bit too far the other way).
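A quick numeric check of this mismatch, with the four equiprobable winners assigned 2-bit codewords (the particular codeword assignment is just one possible choice):

```python
# Four equally likely winners can share 2-bit codewords, yet the candidate
# formulas give a different number:
p = 0.25
codewords = {"Norway": "00", "Canada": "01", "Russia": "10", "Germany": "11"}

print(len(codewords["Norway"]))  # actual codeword length: 2 bits
print(1 / p)                     # candidate i = 1/p      -> 4.0
print(1 / p - 1)                 # candidate i = 1/p - 1  -> 3.0
```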
We are looking for a formula for \(i\) that should be related to the length of the codeword-that-represents-the-outcome and that length should be somehow related to the probability of the outcome.
For finding the length of the codeword needed to describe an outcome in an event, we need to know the total number of outcomes. For 2 outcomes, we need 1 bit. For 4, we need 2, and so on. But what we have in hand is the probability of the outcomes.
How can we find the total number of outcomes using the probability of each outcome? It is actually very easy.
Say, we have an event that can have any of \(n\) possible outcomes with each outcome denoted as \(o_i\) having a probability \(p_i\). We saw above (just before the beginning of this section) that with \(i\) bits we will be able to encode all the possible outcomes of this event, where \(i\) is given by the equation \(n = 2^i\). But what can be the upper limit of \(n\) if we know all the \(p_i\)'s? It's interesting to note that the smallest \(p\) solely limits the upper bound of \(n\)!
Let's say \(p_j\) is the smallest. Then we can write \(p_i = p_j+a_i\) for every \(i\) where \(a_i\) is \(0\) or a small positive number. So, we have:
$$\begin{align}
1 & = p_1 + p_2 + \cdots + p_n\\[6pt]
& = (p_j + a_1) + (p_j + a_2) + \cdots + (p_j + a_n)\\[6pt]
& = n \cdot p_j + (a_1 + a_2 + \cdots + a_n)\\[6pt]
& \ge n \cdot p_j \\[6pt]
\Rightarrow n & \le \frac{1}{p_j} \tag{1}\label{1}
\end{align}$$
So, the total number of outcomes of such an event (where \(o_p\) is the smallest-probability outcome, having probability \(p\)) cannot be more than \(\frac{1}{p}\). And this, in turn, means that \(i\) bits, where \(i\) is given by the equation \(\frac{1}{p} = 2^i\text{, or }i = \log_2(\frac{1}{p})\), can be used to encode all the outcomes of such an event - including \(o_p\).
This means that the length of the codeword for representing \(o_p\) can have a safe upper limit as \(\log_2(\frac{1}{p})\) - we'll be able to encode this and every other outcome in such an event using \(\log_2(\frac{1}{p})\) bits.
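The following sketch checks this bound numerically for a made-up distribution (the probabilities are arbitrary); it rounds \(\log_2(\frac{1}{p})\) up to a whole number of bits, since in practice we can only use whole bits.

```python
import math

# A made-up distribution; its smallest probability is p_min = 0.1.
probs = [0.4, 0.3, 0.2, 0.1]
p_min = min(probs)

n = len(probs)                           # actual number of outcomes: 4
n_max = 1 / p_min                        # upper bound from equation (1): 10.0
bits = math.ceil(math.log2(1 / p_min))   # whole bits that surely suffice: 4

print(n <= n_max)      # True
print(2 ** bits >= n)  # True: 4 bits can index every outcome, with room to spare
```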
So, can we define information for an outcome that has a probability \(p\), as \(\log_2(\frac{1}{p})\), in an event where this \(p\) is the smallest? We are close.
So, we have seen that for an outcome with a probability \(p\), \(i =\log_2(\frac{1}{p})\) bits are sufficient to encode it in an event in which this \(p\) is the smallest (among the probabilities of all other outcomes).
The question is - what if this \(p\) is not the smallest, i.e. there are outcomes of smaller probabilities? Do these many bits still suffice for encoding this outcome, or will the additional outcomes force us to use more bits for encoding even the existing outcomes? If these many bits are sufficient, we can get rid of the part "in an event where this \(p\) is the smallest" from the definition of \(i\). It turns out, we can!
First, note (as proven above) that for the number of outcomes \( n_p \) in an event in which \(p\) is the lowest probability, \(n_p \le \frac{1}{p} \), meaning \( n_p \le 2^i \), and this means that \( n_p = 2^i - \epsilon \), where \( \epsilon \) is zero or a positive integer and \(i = \log_2(\frac{1}{p})\). Here, \( \epsilon \) number of encodings are free, i.e. not being used for representing any outcome.
Now, let's say we have added more outcomes to the event with \(p^\prime\) being the new lowest probability outcome. So, the maximum number of outcomes now becomes \(\frac{1}{p^\prime}\), and let $i^\prime$ denote the new number of bits needed to code any of these \(\frac{1}{p^\prime}\) outcomes. This means that the maximum number of additional outcomes will be given by:
\[\begin{align}
\Delta n & = n_{p^\prime} - n_p \\[6pt]
& \le \frac{1}{p^\prime} - n_p \text{, } \because n_{p^\prime} \le \frac{1}{p^\prime} \\[6pt]
& = \frac{1}{p^\prime} - (2^i - \epsilon) \text{, where } i = \log_2(\frac{1}{p}) \\[6pt]
& = \frac{1}{p^\prime} - 2^i + \epsilon \\[6pt]
\therefore \Delta n& \le 2^{i^\prime} - 2^i + \epsilon \text{, where } i^\prime = \log_2(\frac{1}{p^\prime}) \\[6pt]
\end{align}\]
But, guess what, \(2^{i^\prime} - 2^i \) is the number of outcomes that can be encoded strictly using the additional (\(i^\prime - i\)) bits. And \( \epsilon \) number of outcomes are anyway encodable using the unused arrangements of \( i \) bits.
For example, let's say we added $5$ new outcomes to $3$ outcomes (encodable by $2$ bits), and now we have $8$ outcomes (encodable by $3$ bits). Among the $5$ newly added outcomes \( 2^3 - 2^2 = 4\) is the maximum number of outcomes that will strictly need the additional \(3-2 = 1\) bit. And the remaining $1$ outcome was already encodable using $2$ bits itself.
Thus, the addition of $5$ new outcomes (even of lower probability than any present before) doesn't affect the codability of the original $3$ outcomes! Those are still encodable using $i$ bits only.
So, we see that all the additional outcomes can be encoded using only the additional bits (and the unused arrangements of \(i\) bits if needed), and the original \(i\) bits are free to be used for encoding all the original outcomes. So, we can simply say that for an outcome that has a probability p, \(\bf{i = \log_2(\frac{1}{p})}\) bits can encode it in any event.
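The counting in the 3-to-8-outcomes example above can be replayed in a few lines of Python (the variable names are mine):

```python
import math

original = 3                              # original outcomes
i = math.ceil(math.log2(original))        # 2 bits suffice for them
epsilon = 2 ** i - original               # 1 unused 2-bit codeword

added = 5
total = original + added                  # 8 outcomes now
i_new = math.ceil(math.log2(total))       # 3 bits for the enlarged event

need_extra_bit = 2 ** i_new - 2 ** i      # 4 outcomes strictly need the new bit
use_free_slot = added - need_extra_bit    # 1 outcome takes the unused 2-bit codeword

print(i, epsilon, i_new, need_extra_bit, use_free_slot)  # 2 1 3 4 1
```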
Well, this is what we were looking for - the length of the codeword that can represent an outcome based upon its probability \(p\) in any event - and it is fair to call this length the information of the outcome.
$$\boxed{\begin{alignat}{4}
i &= \log_2(\frac{1}{p}),&& \quad \text{or }\\[6pt]
i &= \ln(\frac{1}{p})&& \quad \text{(using nats) }
\end{alignat}}$$
The information \( i \) associated with any outcome of probability \( p \) is the minimum number of bits that can be used for encoding this outcome in any event having any number of outcomes (and this number is solely given by the probability of the outcome as \(\bf{i = \log_2(\frac{1}{p})}\)).
Note that this definition depends upon the unit we are using (bits, nats etc.), but given a unit, we have a well-defined information.
The amazing fact that the information of an outcome (i.e. the amount of data needed to represent this outcome) is independent of the total number of the outcomes of the event cannot be overemphasized!
\( p=0.25 \Rightarrow i = 2 \) means that we need \( 2 \) bits for encoding this outcome in any event (that can have 2, 4, or 4000 outcomes).
\( p=0.125 \Rightarrow i = 3 \) means that we need \( 3 \) bits for encoding this outcome in any event.
Thus, we need not wonder whether the logarithmic function was chosen arbitrarily to specify the information related to a probability. The information of any outcome with a given probability is simply the length of the codeword that can represent this outcome in an event, and that length is given by the \(\log\).
We can ask a question here - by this definition of information, two different outcomes that have the same probability will have the same information. How can we then use this information to uniquely identify such outcomes in an event? The answer is in the extra information contained in the units used. The information of the outcome in terms of the given units + the information in the individual units can be used to uniquely identify the outcomes. E.g., the information of any outcome of $p=1/8$ in an event is 3 (when using bits), and knowing the values of each of the three bits, we can identify the exact outcome.
When we say that the information of an outcome is 3, we mean that at least 3 multivalued units will be needed to uniquely identify this outcome. This doesn't mean that 3 of the given units, each having any value, will identify this outcome. We'll have to know the exact values of the units, and hence it's the information of the outcome + the values of the units that are needed to uniquely identify the outcome. In summary, it's more correct to name the specific units (along with the information) now, e.g. a given state has $i$ bits of information associated with it.
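For instance, here is the \(p = 1/8\) case as a short Python sketch: the information is 3 bits, and it is the values of those 3 bits that single out one particular outcome.

```python
import math

p = 1 / 8
i = int(math.log2(1 / p))   # information of any such outcome: 3 bits

# Eight equally likely outcomes, each identified by the values of 3 bits.
outcomes = [format(k, "03b") for k in range(8)]
print(i)            # 3
print(outcomes)     # ['000', '001', ..., '111']
print(outcomes[5])  # knowing the bit values, '101', pins down one specific outcome
```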
Finally, we can spare a thought for what the information content of an event as such would be, that is, the information of all its outcomes combined. In other words, what is the information content of a system that can have different states with different probabilities? First of all, the information content of the event or the system is called Entropy, and secondly, in statistics, such a system (event) that can have different states (outcomes) with different probabilities is called a probability distribution.
For example, if there are three states (outcomes) with information \(i_1\), \(i_2\) and \(i_3\), what is the entropy of this system (event)? You may think it is given by the \(i\) of the lowest-probability state, or that it is \(i_1 + i_2 + i_3\), but you need to remember that from the system's point of view these states are not equal. Since these states or outcomes are not occurring with equal probability, i.e. \( p_1 \ne p_2 \ne p_3 \) in general, we should include a state's \(i\) only in proportion to its \(p\). So, we multiply each state's information by its probability and then take the sum. That is:
$$\begin{align}
H & = \sum_i{p_i \cdot i_i} \\[6pt]
& = - \sum_i{p_i \log_2({p_i})}
\end{align}$$
As we can see, this is the weighted average of the information (or entropy) of the individual states of a system. Hence, the entropy of a system gives the minimum number of bits that can be used to describe any state of the system, on average. Of course, some states (the less probable ones) will need more bits, and vice versa. But these states occur less frequently, so the effect of using more bits is compensated to some extent. So, what we use is the weighted average. This is an amazing discovery, which can be used to ascertain the efficiency of the encoding used in the communication of messages. The Huffman coding technique builds such an encoding, respecting the entropy of the system.
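Here is a minimal Python sketch that computes the entropy as this weighted average and compares it with the average codeword length of a Huffman code built for the same distribution. The distribution and the helper names (`entropy_bits`, `huffman_code`) are mine; for a dyadic distribution like this one, the two numbers coincide exactly.

```python
import heapq
import math

def entropy_bits(probs):
    """H = sum_k p_k * log2(1/p_k): the weighted average information per state."""
    return sum(p * math.log2(1 / p) for p in probs.values() if p > 0)

def huffman_code(probs):
    """Build a Huffman code for a {symbol: probability} distribution."""
    # Heap items: (probability, tiebreaker, {symbol: codeword-so-far})
    heap = [(p, k, {sym: ""}) for k, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # two least probable groups
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, tie, merged))
        tie += 1
    return heap[0][2]

# A made-up distribution of four states.
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(w) for s, w in code.items())

print(entropy_bits(probs))  # 1.75 bits per state
print(code)                 # e.g. {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
print(avg_len)              # 1.75: matches the entropy for this dyadic distribution
```

For distributions whose probabilities are not powers of two, the Huffman average length comes out slightly above the entropy, which is exactly what the entropy bound predicts.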
Just the last bit (pun intended) - using nats, the entropy formula is stated as \(H=-\sum_i{p_i \ln({p_i})}\). Conversion from nats to bits is straightforward - \( 1 \) nat = \( \frac{1}{\ln 2} \) bits, or \( 1 \) bit = \( \ln 2 \) nats.
Ok, last last bit - the whole article can be read substituting randomness for information. That is, we could have called information randomness (of an outcome, state, event or system), and things would have made sense in the same way (or perhaps more?). The more the probability the less the randomness. The more the number of values a unit can take, the less the length of the codeword (logarithmically) that can be used to define the outcome, and hence less the randomness.
Information theory is the scientific study of the quantification, storage, and communication of information.
...
The landmark event that established the discipline of information theory and brought it to immediate worldwide attention was the publication of Claude E. Shannon's classic paper "A Mathematical Theory of Communication" in the Bell System Technical Journal in July and October 1948.
The basic premise is:
1. Every outcome of an event (or state of a system) carries some information.
2. The less probable an outcome is, the more information it carries.
The term 'surprisal' is often used to explain the second point above. E.g. if an outcome is 100% probable, then on knowing about it we have learned nothing - it's not a surprise at all - while on learning about an outcome that has a probability of 0.05%, we are quite surprised. And so we say that the latter has more information.
In his paper, Shannon chose to describe this inverse relationship by using a logarithmic function, i.e. \( \text{information} = \log_2(\frac{1}{p}) \), and used it to find the entropy of the system, which in turn is found to describe the minimum number of bits that can be used, on average, to describe any state of the system.
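As a quick check of the surprisal idea, using the two probabilities mentioned above (the function name is mine):

```python
import math

def surprisal_bits(p):
    """Information (surprisal) of an outcome of probability p, in bits."""
    return math.log2(1 / p)

print(surprisal_bits(1.0))     # 0.0 bits: a certain outcome tells us nothing new
print(surprisal_bits(0.0005))  # ~10.97 bits: a 0.05% outcome is quite a surprise
```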
One major application of the above is Huffman coding, and there are many others, including the widely used cross-entropy loss function in machine learning.
Go back to the start of the article or to Defining Information or Let's Seal the Deal section.