**The First Principle of Information Theory**
 Ashwin Reddy

# Communication

For most of us, information is the content of a message. A text message about a dinner event informs you of the time, venue, and dress code. In contrast to this view, well [understood](https://www.cs.bham.ac.uk/research/projects/cogaff/misc/austen-info.html) by novelist Jane Austen, stands mathematician-engineer Claude Shannon, the father of information theory.

> "The fundamental problem of communication is that of reproducing at one point either exactly or approximately
> a message selected at another point. Frequently the messages have _meaning_; that is they refer
> to or are correlated according to some system with certain physical or conceptual entities. These semantic
> aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual
> message is one _selected from a set_ of possible messages. The system must be designed to operate for each
> possible selection, not just the one which will actually be chosen since this is unknown at the time of design.[^paper]"

***********************************************************************************************
* +--------------------+  +-------------+        +---+          +----------+  +-------------+ *
* | information source +->| transmitter +------->|   +--------->| receiver +->| destination | *
* +--------------------+  +-------------+ signal +---+ received +----------+  +-------------+ *
*        message                                   ^     signal     message                   *
*                                                  |                                          *
*                                                  |                                          *
*                                              +---+----+                                     *
*                                              | noise  |                                     *
*                                              | source |                                     *
*                                              +--------+                                     *
***********************************************************************************************

Shannon asserts that semantic meaning is not the focus of the engineering task. As long as the transmitter and receiver agree on the symbols, they can forget the meanings of the symbols. But how can we analyze information if it is not the meaning of the message?

For communication to be useful, the received message has to do work: it must change the receiver's state in some way. Thus, more information implies a greater change in the receiver's state. If the receiver already knew the content, nothing was really received. To quantify information is to model the receiver's expectations. In short, information is surprise.

Take Wordle as an example.[^3b1b] Knowing a word has _E_ or _A_ is not surprising or informative, because lots of words have these letters. On the other hand, you would be surprised to learn the word has a _Z_, because there are far fewer words left to consider.

Information as surprise is the crux of information theory. Get a strong grip on this notion and its notation and the rest unfolds smoothly.[^notation]

# Surprise

Information theory uses probability to determine how surprising an event is. First, though, we must specify the set of symbols used by the transmitter and receiver. Call this set an **alphabet** $\mathcal{X}$; the English alphabet is one example.

As a first attempt, suppose the surprise or unexpectedness of an event with probability $p$ is $s(p)=\frac{1}{p}$. Shannon argues that the logarithm of this quantity is a better measure, because the information from independent events should add together. For an event with probability $p_1$ and another, independent event with probability $p_2$, stipulate that

\begin{equation} \label{additivity} s\left(p_1 p_2\right) = s(p_1) + s(p_2), \end{equation}

which entails that $s(p) = k \log(p)$ for some constant $k$.[^cauchy] Choosing $k=-1$ gives $s(p) = \log\frac{1}{p}$, which grows as $p$ shrinks and still obeys equation [additivity].
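As a quick numerical check, here is a minimal Python sketch (the `surprise` helper is just for illustration, not anything from Shannon's paper) showing that $\log_2\frac{1}{p}$ really does add across independent events:

```python
from math import log2

def surprise(p: float) -> float:
    """Surprise of an event with probability p, in bits (base-2 logarithm)."""
    return log2(1 / p)

p1, p2 = 0.5, 0.25                 # two independent events
print(surprise(p1), surprise(p2))  # 1.0 and 2.0 bits
print(surprise(p1 * p2))           # 3.0 bits: the surprises add, as equation [additivity] demands
```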
Surprise
:    For an event with probability $p$, the surprise is $$s(p) = \log_b\left( \frac{1}{p}\right).$$

!!! Tip
    Choose $b$ as you like; changing the base only rescales the surprise by a constant factor. Typical choices are $e$ for mathematical analysis and 2 or 10 for practical applications. With base 2, the unit of surprise is the bit, so that 1 bit is the information in a fair coin flip.

To make this concrete, look at a plot of $s(p)$ against $p$.

![Figure [surprise]: Surprise as a Quantitative Measure](https://i.imgur.com/EnP5VHz.png)

1. Information is always non-negative. $$s(p) \geqslant 0.$$
2. There is no surprise for a guaranteed event. $$s(1) = 0.$$
3. There is infinite surprise for an impossible event. $$s(0) = \infty.$$
4. $s(p)$ monotonically increases as $p$ decreases. We are more surprised as an event becomes less likely.

Surprise is a function of a single symbol, but messages draw on the whole alphabet $\mathcal{X}$. We need to construct a measure of surprise over the entire alphabet.

# Entropy

The receiver and transmitter don't necessarily share the same probability distribution for the symbols. The random variable $X$, representing a symbol from the alphabet, has a true distribution $p$, while the receiver supposes it has a distribution $q$, which may be the same or different. What's the average surprise, in bits, when a transmitter using $p$ communicates with a receiver assuming $q$?

Cross-Entropy
:    The cross-entropy $H_q$ of a random variable $X$ with true distribution $p$ and assumed distribution $q$ is given by the expected value $$H_q[X] = H\big(p(X) \parallel q(X)\big) = \mathop{\mathbb{E}}_{x \sim p(x)}\big[s\big(q(x)\big)\big]$$

To understand cross-entropy, break down the expression step by step.

|Symbol|Action|
|------|-------|
|$x \sim p(x)$|Transmitter samples a message from $p$ and sends it|
|$q(x)$|Receiver judges how likely the message was|
|$s(\cdot)$|Receiver computes surprise|
|$\mathbb{E}$|Repeat the previous steps and take the average in the limit|

The cross-entropy is high when symbols sampled from $p$ keep arriving that $q$ marked unlikely. As $q$ and $p$ become more similar, the cross-entropy decreases. Naturally, this condition, of the receiver and transmitter agreeing, is an important case. In fact, it is given the name **entropy**.

Entropy
:    \begin{equation}\label{ent-def}H[X] = H\big(p(X)\big) = H\big(p(X) \parallel p(X)\big)\end{equation}

In many treatments of information theory, you'll see entropy presented as

\begin{equation} \label{classicdef} H[X] = -\sum_{x \in \mathcal{X}} p(x) \log p(x), \end{equation}

which shows how to compute it more easily.

!!! WARNING
    Some authors define entropy with an integral $H[X] = -\int_{\mathcal{X}} f(x) \log f(x) \,\mathrm{d}x$. This is called differential entropy, and it shares some, but not all, properties of the discrete entropy we're discussing now.

Eqn. [ent-def] and eqn. [classicdef] are equivalent (expand the expectation in eqn. [ent-def] and write out $s(p(x)) = -\log p(x)$), but I've delayed showing the latter because it buries the notion of surprise. The story goes that John von Neumann suggested Shannon call the quantity entropy because of its resemblance to a similar formula in thermodynamics. Shannon entropy attains its maximum of $\log |\mathcal{X}|$ for a uniform distribution over $\mathcal{X}$, which represents maximum disorder.

The simplest entropy calculation we can do is for a random variable $X \sim \mathsf{Bernoulli}(p)$, where $p$ is the probability of heads.
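Here is that calculation as a minimal Python sketch (the `bernoulli_entropy` helper is my own name for a direct evaluation of eqn. [classicdef]); the plot below traces the same curve over all values of $p$.

```python
from math import log2

def bernoulli_entropy(p: float) -> float:
    """H[X] in bits for X ~ Bernoulli(p), via H[X] = -sum_x p(x) log2 p(x)."""
    if p in (0.0, 1.0):
        return 0.0                                   # a certain outcome carries no surprise
    return -(p * log2(p) + (1 - p) * log2(1 - p))

for p in (0.5, 0.7, 0.9, 0.99):
    print(p, round(bernoulli_entropy(p), 3))
# 0.5  1.0     (fair coin: maximum uncertainty, exactly 1 bit)
# 0.7  0.881
# 0.9  0.469   (lopsided coin: much easier to predict)
# 0.99 0.081
```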
![Figure [bernoulli]: Entropy of a coin flip as a function of $p$](https://i.imgur.com/gsNqGnZ.jpg)

When $p=\frac{1}{2}$, we are the most surprised by the coin flip on average because both outcomes are equally likely. It takes 1 bit, as expected, to describe it. As one side becomes more and more likely, we need fewer bits to describe the flip.

It is often useful to refer to the difference between the cross-entropy and the entropy, which goes by two names: Kullback–Leibler (KL) divergence and relative entropy. We'll use the latter for thematic consistency.

Relative Entropy
:    The difference between the cross-entropy and entropy of a random variable $$D_{\rm KL}(p(X) \parallel q(X)) = H(p(X) \parallel q(X)) - H(p(X))$$

We have defined three terms in quick succession, so let us summarize. If messages $X$ are sampled from $p$ but expected to come from $q$, then on average the receiver has surprise $H_q[X]$. The lowest this average surprise could possibly be is $H[X]$, which occurs when $p=q$. In any other situation, there is some suboptimality coming from the difference between $p$ and $q$, which is captured by the relative entropy $D_{\rm KL}(p(X) \parallel q(X))$.

\begin{equation} \label{bits} \underbrace{H_q[X]}_{\text{Actual}} = \underbrace{H[X]}_{\text{Theoretical}} + \underbrace{D_{\rm KL}(p(X) \parallel q(X))}_{\text{Difference}} \end{equation}

We won't prove it, but it should make intuitive sense that relative entropy is never negative:

$$ D_{\rm KL}(p(X) \parallel q(X)) \geqslant 0 \tag{Gibbs' Inequality} $$

# Coding

We've now built up enough theory to state one of Shannon's key results.

Source coding theorem
:    A word made of $n$ i.i.d. symbols drawn from $X$ cannot be losslessly compressed below $n H[X]$ bits on average, and this bound can be approached as $n$ grows.

We will not discuss how one performs this compression, except to point the reader toward the field of **coding theory** and the technique of Huffman encoding, which you should now have the background to understand.

# Channels

We have yet to address how the transmitter sends symbols and the receiver accepts them. In information theory lingo, the medium of communication is called the **channel**. In the real world, channels may corrupt the symbols that the transmitter sends, leaving the receiver to reconstruct or guess the intended message. Two classic examples of channels are the **binary symmetric channel (BSC)** and the **binary erasure channel (BEC)**.

In the binary symmetric channel, you send one bit, and there is a probability $p_e$ that the channel flips it to the wrong value. Here is a schematic diagram.

******************
*  X           Y *
*     1 - p_e    *
*  0 *-------* 0 *
*     \     /    *
* p_e  \   /     *
*       \ /      *
*        +       *
*       / \      *
* p_e  /   \     *
*     /     \    *
*  1 *-------* 1 *
*     1 - p_e    *
******************

Here is the probability matrix.

| X \ Y |    0    |    1    |
|-------|---------|---------|
|   0   | $1-p_e$ |  $p_e$  |
|   1   |  $p_e$  | $1-p_e$ |

In the binary erasure channel, there is a probability $p_e$ that the bit is erased: instead of the intended bit, the receiver gets an erasure symbol, which we denote with a question mark (_?_). Again, here is a schematic diagram.

***********************************
*  X                          Y   *
*       1 - p(erasure)            *
*  1 +----------------------> 1   *
*     \                           *
*      \      p(erasure)          *
*       +-------------------> ?   *
*      /                          *
*     /                           *
*  0 +----------------------> 0   *
*       1 - p(erasure)            *
***********************************

Here is the probability matrix.

| X \ Y |    0    |    1    |   ?   |
|-------|---------|---------|-------|
|   0   | $1-p_e$ |    0    | $p_e$ |
|   1   |    0    | $1-p_e$ | $p_e$ |

Let us assume the channel is fixed. We can't adjust the channel's behavior, but we can be clever about which messages the transmitter sends. Let's call the message sent $X$ and the message received $Y$.
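To make the setup concrete, here is a minimal Python sketch of both channels (the function names `bsc` and `bec` are my own; nothing about this simulation is prescribed by the theory):

```python
import random

def bsc(x: int, p_e: float) -> int:
    """Binary symmetric channel: flip the bit with probability p_e."""
    return x ^ 1 if random.random() < p_e else x

def bec(x: int, p_e: float):
    """Binary erasure channel: replace the bit with '?' with probability p_e."""
    return "?" if random.random() < p_e else x

random.seed(0)
xs = [random.randint(0, 1) for _ in range(10_000)]    # X: bits chosen by the transmitter
ys = [bsc(x, p_e=0.2) for x in xs]                    # Y: what the receiver actually sees
print(sum(x != y for x, y in zip(xs, ys)) / len(xs))  # empirical flip rate, roughly 0.2
zs = [bec(x, p_e=0.2) for x in xs]                    # BEC outputs: a mix of bits and '?'
print(zs.count("?") / len(zs))                        # empirical erasure rate, also roughly 0.2
```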
We'll introduce a new concept called **mutual information**, denoted $I[X; Y]$, which measures how much information knowing one of the variables gives us about the other. Then, we'll try to maximize this quantity so that the received $Y$ is as tightly coupled to $X$ as possible. It will be useful to look at a diagram, although we haven't yet defined all our terms.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Entropy-mutual-information-relative-entropy-relation-diagram.svg/2560px-Entropy-mutual-information-relative-entropy-relation-diagram.svg.png)

The goal is to make the overlap between $X$ and $Y$ as big as possible. But to do this, we need to define the total area of the circles, $H[X,Y]$, and the portions of the circles outside the overlap, $H[X \mid Y]$ and $H[Y \mid X]$.

The first term, $H[X, Y]$, is called joint entropy because it depends on the joint distribution, and it measures how many bits we need to describe both variables.

Joint entropy
:    $$H[X, Y] = \mathop{\mathbb{E}}_{p(x,y)}\left[s\big(p(x,y)\big)\right]$$

The terms $H[X \mid Y]$ and $H[Y \mid X]$ are conditional entropies because they depend on the conditional distribution, and they tell us how many bits we need to describe one of the variables if we know the other.

Conditional entropy
:    $$H[X \mid Y] = \mathop{\mathbb{E}}_{p(x,y)}\left[s\big(p(x \mid y)\big)\right]$$

Use the diagram to confirm for yourself that

$$ H[Y] + H[X \mid Y] = H[X, Y]. $$

I read this as saying the following: if I have a representation for $Y$ and a way to encode $X$ when I know $Y$, then I can combine them to encode both $X$ and $Y$. Now we come to mutual information.

Mutual information
:    Mutual information has many equivalent definitions.
\begin{align*}I[X; Y] &= H[Y] - H[Y \mid X] \\ &= H[X] - H[X \mid Y] \\ &= H[X, Y] - H[X \mid Y] - H[Y \mid X] \\ &= H[X] + H[Y] - H[X, Y] \\ &= D_{\rm KL}\left(p(X, Y) \parallel p(X)p(Y) \right)\end{align*}

All but the last equation can be easily verified with the diagram. The interpretations I like best are the first and the last.

- $I[X; Y] = H[Y] - H[Y \mid X]$. Mutual information is the difference between the entropy of $Y$ knowing nothing and the entropy of $Y$ having seen the input $X$.
- $I[X; Y] = D_{\rm KL}\left(p(X, Y) \parallel p(X)p(Y) \right)$. Mutual information is how inefficient it would be to treat the input $X$ and the output $Y$ as independent.

Notice that $I[X; Y] = I[Y; X]$: the connection between $X$ and $Y$ is symmetric.

We are close to wrapping everything up neatly. Recall the channels we introduced earlier: the BSC and the BEC. We can't affect the channel, which converts our sequence $X^n$ into a sequence $Y^n$. The best we can do is be clever about $X$. Therefore, Shannon defines the **channel capacity** as the highest rate of reliable information flow through the channel.

Channel capacity
:    $$C = \max_{p(x)} I[X; Y]$$

In the best case, the channel is perfect, so $Y=X$ and $I[X;X] = H[X]$, which a uniform input maximizes at $\log |\mathcal{X}|$: 1 bit per use for a perfect binary channel. The BSC, by contrast, has channel capacity $1 - H_b(p_e)$, where $H_b$ is the entropy of a $\mathsf{Bernoulli}$ variable (the curve in Figure [bernoulli]), and the BEC has capacity $1 - p_e$. The real significance of the channel capacity is that Shannon proves it is a speed limit for communication.

Noisy-channel coding theorem
:    Suppose you are willing to tolerate a bit-error rate of $p_b$ at the receiver. Then the highest rate at which you can send information through the channel is $R = \frac{C}{1 - H[Z]}$ for $Z \sim \mathsf{Bernoulli}(p_b)$. In particular, with $p_b = 0$ we have $R = C$: any rate below the capacity can be achieved with an arbitrarily small probability of error.
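As a closing numerical check, here is a minimal Python sketch (the helpers below are my own, written against the definitions above, not code from Shannon) that sweeps input distributions for a BSC with $p_e = 0.1$ and confirms that the maximum of $I[X;Y]$ sits at the uniform input with value $1 - H_b(p_e)$:

```python
from math import log2

def h_b(p: float) -> float:
    """Binary entropy H_b(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

def bsc_mutual_information(px1: float, p_e: float) -> float:
    """I[X;Y] = H[Y] - H[Y|X] for a BSC with P(X=1) = px1 and flip probability p_e."""
    py1 = px1 * (1 - p_e) + (1 - px1) * p_e  # P(Y=1) after the channel
    return h_b(py1) - h_b(p_e)               # for a BSC, H[Y|X] = H_b(p_e) whatever the input

p_e = 0.1
best = max(bsc_mutual_information(k / 1000, p_e) for k in range(1001))
print(round(best, 3), round(1 - h_b(p_e), 3))  # both are 0.531 bits: the capacity 1 - H_b(0.1)
```

A similar sweep with the BEC's transition matrix recovers its capacity of $1 - p_e$.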
This concludes our presentation of the basic results that Shannon first showed, using the core metaphor that information is surprisal.

# Notes

[^paper]: _A Mathematical Theory of Communication_ (1948). [Source](https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf)

[^3b1b]: 3Blue1Brown did a nice [analysis](https://www.youtube.com/watch?v=v68zYyaEmEA) from an information-theoretic standpoint.

[^notation]: [A Practical & Unified Notation for Information-Theoretic Quantities in ML](https://arxiv.org/abs/2106.12062).

[^cauchy]: The equation goes by the name of the _Cauchy logarithmic functional equation_.