In information theory, Gibbs' inequality is a statement about the information entropy of a discrete probability distribution. Several other bounds on the entropy of probability distributions are derived from Gibbs' inequality, including Fano's inequality. It was first presented by J. Willard Gibbs in the 19th century.
Gibbs' inequality
Suppose that {\displaystyle P=\{p_{1},\ldots ,p_{n}\}} and {\displaystyle Q=\{q_{1},\ldots ,q_{n}\}} are discrete probability distributions. Then
{\displaystyle -\sum _{i=1}^{n}p_{i}\log p_{i}\leq -\sum _{i=1}^{n}p_{i}\log q_{i}}
with equality if and only if {\displaystyle p_{i}=q_{i}} for {\displaystyle i=1,\ldots ,n}.[1]: 68 Put in words, the information entropy of a distribution {\displaystyle P} is less than or equal to its cross entropy with any other distribution {\displaystyle Q}.
The difference between the two quantities is the Kullback–Leibler divergence or relative entropy, so the inequality can also be written:[2]: 34
{\displaystyle D_{\mathrm {KL} }(P\|Q)\equiv \sum _{i=1}^{n}p_{i}\log {\frac {p_{i}}{q_{i}}}\geq 0.}
Note that the use of base-2 logarithms is optional, and allows one to refer to the quantity on each side of the inequality as an "average surprisal" measured in bits.
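Both forms of the inequality are easy to check numerically. The following is a minimal sketch in Python, with P and Q chosen arbitrarily for illustration: the entropy of P never exceeds its cross entropy with Q, and the gap between the two is exactly the Kullback–Leibler divergence, here measured in bits.

    import math

    def entropy(p, base=2):
        # Shannon entropy H(P) = -sum p_i log p_i, with 0 log 0 taken as 0.
        return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

    def cross_entropy(p, q, base=2):
        # Cross entropy H(P, Q) = -sum p_i log q_i.
        return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

    def kl_divergence(p, q, base=2):
        # Kullback-Leibler divergence D_KL(P || Q) = sum p_i log(p_i / q_i).
        return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

    P = [0.5, 0.25, 0.25]   # arbitrary example distributions
    Q = [0.4, 0.4, 0.2]

    print(entropy(P), cross_entropy(P, Q))   # H(P) <= H(P, Q)
    print(kl_divergence(P, Q))               # >= 0
    print(math.isclose(cross_entropy(P, Q) - entropy(P), kl_divergence(P, Q)))  # gap equals D_KL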
Proof
For simplicity, we prove the statement using the natural logarithm, denoted by ln, since
{\displaystyle \log _{b}a={\frac {\ln a}{\ln b}},}
so the particular logarithm base b > 1 that we choose only scales the relationship by the factor 1/ln b.
Let {\displaystyle I} denote the set of all {\displaystyle i} for which {\displaystyle p_{i}} is non-zero. Then, since {\displaystyle \ln x\leq x-1} for all x > 0, with equality if and only if x = 1, we have:
{\displaystyle -\sum _{i\in I}p_{i}\ln {\frac {q_{i}}{p_{i}}}\geq -\sum _{i\in I}p_{i}\left({\frac {q_{i}}{p_{i}}}-1\right)=-\sum _{i\in I}q_{i}+\sum _{i\in I}p_{i}=-\sum _{i\in I}q_{i}+1\geq 0}
The last inequality follows because the {\displaystyle p_{i}} and {\displaystyle q_{i}} belong to probability distributions: the {\displaystyle p_{i}} with {\displaystyle i\in I} sum to 1, whereas some non-zero {\displaystyle q_{i}} may have been excluded, since the indices in {\displaystyle I} are chosen so that the {\displaystyle p_{i}} are non-zero. Therefore, the sum of the {\displaystyle q_{i}} over {\displaystyle I} may be less than 1.
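The two ingredients of this step can be verified numerically. The sketch below (arbitrary example distributions, one of which assigns zero probability to an outcome) checks the elementary bound ln x ≤ x − 1 and shows how excluding indices outside I can make the q_i sum to less than 1.

    import math

    # The elementary bound ln x <= x - 1, checked at a few sample points.
    for x in (0.1, 0.5, 1.0, 2.0, 10.0):
        assert math.log(x) <= x - 1 + 1e-12

    # Restrict to the index set I where p_i > 0; the q_i over I may sum to less than 1.
    P = [0.6, 0.4, 0.0]   # arbitrary example with a zero-probability outcome
    Q = [0.3, 0.3, 0.4]
    I = [i for i, pi in enumerate(P) if pi > 0]

    lhs = -sum(P[i] * math.log(Q[i] / P[i]) for i in I)
    rhs = -sum(Q[i] for i in I) + 1   # = 0.4 here, since Q[2] is excluded from the sum
    print(lhs, rhs)                   # lhs >= rhs >= 0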
So far, over the index set {\displaystyle I}, we have:
{\displaystyle -\sum _{i\in I}p_{i}\ln {\frac {q_{i}}{p_{i}}}\geq 0},
or equivalently
{\displaystyle -\sum _{i\in I}p_{i}\ln q_{i}\geq -\sum _{i\in I}p_{i}\ln p_{i}}.
Both sums can be extended to all {\displaystyle i=1,\ldots ,n}, i.e. including {\displaystyle p_{i}=0}, by recalling that the expression {\displaystyle p\ln p} tends to 0 as {\displaystyle p} tends to 0, and {\displaystyle (-\ln q)} tends to {\displaystyle \infty } as {\displaystyle q} tends to 0. We arrive at
{\displaystyle -\sum _{i=1}^{n}p_{i}\ln q_{i}\geq -\sum _{i=1}^{n}p_{i}\ln p_{i}}
For equality to hold, we require
{\displaystyle {\frac {q_{i}}{p_{i}}}=1} for all {\displaystyle i\in I}, so that the equality {\displaystyle \ln {\frac {q_{i}}{p_{i}}}={\frac {q_{i}}{p_{i}}}-1} holds, and {\displaystyle \sum _{i\in I}q_{i}=1}, which means {\displaystyle q_{i}=0} if {\displaystyle i\notin I}, that is, {\displaystyle q_{i}=0} if {\displaystyle p_{i}=0}. This can happen if and only if {\displaystyle p_{i}=q_{i}} for {\displaystyle i=1,\ldots ,n}.
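As a quick illustration of the equality condition (a sketch using an arbitrary P): the divergence is exactly zero when Q equals P, and strictly positive for any other Q.

    import math

    def kl_nats(p, q):
        # D_KL(P || Q) in nats, dropping the 0 ln 0 terms.
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    P = [0.2, 0.5, 0.3]                   # arbitrary example distribution
    print(kl_nats(P, P))                  # 0.0: equality exactly when Q = P
    print(kl_nats(P, [0.25, 0.45, 0.3]))  # strictly positive once Q differs from P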
Alternative proofs
The result can alternatively be proved using Jensen's inequality, the log sum inequality, or the fact that the Kullback–Leibler divergence is a form of Bregman divergence.
Proof by Jensen's inequality
Because log is a concave function, we have:
{\displaystyle \sum _{i}p_{i}\log {\frac {q_{i}}{p_{i}}}\leq \log \sum _{i}p_{i}{\frac {q_{i}}{p_{i}}}=\log \sum _{i}q_{i}=0}
where the first inequality is due to Jensen's inequality, and the last equality follows because {\displaystyle q} is a probability distribution.
Furthermore, since {\displaystyle \log } is strictly concave, by the equality condition of Jensen's inequality we get equality when
{\displaystyle {\frac {q_{1}}{p_{1}}}={\frac {q_{2}}{p_{2}}}=\cdots ={\frac {q_{n}}{p_{n}}}}
and
{\displaystyle \sum _{i}q_{i}=1}.
Suppose that this common ratio is {\displaystyle \sigma }; then
{\displaystyle 1=\sum _{i}q_{i}=\sum _{i}\sigma p_{i}=\sigma }
where we use the fact that {\displaystyle p,q} are probability distributions. Therefore, equality holds when {\displaystyle p=q}.
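The Jensen step can be made concrete. The sketch below (arbitrary P and Q) compares the p-weighted mean of the log-ratios with the log of the p-weighted mean of the ratios, which collapses to log ∑ q_i = 0.

    import math

    # Jensen's inequality for the concave log: E[log X] <= log E[X],
    # applied to the ratios X_i = q_i / p_i weighted by p_i.
    P = [0.7, 0.2, 0.1]   # arbitrary example distributions
    Q = [0.5, 0.3, 0.2]

    ratios = [qi / pi for pi, qi in zip(P, Q)]
    lhs = sum(pi * math.log(r) for pi, r in zip(P, ratios))  # E[log(Q/P)]
    rhs = math.log(sum(pi * r for pi, r in zip(P, ratios)))  # log E[Q/P] = log(sum q_i) = 0
    print(lhs, rhs)                                          # lhs <= rhs == 0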
Proof by Bregman divergence
Alternatively, the result can be proved by noting that
{\displaystyle q-p-p\ln {\frac {q}{p}}\geq 0}
for all {\displaystyle p,q>0}, with equality if and only if {\displaystyle p=q}. Summing over the states, we have
{\displaystyle \sum _{i}\left(q_{i}-p_{i}-p_{i}\ln {\frac {q_{i}}{p_{i}}}\right)\geq 0}
with equality if and only if {\displaystyle p=q}. Since {\displaystyle \sum _{i}q_{i}=\sum _{i}p_{i}=1}, the first two sums cancel and the inequality reduces to {\displaystyle D_{\mathrm {KL} }(P\|Q)\geq 0}.
This is because the KL divergence is the Bregman divergence generated by the function {\displaystyle t\mapsto t\ln t}.
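A short sketch (arbitrary positive pairs and example distributions) checks the pointwise bound and confirms that summing the coordinate-wise Bregman divergence of t ↦ t ln t over probability vectors reproduces the KL divergence.

    import math

    def f(t):
        return t * math.log(t)        # the generator t ln t

    def f_prime(t):
        return math.log(t) + 1.0

    def bregman(p, q):
        # Coordinate-wise Bregman divergence of f, summed over the states.
        return sum(f(pi) - f(qi) - f_prime(qi) * (pi - qi) for pi, qi in zip(p, q))

    def kl_nats(p, q):
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    # Pointwise bound q - p - p ln(q/p) >= 0 for a few positive pairs.
    for p_, q_ in [(0.3, 0.7), (0.5, 0.5), (0.9, 0.1)]:
        assert q_ - p_ - p_ * math.log(q_ / p_) >= -1e-12

    # For probability vectors, the Bregman divergence of t ln t equals the KL divergence.
    P = [0.6, 0.3, 0.1]   # arbitrary example distributions
    Q = [0.4, 0.4, 0.2]
    print(bregman(P, Q), kl_nats(P, Q))   # the two values agree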
Corollary
The entropy of {\displaystyle P} is bounded by:[1]: 68
{\displaystyle H(p_{1},\ldots ,p_{n})\leq \log n.}
This follows by applying Gibbs' inequality with {\displaystyle q_{i}=1/n} for all {\displaystyle i}; the bound is achieved when {\displaystyle P} is the uniform distribution, i.e. {\displaystyle p_{i}=1/n} for all {\displaystyle i}.
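A quick numerical illustration (with an arbitrary non-uniform P over n = 4 outcomes): the entropy falls short of log n, and the uniform distribution attains the bound.

    import math

    def entropy_nats(p):
        return -sum(pi * math.log(pi) for pi in p if pi > 0)

    n = 4
    uniform = [1.0 / n] * n
    skewed = [0.7, 0.1, 0.1, 0.1]   # arbitrary non-uniform distribution

    print(entropy_nats(skewed), math.log(n))                  # H(P) < log n
    print(math.isclose(entropy_nats(uniform), math.log(n)))   # equality for the uniform distribution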
References
[1] Pierre Bremaud (6 December 2012). An Introduction to Probabilistic Modeling. Springer Science & Business Media. ISBN 978-1-4612-1046-7.
[2] David J. C. MacKay (25 September 2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press. ISBN 978-0-521-64298-9.