6.441-2 Jensen's inequality, Data processing theorem, Fanos's inequ
LECTURE 2Convexity and related notionsLast time:Goals and mechanics of the class • notation • entropy: definitions and properties • mutual information: definitions and prop₭ erties Lecture outlineConvexity and concavity • Jensen’s inequality • Positivity of mutual information • Data processing theorem • Fano’s inequality • Reading: Scts. 2.6-2.8, 2.11. ConvexityDefinition: a function f(x) is convex over (a, b) iff ∀x1,x2 ∈ (a, b) and 0 ≤ λ ≤ 1 f (λx1 + (1 − λ)x2) ≤ λf (x1) + (1 − λ)f(x2) and is strictly convex iff equality holds iff λ =0 or λ = 1. f is concave iff −f is convex. Convenient test: if f has a second derivatiiv that is non-negative (positive) everywhher , then f is convex (strictly convex)Jensen’s inequalityif f is a convex function and X is a r.v., then EX[f(X)] ≥ f(EX[X]) if f is strictly convex, then EX[f(X)] = f (EX[X]) X = E[X].⇒ Concavity of entropy Let f(x)= −x log(x) then 1 f�(x)= −x log(e) x − log(x) = − log(x) − log(e) and 1 f”(x)= − log(e) < 0x for x> 0. f(PX(x))H(X)= �x∈|X |thus the entropy of X is concave in thevalue of PX(x) for every x.Thus, consider two random variables, X1and X2 with common X . Then the randdo variable X defined over the samesuch that PX(x)= λPX1(x) + (1 − λ)PX2(x)satisfies:H(X) ≥ λH(X1) + (1 − λ)H(X2).X Maximum entropyConsider any random variable X11 on X . For simplicity, consider X = {1,..., |X |} (we just want to use the elements of X as indicces) Now consider X21 a random variabbl such PX1(x)= PX1(shift(x)) where 21 shift denotes the cyclic shift on (1,..., X ). Clearly H(X11) = H(X21). Moreover, consiide X12 defined over the same X such that PX2(x)= λPX1(x) + (1 − λ)PX1(x) then 11 2 H(X12) ≥ H(X11). We can show recursively with the obvious extension of notation that H(X1 n) ≥ H(X1 m) ∀n>m ≥ 1. Now limn→∞PX1 n(x)= |X 1 |∀x ∈X . Hence, the uniform distribution maximizes entropy and H(X) ≤ log(|X |). Conditioning reduces entropyH(YX)= EZ[H(YX = Z)] = − �|PX(x) � PY ||X(y|x)log2[PY |X(y|x)] x∈X y∈Y PY (y)= �x∈X PX(x)PY |X(y|x) hence by concavvit H(Y |X) ≤ H(Y ). Hence I(X; Y )= H(Y ) − H(Y |X) ≥ 0. Independence bound: H(X1, . . . , Xn) n= � H(Xi|X1, . . . , Xi−1) i=1 n� H(Xi)≤ i=1Question: H(Y |X = x) ≤ H(Y )? Mutual information and transitionprobabilityLet us call PY X(yx) the transition proba||bility from X to Y . Consider a r.v. Z that takes values 0 and 1 with probability θ and 1 − θ and s.t. PY |X,Z(y|x, 0) = PY�|X(y|x) PY |X,Z(y|x, 1) = PY��|X(y|x) and Z is independent from X I(X;(Y, Z)) = I(X; Z)+ I(X; YZ)|and I(X;(Y, Z)) = I(X; Y )+ I(X; ZY )|hence I(X; Y |Z) ≥ I(X; Y ) so θI(X; Y |Z = 0)+(1−θ)I(X; Y |Z = 1) ≥ I(X; Y ) For a fixed input assignment, I(X; Y ) is convex in the transition probabilities X Mutual information and inputprobabilityConsider a r.v. Z such that PXZ(x0) = ||P �(x), PXZ(x1) = P ��(x), Z takes values 0 ||and 1 with probability θ and 1 − θ and Zand Y are conditionally independent, givenI(Y ; ZX)=0|and I(Y ;(Z, X)) = I(Y ; Z)+ I(Y ; XZ)|= I(Y ; X)+ I(Y ; ZX)|so I(X; Y |Z) ≤ I(X; Y ).Mutual information is a concave function of the input probabilities. Exercise: jamming game in which we try to maximize mutual information and jammme attempts to reduce it. What will the policies be? Markov chainMarkov chain: random variables X, Y, Z form a Markov chain in that order XY Z if the joint PMF →→can be written asPX,Y,Z(x, y, z)= PX(x)PY X(yx)PZY (zy).||||Markov chainConsequences: XY Z iff X and Z are conditionally • →→independent given Y PX,ZY (x, z|y)|PX,Y,Z(x, y, z) = PY (y) PX,Y (x, y) = PZY (zy)PY (y) ||= PXY (xy)PZY (zy)||||so Markov implies conditional independeenc and vice versa XY ZZY X (see above • →→⇔→→LHS and last RHS) Data Processing Theorem If XY Z then I(X; Y ) ≥ I(X; Z)→→ I(X; Y, Z)= I(X; Z)+ I(X; YZ)|I(X; Y, Z)= I(X; Y )+ I(X; ZY )|X and Z are conditionally independent given Y , so I(X; ZY )=0 |hence I(X; Z)+ I(X; YZ)= I(X; Y ) so|I(X; Y ) ≥ I(X; Z) with equality iff I(X; Y |Z)= 0 note: XZY I(X; YZ)=0 Y →→⇔ |depends on X only through Z Consequence: you cannot ”undo” degradattio Consequence: Second Law ofThermodynamicsThe conditional entropy H(XnX0) is nondecreasing as n increases for a stationary Markov process X0, . . . , Xn Look at the Markov chain X0 → Xn−1 →Xn DPT saysI(X0; Xn−1) ≥ I(X0; Xn)H(Xn−1)−H(Xn−1|X0) ≥ H(Xn)−H(Xn|X0) so H(Xn−1|X0) ≤ H(Xn|X0) Note: we still have that H(Xn|X0) ≤ H(Xn). � � Fano’s lemmaSuppose we have r.v.s X and Y , Fano’s lemma bounds the error we expect when estimating X from Y We generate an estimator of X that is �=X g(Y ). Probability of error Pe = Pr( �= X)X Indicator function for error E which is 1 when X = X and 0 otherwise. Thus, Pe = P (E = 0)Fano’s lemma:H(E)+ Pe log(|X| − 1) ≥ H(X|Y )Proof of Fano’s lemmaH(E,XY )|= H(XY )+ H(EX, Y )||= H(XY )|H(E,XY )|= H(EY )+ H(XE,Y )||H(E|Y ) ≤ H(E) H(XE,Y )|= PeH(X|E = O, Y ) + (1 − Pe)H(X|E =1,Y ) = PeH(XE = O, Y )|≤ PeH(X|E = O) ≤ Pe log(|X| − 1) MIT OpenCourseWarehttp://ocw.mit.edu 6.441 Information Theory Spring 2010 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
Description
The outline of this lecture notes are 1.Convexity and concavity,2.
Jensen’s inequality, 3. Positivity of mutual information, 4. Data processing theorem and 4.Fano’s inequality. Various sub topics discussed under this section are Maximum entropy, Conditioning reduces entropy,Mutual information and transition probability,Mutual information and input
probability, Markov chain,Data Processing Theorem,Consequence: Second Law of Thermodynamics,Fano’s lemma and Proof of Fano’s lemma
Instructors: Prof. Muriel Médard ,MIT Course Number:6.441 Level: Graduate,6.441-2 Jensen's inequality, Data processing theorem, Fanos's inequality, Information Theory,Electrical Engineering and Computer Science, Engineering, Massachusetts Institute of Technology: MIT Open Course Ware, http://ocw.mit.edu (11-09-2011).License: Creative Commons BY-NC-SA: http://ocw.mit.edu/terms/#cc".
Presentation Transcript
Your Facebook Friends on WizIQ