VI. Convergence of Random Variables

Class Notes for Math 6601

Pedro Juan Rodríguez Esquerdo
Department of Mathematics
PO Box 23355
University of Puerto Rico
Río Piedras, Puerto Rico 00931
prodrig@upracd.upr.clu.edu

Table of Contents

VI. Convergence of Random Variables
A. Weak Convergence
B. Convergence in Probability
C. Almost Sure Convergence
D. Limits of Moment Generating Functions
D. The Central Limit Theorem

Cautionary note: This is preliminary work still in progress. As such it may contain errors and must therefore be read with appropriate caution. The errors may be typographical, substantive, logical, or otherwise. The only person responsible for them is the author, who will welcome any and all comments about this material.

VI. Convergence of Random Variables [1]

The study of the convergence of random variables is central in probability theory and in statistics. In practice, one is interested in the conditions under which the distribution of a sequence of random variables can be approximated by another distribution, or under which the usual results on the random variables are preserved by the limiting distribution. Three main modes of convergence are studied here. The first is weak convergence, also called convergence in distribution or convergence in law. This mode states the conditions under which a sequence of distribution functions converges to a distribution function.
Another mode is convergence in probability, which studies the limiting behavior of the probability that a sequence of random variables deviates by more than a given quantity from a limiting random variable. Finally, convergence with probability one, or almost sure convergence, is studied. This mode studies the conditions under which a sequence of random variables converges to a limiting random variable except possibly on a set with probability zero. The relations among the modes of convergence are also studied.

[1] See An Introduction to Probability Theory and Mathematical Statistics, V. K. Rohatgi, Chapter 6. Most of these notes are an adaptation of Rohatgi's presentation.

A. Weak Convergence

Consider a sequence of distribution functions $\{F_n\}$ defined by $F_n(x) = 0$ if $x < n$ and $F_n(x) = 1$ if $x \ge n$. Then $F_n(x) \to 0$ as $n \to \infty$ for every $x$, and the limit, which is identically 0, is not a distribution function. This shows that a sequence of distribution functions can converge to a function which is not a distribution function. On the other hand, consider another sequence of distribution functions, defined as follows.

Example VI.1
Let $X_1, X_2, \ldots, X_n$ be independent identically distributed random variables with common probability density function

$$f(x) = \begin{cases} \dfrac{1}{\theta}, & 0 < x < \theta, \\ 0, & \text{otherwise}, \end{cases} \qquad 0 < \theta < \infty.$$

Consider the maximum $M_n = \max(X_1, X_2, \ldots, X_n)$ of these random variables. Its probability density function is

$$g_n(x) = \begin{cases} \dfrac{n x^{n-1}}{\theta^{n}}, & 0 < x < \theta, \\ 0, & \text{otherwise}, \end{cases}$$

and its distribution function is

$$G_n(x) = \begin{cases} 0, & x < 0, \\ \left(\dfrac{x}{\theta}\right)^{n}, & 0 \le x < \theta, \\ 1, & x \ge \theta. \end{cases}$$

As $n \to \infty$,

$$G_n(x) \to F(x) = \begin{cases} 0, & x < \theta, \\ 1, & x \ge \theta, \end{cases}$$

that is, the sequence of distribution functions $G_n$ converges to a distribution function $F$ at every point where $F$ is continuous.■

These two examples suggest a definition of the convergence of distribution functions:
Definition VI-1
Let $\{F_n\}$ be a sequence of distribution functions. If there exists a distribution function $F$ such that, as $n \to \infty$, $F_n(x) \to F(x)$ at every point $x$ at which $F$ is continuous, we say that $F_n$ converges weakly (or in law) to $F$ and write $F_n \xrightarrow{w} F$. If $\{X_n\}$ is a sequence of random variables and $\{F_n\}$ is the corresponding sequence of distribution functions, we say that $X_n$ converges in distribution (or in law) to $X$ if there exists a random variable $X$ with distribution function $F$ such that $F_n \xrightarrow{w} F$, and we write $X_n \xrightarrow{L} X$.

Weak convergence is, as its name suggests, rather weak, as the following examples show.

Example VI.2 Convergence in distribution does not imply convergence of moments
Let $\{F_n\}$ be a sequence of distribution functions defined by

$$F_n(x) = \begin{cases} 0, & x < 0, \\ 1 - \dfrac{1}{n}, & 0 \le x < n, \\ 1, & n \le x. \end{cases}$$

It is easy to see that $F_n \xrightarrow{w} F$, where

$$F(x) = \begin{cases} 0, & x < 0, \\ 1, & x \ge 0. \end{cases}$$

The function $F_n$ is the distribution function of the random variable $X_n$ with probability function

$$P\{X_n = 0\} = 1 - \frac{1}{n}, \qquad P\{X_n = n\} = \frac{1}{n}.$$

On the other hand, $F$ is the distribution function of a random variable $X$ which is degenerate at 0. The $k$th moment of $X_n$ is $E(X_n^k) = n^{k-1}$, while the $k$th moment of $X$ is $E(X^k) = 0$, so that $E(X_n^k)$ does not converge to $E(X^k)$ for any positive integer $k$.■

Example VI.3 Convergence in distribution does not imply convergence of the probability functions or probability density functions
Let $\{X_n\}$ be a sequence of random variables with probability functions

$$f_n(x) = P\{X_n = x\} = \begin{cases} 1, & \text{if } x = 2 + \dfrac{1}{n}, \\ 0, & \text{otherwise}. \end{cases}$$

None of the probability functions assigns any probability to the point $x = 2$. For each fixed $x$, $f_n(x) = 1$ for at most one value of $n$, so $f_n(x) \to f(x)$ as $n \to \infty$, where $f(x) = 0$ for all real numbers $x$. The limit $f$ is not a probability function. But the sequence of distribution functions $\{F_n\}$ of the random variables $X_n$ converges to the function

$$F(x) = \begin{cases} 0, & x < 2, \\ 1, & x \ge 2, \end{cases}$$

at all continuity points of $F$.
Since $F$ is a distribution function, $F_n \xrightarrow{w} F$.■

Theorem VI-1 Discrete case
Let $\{X_n\}$ be a sequence of integer-valued random variables and let $f_n(k) = P\{X_n = k\}$, $k = 0, 1, 2, \ldots$, be their respective probability functions, $n = 1, 2, \ldots$. Additionally, let $X$ be a random variable with probability function $f(k) = P\{X = k\}$. Then $f_n(k) \to f(k)$ for all $k$ if and only if $X_n \xrightarrow{L} X$.
Proof: Homework. Use the definition of weak convergence and Example VI.3.

For the continuous case we need some more conditions.

Theorem VI-2 Continuous case
Let $X_n$, $n = 1, 2, \ldots$, and $X$ be continuous random variables with probability density functions $f_n$ and $f$, respectively, such that as $n \to \infty$, $f_n(x) \to f(x)$ for all real $x$, except perhaps for $x$ in a set with probability zero. Then $X_n$ converges in law to $X$.
Proof: See Scheffé, A useful convergence theorem for probability distributions, Ann. Math. Stat. 18 (1947), p. 434-438.

Theorem VI-3
Let $X_n$, $n = 1, 2, \ldots$, be a sequence of random variables such that $X_n \xrightarrow{L} X$ and let $c$ be a real constant. Then $X_n + c \xrightarrow{L} X + c$ and $cX_n \xrightarrow{L} cX$, $c \ne 0$.
Proof: Homework. Use the definition of weak convergence and Example VI.3.

B. Convergence in Probability

Convergence in probability is a slightly stronger concept than weak convergence.

Definition VI-2
Let $\{X_n\}$ be a sequence of random variables defined on a probability space $(\Omega, \mathcal{F}, P)$. $X_n$ is said to converge in probability to the random variable $X$ if for every $\varepsilon > 0$, the sequence of real numbers $p_n = P\{|X_n - X| > \varepsilon\} \to 0$ as $n \to \infty$. We write $X_n \xrightarrow{p} X$.

Remark VI-1
Convergence in probability of $X_n$ to the random variable $X$ does not imply convergence of the functions $X_n(\omega)$ in the real analysis sense. That is, it does not imply that, given $\varepsilon > 0$, we can find $N$ such that $|X_n - X| < \varepsilon$ for $n \ge N$. The definition only refers to the convergence of the sequence of real numbers (probabilities) $p_n$ to 0.
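Remark VI-1 can be illustrated with a short simulation. The Python sketch below is not part of the original notes. It assumes a hypothetical sequence of independent random variables with P{Xn = 1} = 1/n and P{Xn = 0} = 1 - 1/n; the independence is an added assumption, and the function names are illustrative. The probabilities pn = P{|Xn| > ε} = 1/n tend to 0 for 0 < ε < 1, yet along a single simulated realization the value 1 may still appear at large indices, so |Xn(ω)| < ε need not hold for all n beyond some N.

```python
import random


def exceed_prob(n, eps):
    """Exact p_n = P{|X_n - 0| > eps} when P{X_n = 1} = 1/n and P{X_n = 0} = 1 - 1/n."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    return 1.0 / n if eps < 1 else 0.0


def simulate_path(n_max, seed=0):
    """One realization X_1(w), ..., X_{n_max}(w); the X_n are drawn independently (assumption)."""
    rng = random.Random(seed)
    return [1 if rng.random() < 1.0 / n else 0 for n in range(1, n_max + 1)]


def indices_of_ones(path, start=1):
    """Indices n >= start at which the realized path takes the value 1."""
    return [n for n, x in enumerate(path, start=1) if n >= start and x == 1]


if __name__ == "__main__":
    # The sequence of real numbers p_n converges to 0.
    for n in (10, 100, 1000, 10000):
        print(f"n = {n:6d}   p_n = {exceed_prob(n, 0.5):.4f}")
    # A single realization can still take the value 1 at late indices.
    path = simulate_path(100_000)
    print("indices n >= 100 with X_n = 1:", indices_of_ones(path, start=100))
```

Under the independence assumption, the expected number of indices n ≤ N with Xn = 1 is 1 + 1/2 + … + 1/N, which grows without bound as N increases, even though pn tends to 0.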
Example VI.4
Let $\{X_n\}$ be a sequence of random variables with probability function given by $P\{X_n = 1\} = 1/n$ and $P\{X_n = 0\} = 1 - 1/n$. Then

$$P\{|X_n| > \varepsilon\} = \begin{cases} P\{X_n = 1\} = \dfrac{1}{n}, & 0 < \varepsilon < 1, \\ 0, & \varepsilon \ge 1. \end{cases}$$

Thus $P\{|X_n| > \varepsilon\} \to 0$ as $n \to \infty$, and it can be concluded that $X_n \xrightarrow{p} 0$.■

Theorem VI-4 Weak Law of Large Numbers (1)
Suppose $X_1, X_2, \ldots$ is a random sample from a distribution for which the mean is $\mu$ and for which the variance $\sigma^2$ exists. Then $\bar{X}_n \xrightarrow{p} \mu$, where $\bar{X}_n = \frac{1}{n}\sum_{j=1}^{n} X_j$ is the sample mean.
Proof: Apply Chebyshev's inequality to $\bar{X}_n$, which has mean $\mu$ and variance $\sigma^2/n$. For every $\varepsilon > 0$,

$$P(|\bar{X}_n - \mu| < \varepsilon) \ge 1 - \frac{\sigma^2}{n\varepsilon^2},$$

so that

$$\lim_{n \to \infty} P(|\bar{X}_n - \mu| < \varepsilon) = 1.$$■

This result has numerous applications and validates the use of observed frequencies when estimating the probability of success of a Bernoulli experiment. Suppose $X$ is a random variable with a Ber($p$) distribution in which the parameter $p$, $0 \le p \le 1$, is unknown. For this random variable, $P\{X = 1\} = p = 1 - P\{X = 0\}$, and its expected value and variance are, respectively, $E(X) = p$ and $Var(X) = p(1 - p)$. We might carry out an experiment, repeating $n$ independent Bernoulli trials and observing the number of successes $f$ out of those $n$ trials. The parameter $p$ may be estimated by $f/n$. If $n$ is sufficiently large, $f/n$ will be close to the unknown true value $p$ with high probability, because of the weak law of large numbers.

Figure VI-1 shows the cumulative value of $f/n$ after each trial in a simulation of coin tosses. It can be seen that despite much initial variation, the value of $f/n$ tends to stabilize near .5, as expected. The weak law of large numbers states that for $\varepsilon > 0$, $P\{|f/n - .5| > \varepsilon\}$ tends to zero as $n \to \infty$. This does not mean that the sequence $f/n \to .5$ in the real analysis sense; that is, it does not mean that given $\varepsilon > 0$, there exists an $N$ such that $|f/n - .5| < \varepsilon$ for all $n > N$. The strong law of large numbers, studied later, gives a much stronger result.
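The behavior described above can be reproduced with a short simulation. The following Python sketch is an illustration only, not the procedure used to produce the figure in the notes; it assumes a fair coin (p = 0.5), fixed seeds, and hypothetical function names. It computes the running frequency f/n and compares a Monte Carlo estimate of P{|f/n - p| > ε} with the Chebyshev bound p(1 - p)/(nε²) that underlies the proof of Theorem VI-4.

```python
import random


def running_frequencies(n_tosses, p=0.5, seed=0):
    """Return [f/1, f/2, ..., f/n_tosses], the running proportion of successes."""
    rng = random.Random(seed)
    successes = 0
    freqs = []
    for n in range(1, n_tosses + 1):
        if rng.random() < p:  # success with probability p
            successes += 1
        freqs.append(successes / n)
    return freqs


def chebyshev_bound(n, eps, p=0.5):
    """Chebyshev upper bound p(1 - p) / (n eps^2) on P{|f/n - p| > eps}, capped at 1."""
    return min(1.0, p * (1.0 - p) / (n * eps ** 2))


def estimate_deviation_prob(n, eps, p=0.5, reps=500, seed=0):
    """Monte Carlo estimate of P{|f/n - p| > eps} from `reps` independent experiments."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(reps):
        f = sum(1 for _trial in range(n) if rng.random() < p)
        if abs(f / n - p) > eps:
            exceed += 1
    return exceed / reps


if __name__ == "__main__":
    freqs = running_frequencies(1500)
    for n in (10, 100, 500, 1500):
        print(f"n = {n:5d}   f/n = {freqs[n - 1]:.4f}")
    for n in (50, 200, 800):
        est = estimate_deviation_prob(n, eps=0.05)
        bound = chebyshev_bound(n, 0.05)
        print(f"n = {n:4d}   estimate = {est:.3f}   Chebyshev bound = {bound:.3f}")
```

The Chebyshev bound is usually far from tight; the estimated deviation probabilities typically fall well below it, but both tend to zero as n grows, which is all the weak law asserts.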
Figure VI-1 Simulation of the Probability of Observing Heads in 1,500 Tosses of a Coin
[Figure: "Observed Probability of Heads in 1,500 Tosses". Horizontal axis: Toss (1 to 1,500). Vertical axis: Probability (0.30 to 1.00).]

Theorem VI-5 Weak Law of Large Numbers (2)
Let $\{X_n\}$ be a sequence of pairwise uncorrelated random variables with $E(X_i) = \mu_i$ and $Var(X_i) = \sigma_i^2$, $i = 1, 2, \ldots$. If $\sum_{i=1}^{n} \sigma_i^2 \to \infty$ as $n \to \infty$, then

$$\frac{\sum_{i=1}^{n} (X_i - \mu_i)}{\sum_{i=1}^{n} \sigma_i^2} \xrightarrow{p} 0 \quad \text{as } n \to \infty.$$

Proof: Homework. Use Chebyshev's inequality.

The following theorem generalizes the Weak Law of Large Numbers in that it only requires the existence of the first moment of the random variable; nothing is assumed about the variance. It also shows that the sample moments converge in probability to the moments of the random variables.

Theorem VI-6 Convergence in Probability of the Sample Moments
Let $\{X_n\}$ be a sequence of independent identically distributed random variables with $E|X_1|^k < \infty$ for some positive integer $k$. Then

$$\frac{1}{n}\sum_{j=1}^{n} X_j^k \xrightarrow{p} E(X_1^k) \quad \text{as } n \to \infty.$$

Example VI.5 Convergence of the Sample Variance
From Theorem VI-6, if $E(X_1^2) < \infty$, then

$$\frac{1}{n}\sum_{j=1}^{n} X_j^2 \xrightarrow{p} E(X_1^2).$$

We also have that

$$\left(\frac{1}{n}\sum_{j=1}^{n} X_j\right)^{2} \xrightarrow{p} \left(E(X_1)\right)^2.$$

Then

$$\frac{1}{n}\sum_{j=1}^{n} (X_j - \bar{X})^2 = \frac{1}{n}\sum_{j=1}^{n} X_j^2 - \left(\frac{1}{n}\sum_{j=1}^{n} X_j\right)^{2} \xrightarrow{p} E(X_1^2) - \left(E(X_1)\right)^2 = Var(X_1).$$■

Remark VI-2 Convergence in probability under diverse circumstances
1. $X_n \xrightarrow{p} X$ if and only if $X_n - X \xrightarrow{p} 0$.
2. If $X_n \xrightarrow{p} X$ and $X_n \xrightarrow{p} Y$, then $P\{X = Y\} = 1$.
Proof: Let $c > 0$. Then $P\{|X - Y| > c\} \le P\{|X_n - X| > c/2\} + P\{|X_n - Y| > c/2\} \to 0$ as $n \to \infty$. Since the left-hand side does not depend on $n$, $P\{|X - Y| > c\} = 0$ for every $c > 0$, and so $P\{X = Y\} = 1$.■
3. If $X_n \xrightarrow{p} X$, then $X_n - X_m \xrightarrow{p} 0$ as $n, m \to \infty$. The proof is similar to that of result 2 above.
4. If $X_n \xrightarrow{p} X$ and $Y_n \xrightarrow{p} Y$, then $X_n + Y_n \xrightarrow{p} X + Y$ and $X_n - Y_n \xrightarrow{p} X - Y$.
5. If $X_n \xrightarrow{p} X$ and $k$ is a constant, then $kX_n \xrightarrow{p} kX$.
6. If $X_n \xrightarrow{p} k$ and $k$ is a constant, then $X_n^2 \xrightarrow{p} k^2$.
7. If $X_n \xrightarrow{p} a$ and $Y_n \xrightarrow{p} b$, with $a$, $b$ real constants, then $X_n Y_n \xrightarrow{p} ab$.
Proof: Use the fact that

$$X_n Y_n = \frac{(X_n + Y_n)^2 - (X_n - Y_n)^2}{4} \xrightarrow{p} \frac{(a + b)^2 - (a - b)^2}{4} = ab.$$■

8. If $X_n \xrightarrow{p} 1$, then $1/X_n \xrightarrow{p} 1$.
Proof: Consider

$$P\left\{\left|\frac{1}{X_n} - 1\right| \ge \varepsilon\right\} = P\left\{\frac{1}{X_n} \ge 1 + \varepsilon\right\} + P\left\{\frac{1}{X_n} \le 1 - \varepsilon\right\}.$$

9. If $X_n \xrightarrow{p} a$ and $Y_n \xrightarrow{p} b$, with $a$, $b$ real constants and $b \ne 0$, then $X_n / Y_n \xrightarrow{p} a/b$.
10. If $X_n \xrightarrow{p} X$ and $Y$ is a random variable, then $X_n Y \xrightarrow{p} XY$.
11. If $X_n \xrightarrow{p} X$ and $Y_n \xrightarrow{p} Y$, then $X_n Y_n \xrightarrow{p} XY$.■

Theorem VI-7
Let $X_n \xrightarrow{p} X$ and let $g$ be a continuous function defined on $\mathbb{R}$. Then $g(X_n) \xrightarrow{p} g(X)$ as $n \to \infty$.
Proof: Since $X$ is a random variable, for a given $\varepsilon > 0$ we can find a constant $k = k(\varepsilon)$ such that $P\{|X| > k\} < \varepsilon/2$. Also, since $g$ is continuous, $g$ is uniformly continuous on the closed interval $[-k, k]$. It then follows that there exists a number $\delta = \delta(\varepsilon, k)$ such that $|g(x_n) - g(x)| < \varepsilon$ whenever $|x| \le k$ and $|x_n - x| < \delta$. Let $A = \{|X| \le k\}$, $B = \{|X_n - X| < \delta\}$ and $C = \{|g(X_n) - g(X)| < \varepsilon\}$. Then $\omega \in A \cap B$ implies that $\omega \in C$, so that $A \cap B \subseteq C$ or, equivalently, $C^c \subseteq A^c \cup B^c$. Thus $P\{C^c\} \le P\{A^c\} + P\{B^c\}$. In other words,

$$P\{|g(X_n) - g(X)| \ge \varepsilon\} \le P\{|X_n - X| \ge \delta\} + P\{|X| > k\} < \varepsilon$$

for $n \ge N(\varepsilon, \delta, k)$, where $N(\varepsilon, \delta, k)$ is chosen so that $P\{|X_n - X| \ge \delta\} < \varepsilon/2$ for $n \ge N(\varepsilon, \delta, k)$.■

Definition VI-3 Limit Superior and Limit Inferior
Let $\{A_n\}$ be a sequence of sets. The set of all points $\omega \in \Omega$ that belong to $A_n$ for infinitely many values of $n$ is called the limit superior of the sequence, denoted by $\limsup A_n$. The set of all points that belong to $A_n$ for all but a finite number of values of $n$ is the limit inferior of the sequence $\{A_n\}$ and is denoted by $\liminf A_n$.

Remark VI-3
1. If $\liminf A_n = \limsup A_n$, we say that the limit exists and call the common set the limit set.
2. We have

$$\liminf_{n \to \infty} A_n = \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} A_k \subseteq \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k = \limsup_{n \to \infty} A_n.$$

3. A sequence $\{A_n\}$ such that $A_n \subseteq A_{n+1}$ for $n = 1, 2, \ldots$ is called nondecreasing, while if $A_{n+1} \subseteq A_n$ for $n = 1, 2, \ldots$, it is called nonincreasing.

Theorem VI-8 Relationship between convergence in probability and convergence in distribution
If $X_n \xrightarrow{p} X$ then $X_n \xrightarrow{L} X$.
Proof: Let $F_n$ and $F$ be the distribution functions of $X_n$ and $X$ respectively. Then

$$\{\omega; X(\omega) \le x'\} = \{\omega; X_n(\omega) \le x, X(\omega) \le x'\} \cup \{\omega; X_n(\omega) > x, X(\omega) \le x'\} \subseteq \{\omega; X_n(\omega) \le x\} \cup \{\omega; X_n(\omega) > x, X(\omega) \le x'\}.$$

Then $F(x') \le F_n(x) + P\{X_n > x, X \le x'\}$. Since $X_n \xrightarrow{p} X$, we have for $x' < x$ that

$$P\{X_n > x, X \le x'\} \le P\{|X_n - X| > x - x'\} \to 0 \quad \text{as } n \to \infty.$$

Thus $F(x') \le \liminf F_n(x)$ for $x' < x$. Similarly, interchanging the roles of $X$ and $X_n$, we obtain $\limsup F_n(x) \le F(x'')$ for $x < x''$. Therefore, for $x' < x < x''$,

$$F(x') \le \liminf F_n(x) \le \limsup F_n(x) \le F(x'').$$

Since $F$ has at most a countable number of discontinuities, we choose $x$ to be a point of continuity of $F$ and let $x''$ decrease to $x$ and $x'$ increase to $x$. Then $\lim F_n(x) = F(x)$ at all points of continuity of $F$.■

Theorem VI-9
Let $k$ be a real constant. If $X_n \xrightarrow{L} k$ then $X_n \xrightarrow{p} k$.
Proof: Omitted.

Example VI.6 Weak convergence in general does not imply convergence in probability
Let $X, X_1, X_2, \ldots$ be identically distributed random variables and let the joint distribution of $(X_n, X)$ be given by the following table:

              X_n = 0   X_n = 1   Total
X = 0            0        1/2      1/2
X = 1           1/2        0       1/2
Total           1/2       1/2       1

It is easy to see that $X_n \xrightarrow{L} X$, since $X_n$ and $X$ have the same distribution, but

$$P\{|X_n - X| > \tfrac{1}{2}\} \ge P\{|X_n - X| = 1\} = P\{X_n = 0, X = 1\} + P\{X_n = 1, X = 0\} = 1,$$

thus $X_n$ does not converge in probability to $X$.■

Remark VI-4
$X_n \xrightarrow{p} X$ does not imply that $E(X_n^k) \to E(X^k)$ for any positive integer $k$.

Theorem VI-10
Let $\{X_n, Y_n\}$, $n = 1, 2, \ldots$, be a sequence of pairs of random variables. If $|X_n - Y_n| \xrightarrow{p} 0$ and $Y_n \xrightarrow{L} Y$, then $X_n \xrightarrow{L} Y$.
Proof: Omitted.

Corollary VI-1
If $X_n \xrightarrow{p} X$ then $X_n \xrightarrow{L} X$.
Proof: Omitted.

Remark VI-5
Let $\{X_n, Y_n\}$, $n = 1, 2, \ldots$, be a sequence of pairs of random variables and let $c$ be a real constant. Then
1. If $X_n \xrightarrow{L} X$ and $Y_n \xrightarrow{p} c$, then $X_n \pm Y_n \xrightarrow{L} X \pm c$.
2. If $X_n \xrightarrow{L} X$ and $Y_n \xrightarrow{p} c$, then $X_n Y_n \xrightarrow{L} cX$ if $c \ne 0$, and $X_n Y_n \xrightarrow{p} 0$ if $c = 0$.
3. If $X_n \xrightarrow{L} X$ and $Y_n \xrightarrow{p} c$, then $X_n / Y_n \xrightarrow{L} X/c$ if $c \ne 0$.

C. Almost Sure Convergence

The results in this section will be stated without proof. They require deeper knowledge of Real Analysis and the use of an important result called the Borel-Cantelli Lemma, which is beyond the scope of this course. The reader is referred to other sources.

Definition VI-4 Convergence with Probability 1
Let $\{X_n\}$ be a sequence of random variables defined on a probability space $(\Omega, \mathcal{F}, P)$. $X_n$ is said to converge almost surely (a.s.) to the random variable $X$ if and only if $P\{\omega; X_n(\omega) \to X(\omega) \text{ as } n \to \infty\} = 1$. We also say that $X_n$ converges to $X$ with probability 1, and write $X_n \xrightarrow{a.s.} X$.

Theorem VI-11
If the sequence of random variables $\{X_n\}$ converges almost surely to $X$, then it converges in probability to $X$.

Theorem VI-12
Let $\{X_n\}$ be a strictly decreasing sequence of positive random variables, and suppose that $X_n \xrightarrow{p} 0$. Then $X_n \xrightarrow{a.s.} 0$.

Definition VI-5 Strong Law of Large Numbers
A sequence of random variables $\{X_n\}$ is said to obey the Strong Law of Large Numbers with respect to a sequence of constants $\{B_n\}$, $B_n > 0$ and $B_n \to \infty$, if there exists another sequence of constants $\{A_n\}$ such that

$$\frac{\sum_{j=1}^{n} X_j - A_n}{B_n} \xrightarrow{a.s.} 0 \quad \text{as } n \to \infty.$$

Theorem VI-13 Strong Law of Large Numbers (1)
Let $\{X_n\}$ be a sequence of independent random variables. If $\sum_{n=1}^{\infty} Var(X_n) < \infty$, then $\sum_{n=1}^{\infty} (X_n - E(X_n))$ converges almost surely.

The following version of the Strong Law of Large Numbers only requires that the common first moment be finite. No mention of the variance or other moments is needed.

Theorem VI-14 Strong Law of Large Numbers (Kolmogorov)
Let $X_1, X_2, \ldots$ be a sequence of independent identically distributed random variables with common distribution function $F$. Then

$$\frac{1}{n}\sum_{j=1}^{n} X_j \xrightarrow{a.s.} \mu \quad (\mu \text{ finite})$$

if and only if $E(|X|) < \infty$, and then $\mu = E(X)$.

D. Limits of Moment Generating Functions

Questions may arise where there is a sequence of random variables $X_1, X_2, \ldots$ with corresponding distribution functions $F_1, F_2, \ldots$, and where the moment generating function $M_n(t)$ of $F_n$ exists. The conditions under which $M_n(t)$ converges to a moment generating function are studied here.

Example VI.7
Let $\{X_n\}$ be a sequence of random variables with probability function $P\{X_n = -n\} = 1$, for $n = 1, 2, 3, \ldots$. Then
a. $M_n(t) = E(\exp(tX_n)) = e^{-tn} \to 0$ as $n \to \infty$ for all $t > 0$,
b. $M_n(t) \to \infty$ for all $t < 0$, and
c. $M_n(t) \to 1$ at $t = 0$.
Thus the pointwise limit of $M_n(t)$ is not a moment generating function. The distribution function $F_n$ of $X_n$ is given by $F_n(x) = 0$ if $x < -n$ and $F_n(x) = 1$ if $x \ge -n$, which converges to the function $F(x) = 1$ for all $x$, which in turn is not a distribution function.

Moreover, even when a sequence of random variables with existing moment generating functions converges in law to a random variable $X$ whose moment generating function exists, the sequence of moment generating functions need not converge to the moment generating function of $X$. The conditions for convergence of a sequence of moment generating functions to another moment generating function are given below.■

Theorem VI-15 Continuity Theorem
Let $\{F_n\}$ be a sequence of distribution functions with corresponding moment generating functions $M_n(t)$, and suppose that $M_n(t)$ exists for $|t| \le t_0$ for every $n$. If there exists a distribution function $F$ with corresponding moment generating function $M$ which exists for $|t| \le t_1 < t_0$, such that $M_n(t) \to M(t)$ as $n \to \infty$ for every $t \in [-t_1, t_1]$, then $F_n$ converges weakly to $F$.

Example VI.8 Convergence of the Binomial Distribution to the Poisson Distribution
Let $X_1, X_2, \ldots$ be independent identically distributed $B(1, p)$ random variables. Let $S_n = X_1 + X_2 + \cdots + X_n$, and let $M_n(t)$ be the moment generating function of $S_n$. Then $M_n(t) = (1 - p + pe^t)^n$ for all $t$.
If we let $n \to \infty$ in such a way that $np$ remains constant at, say, $\lambda$, then $p = \lambda/n$ and

$$M_n(t) = \left(1 - \frac{\lambda}{n} + \frac{\lambda}{n}e^{t}\right)^{n} = \left\{1 + \frac{\lambda(e^{t} - 1)}{n}\right\}^{n} \to \exp\{\lambda(e^{t} - 1)\},$$

which is the moment generating function of a Poisson random variable.■

Example VI.9 Convergence of the Poisson Distribution to the Normal Distribution
Let $X$ be a random variable with the $P(\lambda)$ distribution. Its moment generating function $M(t) = \exp\{\lambda(e^{t} - 1)\}$ was shown in Example VI.8 above. Standardize $X$ into

$$Y = \frac{X - \lambda}{\sqrt{\lambda}}.$$

The moment generating function of $Y$ is then

$$M_Y(t) = e^{-t\sqrt{\lambda}}\, M\!\left(\frac{t}{\sqrt{\lambda}}\right).$$

Also,

$$\log M_Y(t) = -t\sqrt{\lambda} + \log M\!\left(\frac{t}{\sqrt{\lambda}}\right) = -t\sqrt{\lambda} + \lambda\left(e^{t/\sqrt{\lambda}} - 1\right)$$

$$= -t\sqrt{\lambda} + \lambda\left(\frac{t}{\sqrt{\lambda}} + \frac{t^2}{2\lambda} + \frac{t^3}{3!\,\lambda^{3/2}} + \cdots\right) = \frac{t^2}{2} + \frac{t^3}{3!\,\sqrt{\lambda}} + \cdots \to \frac{t^2}{2} \quad \text{as } \lambda \to \infty.$$

So $\log M_Y(t) \to t^2/2$ as $\lambda \to \infty$, and therefore $M_Y(t) \to \exp(t^2/2)$ as $\lambda \to \infty$, which is the moment generating function of a standard normal random variable.■

This result is a special case of the convergence of the distribution of a random variable to the normal distribution. More general conditions for convergence to the normal distribution will be studied below.

D. The Central Limit Theorem