PROBLEMS IN OPTIMISATION 7 INFINITE HORIZON DYNAMIC PROGRAMMING 1. Statement of the problem Definition 1.1. Stationary Discounted Programming Problem, SDP There are six elements required to define this class of dynaami programming problem. {S, A, , r, f, } (1) The state space S. (2) The action space A. (3) The feasible action correspondence : S ! P(A) (4) The reward function r : S × A ! R (5) The transition function f : S × A ! S (6) The one-period discount factor 2 [0, 1). The optimisation problem is: max (a0,a1,...) 1 Xt=0 tr(st, at) s.t. s0 = s, st = f(st−1, at−1), t = 1, 2, ..., at 2 (st), t = 0, 1, 2... Example 1.2. S = {s 2 R|s 0}, A = {a 2 R|a 0}, r(s, a) = pa, f(s, a) = s − a, The discount factor 2 [0, 1). The optimisation problem is: max (a0,a1,...) 1 Xt=0 tr(st, at) 12 7 INFINITE HORIZON DYNAMIC PROGRAMMING = max (a0,a1,...) 1 Xt=0 tpat s.t. s0 = s, st = f(st−1, at−1), t = 1, 2, ..., at 2 (st), t = 0, 1, 2... Example 1.3. S = {1,−1} A = {0, 1} (s) = {0, 1} for every s. r(s, a) = 8>><>>: 0 if s = 1, a = 1, 2 if s = 1, a = 0, −1 if s = −1, a = 1, 0 if s = −1, a = 0 f(s, a) = 1 if a = 1, −1 if a = 0 The discount factor 2 [0, 1) Note that (1) r(1, 1) = 0, r(1, 0) = 2, r(−1, 1) = −1, r(−1, 0) = 0 (2) f(s, 1) = 1, f(s, 0) = −1, 8s Let s0 = 1 (s0) = {0, 1} Let a0 = 0 r(s0, a0) = r(1, 0) = 2 s1 = f(s0, a0) = f(1, 0) = −1 Let a1 = 1 r(s1, a1) = r(−1, 1) = −1 s2 = f(s1, a1) = f(−1, 1) = 1 ............... Definition 1.4. 8.2 (t-histories) A t-history for the SDP is a finite sequence: ht = (s0, a0, ..., st−1, at−1, st)OPTIMIZATION THEORY 3 which describes the sequence of states and actions the systte has gone through to arrive at st from the initial state s0. We denote the time t state st of a history ht by st[ht]. Example 1.5. In example 1.3 h0 = (s0) = (1), s0[h0] = s0 = 1 h1 = (s0, a0, s1) = (1, 0,−1), s1[h1] = s1 = −1 h2 = (s0, a0, s1, a1, s2) = (1, 0,−1, 1, 1), s2[h2] = s2 = 1 Definition 1.6. 8.3 (The history space Ht) For t = 0, define H0 = S. For all other t = 1, 2, ..., define Ht to be the collection of all possible t-histories. The Ht are referred to as history spaces. Definition 1.7. 8.4 (SDP strategies) A strategy for the SDP is an infinite sequence (t)1t=0 of functions t : Ht ! A such that: t(ht) 2 t(st[ht]) The strategy space P of the SDP is the set of all strategiie for it. For a given SDP with initial state s, we can consider what happens when we follow some particular strategy . The strategy prescribes an action at each time t = 0, 1, ..., which in turn generates an infinite sequence of rewards period t rewards: rt(, s) = r(st(, s), at(, s)) Example 1.8. In example 1.5, for t = 0, 1..., t(ht) 2 t(st[ht]) = {0, 1} Therefore 2 {0, 1}1 (1) (0) = (0, 0, ....) is a strategy. This means at = 0, 8t Let s0 = 1 r(s0, a0) = r(1, 0) = 24 7 INFINITE HORIZON DYNAMIC PROGRAMMING s1 = f(s0, a0) = f(1, 0) = −1 a2 = 0 r(s1, a1) = r(−1, 0) = 0 s2 = f(s1, a1) = f(−1, 0) = −1 ..... h0 = (s0) = (1), s0[h0] = s0 = 1 h1 = (s0, a0, s1) = (1, 0,−1), s1[h1] = s1 = −1 h2 = (s0, a0, s1, a1, s2) = (1, 0,−1, 0,−1), s2[h2] = s2 = −1 ..................... Let s0 = −1 r(s0, a0) = r(−1, 0) = 0 s1 = f(s0, a0) = f(−1, 0) = −1 a2 = 0 r(s1, a1) = r(−1, 0) = 0 s2 = f(s1, a1) = f(−1, 0) = −1 ..... h0 = (s0) = (−1), s0[h0] = s0 = −1 h1 = (s0, a0, s1) = (−1, 0,−1), s1[h1] = s1 = −1 h2 = (s0, a0, s1, a1, s2) = (−1, 0,−1, 0,−1), s2[h2] = s2 = −1 (2) (1) = (1, 1, ....) is a strategy. This means at = 1, 8t Let s0 = 1 r(s0, a0) = r(1, 1) = 0 s1 = f(s0, a0) = f(1, 1) = 1 a1 = 1 r(s1, a1) = r(1, 1) = 0 s2 = f(s1, a1) = f(1, 1) = 1 ..... h0 = (s0) = (1), s0[h0] = s0 = 1 h1 = (s0, a0, s1) = (1, 1, 1), s1[h1] = s1 = 1OPTIMIZATION THEORY 5 h2 = (s0, a0, s1, a1, s2) = (1, 1, 1, 1, 1), s2[h2] = s2 = 1 ..................... Let s0 = −1 r(s0, a0) = r(−1, 1) = −1 s1 = f(s0, a0) = f(−1, 1) = 1 a2 = 1 r(s1, a1) = r(1, 1) = 0 s2 = f(s1, a1) = f(1, 1) = 1 ..... h0 = (s0) = (−1), s0[h0] = s0 = −1 h1 = (s0, a0, s1) = (−1, 1, 1), s1[h1] = s1 = 1 h2 = (s0, a0, s1, a1, s2) = (−1, 1, 1, 1, 1), s2[h2] = s2 = 1 Definition 1.9. (The value of a strategy) The value W()(s) of a strategy , given an initial state s, is the total discounted reward under : W()(s) = 1 Xt=0 trt(, s) Our maximization problem is then to determine the strateeg which maximises this quantity for a given initial state s. Example 1.10. S = {s 2 R|s 0}, A = {a 2 R|a 0}, r(s, a) = pa, f(s, a) = s − a, The discount factor 2 [0, 1). consider the strategy (0) in which at = 0 for all t r(s, a) = r(s, 0) = 0 f(s, a) = f(s, 0) = s W((0))(s) = 1 Xt=0 trt((0), s) = 06 7 INFINITE HORIZON DYNAMIC PROGRAMMING Definition 1.11. (The value function) The value function V : S ! R of the SDP is V (s) = sup 2PW()(s) Lemma 1.12. 8.1 Suppose that for a given SDP with initial state s0, there exists a K > 0 such that: |r(s, a)| K, 8(s, a) 2 S × A Then the value V (s) exists for all s 2 S. Definition 1.13. 8.7 (Optimal strategies) An optimal strategy for the SDP is a strategy such that: W()(s) = V (s) 2. 8.7 Theorem 2.1. 8.1 (The (Bellman) Principle of Optimality) Suppose we are given an SDP such that the value function V : S ! R is well-defined. Then for each s 2 S the value function satisfies the following Bellman equation V (s) = sup a2(s) {r(s, a) + V (f(s, a))} Theorem 2.2. Suppose that for a given SDP with initial state s0, there exists K > 0 such that |r(s, a)| K for all (s, a) 2 S × A. Then a strategy 2 P is optimal strategy if and only if for all s 2 S we have: W()(s) = supa2(s) {r(s, a) + W()(f(s, a))}. Definition 2.3. (Stationary strategies) A stationary strategy is a strategy in the sense of Definitiio 1.11 which only depends on the state of the program. Hence: t(ht) t(st[ht]), 8t = 0, 1, .... An optimal strategy which is also a stationary strategy is called a stationary optimal strategy.OPTIMIZATION THEORY 7 Theorem 2.4. 8.3 Suppose we are given an SDP with bounded reward functiion Then, under suitable conditions, this program has a stationary optimal strategy . Moreover, satisfies, for each s 2 S: W()(s) = maxa2(s) {r(s, a) + W()(f(s, a))} Example 2.5. 8.1 Suppose we are given the SDP: S = {s 2 R|s 0}, A = {a 2 R|a 0}, r(s, a) = pa, f(s, a) = s − a, (s) = {a 2 R|0 a s}, with s0 = W > 0 and discount factor 2 [0, 1). This problem is equivalent to the problem: max(a0,al,...)P1i=1 tpat , s.t. s0 = W, st = st−1 − at−1, t = 1, 2, ... , 0 at st, t = 0, 1, ... The problem is also equivalent to: max(a0,al,...) tpat, s.t. P1i=1 at W, at 0, t = 0, 1, .... Solution Let (0) be the strategy, which dictates that a = 0 for all s 2 S. r(s, 0) = p0 = 0 W((0))(s) = P1t=0 tr(s, 0) = 0. Thus, we have to maximise over a 2 [0, s], r(s, a)+W((0))(f(s, a)) = pa + × 0 = pa Maximum of this over a 2 [0, s] is ps. This then gives us our next strategy (1) which dictates taking a = s for all s 2 S. Then f(s, a) = 0, at one period in time, say t = . We8 7 INFINITE HORIZON DYNAMIC PROGRAMMING would be left with nothing in the following periods, forcing at = 0, for t > . Thus, this strategy has value: W((1))(s) = ps +P1t=1 × 0 = ps. The maximum over a 2 (s) = [0, s] of: r(s, a)+W((1))(f(s, a)) = pa+W((1))(s−a) = pa+ ps − a. This maximum must occur either at an interior point, which can be found by calculus, since the function is clearly differenttiable or at the endpoints a = 0 or a = s. Writing g(a) = pa + ps − a, the first-order condition gives: 0 = g0(a) = 1 2pa − 2ps−a ) a= s 1+2 , where g(a) = p(1 + 2)s > g(s) = ps > g(0) = ps. So the maximum does indeed occur at the interior point a. Thus: maxa2(s) r(s, a) + W((1))f(s, a)= p(1 + 2)s and occurs by taking, for every s 2 S, the action a = s 1+2 . A revised guess at the optimal strategy might be of the form: for all s 2 S, take the action a = ks, for some k 2 (0, 1). The value of this strategy should be W()(s) = Mps for some M > 0. If this is to be true, let us see if we can find some k for which this strategy is guaranteed to be optimal. Suppose we are in initial state s0 = s, for some s 2 S. Our strategy dictates we take the action a = ks. In the next time period, our state will be s1 = s0 −a0 = (1−k)s. From the paragraph above, we are assuming that the discouunte rewards from future periods, under this strategy , sum to: W()(s1) = Mp(1 − k)s. We must therefore have: Mps = W()(s) = pks+Mp(1 − k)s. On rearranging we can now relate our two constants k and M according to the relationship: pk = M(1 − p1 − k).OPTIMIZATION THEORY 9 To show that is optimal, we must also show that our hypothesis for W()(s) satisfies the Bellman equation theorre 2.4. That is, it must equal the maximum value over a 2 (s) = [0, s] of: r(s, a) + W()(f(s, a)) = pa + Mps − a. Denoting this function by g(a), we see that this attains its maximum eithhe at an interior point, which we can determine through first-order conditions, or at one of the end-points a = 0 and a = s. The first-order conditions for g yield: 0 = g0(a) = 1 2pa − M 2ps−a ) .ps − a = Mpa. This has just one solution, which our hypothesis would like to be a = ks. This would require: p1 − k = Mpk. We have therefore determined two equations relating k and M. so that, on solving for both, we are left with: k = 1 − 2,M = 1 1−2 . Thus, the Bellman equations are satisfied and we conclude that , which dictates a = (1 − 2)s for any s2S, is a statioonar optimal strategy for the SDP. The total reward for this strategy, given an initial state s, is given by s 1−2 , so that, for the initial state s0 = W, the value of this problem is: V (W) = W 1−2 Problem 2.6. An infinite-horizon stationary dynamic progrra with discount factor 2 [0, 1) has two states : S = {1,−1}. The action space also has only two elements A = {0, 1}, and the feasible actions are independent of the state, so (s) = {0, 1} for every s. The reward function is given by r(s, a) = 8>><>>: 0 if s = 1, a = 1, 2 if s = 1, a = 0, −1 if s = −1, a = 1, 0 if s = −1, a = 010 7 INFINITE HORIZON DYNAMIC PROGRAMMING And the transition function is f(s, a) = 1 if a = 1, −1 if a = 0 As a first attempt to find an optimal strategy for this progrram it is suggested to always take action 0, so set a(s) = 0 for all s. Call this strategy 0. (1) For initial state s0 = 1, find the states st and the actions at, t = 0, 1, ..., when following strategy 0. Do the same when the initial state is s0 = −1. (2) Determine the values W(0)(1) and W(0)(−1) of the strategy 0. (3) Prove that 0 is an optimal strategy if 1/2. (4) Find a strategy 1 that is better ( in the sense that the total discounted reward is larger ) than 0 for > 1/2. (You don’t have to prove it is optimal. ) Solution Note that (1) r(1, 1) = 0, r(1, 0) = 2, r(−1, 1) = −1, r(−1, 0) = 0 (2) f(s, 1) = 1, f(s, 0) = −1 (1) (a) Let s0 = 1 With the strategy 0, at = 0 for all t = 0, 1, .... s1 = f(s0, a0) = f(1, 0) = −1 For all t = 1, ..., st = −1 Thus in this case s0 = 1 and st = −1 for t = 1, 2, ... (b) Let s0 = −1 With the strategy 0, at = 0 for all t = 0, 1, .... s1 = f(s0, a0) = f(−1, 0) = −1 For all t = 1, ..., st = −1 Thus in this case s0 = 1 and st = −1 for t = 1, 2, ... (2) (a) Let s0 = 1 With the strategy 0, at = 0 for all t = 0, 1, ....OPTIMIZATION THEORY 11 r(s0, a0) = r(1, 0) = 2. For t = 1, 2, ..., st = −1 r(st, at) = r(−1, 0) = 0. W(0)(1) = P1t=0 tr(st, at) = r(s0, a0)+P1t=1 tr(st, at) = 2 +P1t=1 tr(−1, 0) = 2 + 0 = 2. (b) Let s0 = −1 With the strategy 0, at = 0 for all t = 0, 1, .... r(s0, a0) = r(−1, 0) = 0. For t = 1, 2, ..., st = −1 r(st, at) = r(−1, 0) = 0. W(0)(−1) = P1t=0 tr(st, at) = r(s0, a0)+P1t=1 tr(st, at) = 0 +P1t=1 tr(−1, 0) = 0 + 0 = 0. (3) To show that 0 is an optimal strategy, we must check if for all s 2 S we haveW(0)(s) = maxa2(s) {r(s, a) + W(0)(f(s, a))} W(0)(1) = 2 W(0)(−1) = 0. (a) Let s = 1. a 2 (1) = {0, 1} Let g(a) = r(1, a) + W(0)(f(1, a)) g(0) = r(1, 0)+W(0)(f(1, 0)) = 2+W(0)(−1) = 2 + 0 = 2 g(1) = r(1, 1)+W(0)(f(1, 1)) = 0+W(0)(1) = 2Since < 1, 2< 2, so the maximum is attained at a = 0, and gives the value 2. This is the same as W(0)(1), as required. (b) Let s = −1. a 2 (−1) = {0, 1} Let h(a) = r(−1, a) + W(0)(f(−1, a)) h(0) = r(−1, 0)+W(0)(f(−1, 0)) = 0+W(0)(−1) = 012 7 INFINITE HORIZON DYNAMIC PROGRAMMING h(1) = r(−1, 1)+W(0)(f(−1, 1)) = −1+W(0)(1) = −1 + 2Since 1/2, we have −1 + 20, so a maximmu is attained at a = 0 and gives the value 0. This value is equal to W(0)(−1). SoW(0)(s) = maxa2(s) {r(s, a) + W(0)(f(s, a))} for all s 2 S, which proves that 0 is an optimal strategy. (4) If we look back at the analysis in (iii), then we see that for s = 1 the maximum of r(s, a)+W(0)(f(s, a)) is attained at a = 0, for any < 1. But for s = −1, we have that −1 + 2> 0 if > 1/2, so the maximum is attained at a = 1. So an improved strategy in the case > 1/2 would be a(s) = 0 if s = 1, 1 if s = −1 Problem 2.7. (1) What are the six elements required to define an infinite horizon, stationary, dynamic programmmin problem? Consider the problem of a plannne who, in each period t, must choose quantity ct of good 1 and quantity kt + 1 of good 2 to maximise P1t=0 t ln(ct), subject to : ct, kt > 0 and kt+1 = kt − ct, for t = 0, 1, 2, ... Here k0 > 0 is given, and , 2 (0, 1) are constants. (2) Interpreting ct as the action and kt as the state at time t, formulate the planner’s problem as an infinite horizon, stationary, dynamic programming problem. (3) Assuming a value function exists for this problem, write down the Bellman Equation that it must satisfy. The value function for this problem can be obtained by constructing a sequence (V (j)) of value functions, and their associated strategies, recursively by :OPTIMIZATION THEORY 13 V (0)(k) = 0, V (j+1)(k) = supc2(k) r(k, c) + V (j)(f(k, c))for all k 2 S and j = 0, 1, .... (4) Use equation (1) to find V (1) and V (2). From the resuult in (d), we can guess that the value function for this problem takes the form: V (k) = + ln(k) for some constants , 2 R. (5) Verify that (2) satisfies the Bellman Equation and deterrmin the value of in terms of , and . Use your answer to show that the optimal strategy is c = (1 − )k. Answer (1) The six elements are (a) the state space S, (b) the action space A, (c) the reward function r : S × A ! R, (d) the transition function f : S × A ! S, (e) the feasible action correspondence : S ! P(A), (f) and the discount factor 2 [0, 1). (2) Using kt for the state variable and ct as the action variable, from the constraint we have kt+1 = k− ct. That gives us the following collection of elements of the SDP: S = {k 2 R|k > 0} A = {c 2 R|c > 0} r(k, c) = ln(c) f(k, c) = k− c = What is left is the feasible action correspondence. Becaaus f(k, c) should be non-negative, we must take c < k. So we get (k) = {c 2 R|0 < c < k} For the initial state we have k014 7 INFINITE HORIZON DYNAMIC PROGRAMMING (3) The Bellman Equation is V (k) = supc2(k) {r(k, c) + V (f(k, c))} , for all k 2 S. (4) Since V (0)(k) = 0, we get for j = 1: V (1)(k) = supc2(k) r(k, c) + V (0)(f(k, c))= sup0 0 For c > c, c(1 + ) > k) g0(c) < 0 Therefore g(c) is a local maximum of g(c). So we can conclude that V (2)(k) = ln k1++ ln k− k1+= ln k1++ ln k1+= ln(k) + 1 1++ 2ln(k) + ln 1+= ln 1 1++ ln 1++ (+ 2) ln(k). (5) From the results in (4), we can guess that the value function for this problem takes the form: V (k) = + ln(k) where = ln 1 1++ ln 1+and = (+ 2) The Bellman Equation, takes the form, V (k) = supc2(k) {r(k, c) + V (f(k, c))}OPTIMIZATION THEORY 15 = sup0