MDP-calculation-exercise

Tags: MDP Re-Le
Categories: Introduction to Artificial Intelligence

discussion3

exercise1

In micro-blackjack, you repeatedly draw a card (with replacement) that is equally likely to be a 2, 3, or 4. You can either Draw or Stop if the total score of the cards you have drawn is less than 6. If your total score is 6 or higher, the game ends, and you receive a utility of 0. When you Stop, your utility is equal to your total score (up to 5), and the game ends. When you Draw, you receive no utility. There is no discount ($\gamma = 1$). Let's formulate this problem as an MDP with the following states: 0, 2, 3, 4, 5 and a Done state, for when the game ends.

a. Transition and Reward Functions

First, we define the basic components of the MDP:

  • States: $S = \{0, 2, 3, 4, 5, \text{Done}\}$, where Done is the terminal state.
  • Actions: $A(s) = \{\text{Draw}, \text{Stop}\}$ for $s < 6$.
  • Discount factor: $\gamma = 1$.

Reward function $R(s, a, s')$

The reward function specifies the immediate reward received for taking action $a$ in state $s$ and landing in state $s'$.

  1. Action Draw: drawing a card yields no immediate reward, regardless of the outcome: $R(s, \text{Draw}, s') = 0$.
  2. Action Stop: the player receives a reward equal to the current total and the game ends: $R(s, \text{Stop}, \text{Done}) = s$.

Transition function $T(s, a, s')$

The transition function specifies the probability of landing in state $s'$ after taking action $a$ in state $s$.

  1. Action Stop: the game ends immediately, transitioning deterministically to Done: $T(s, \text{Stop}, \text{Done}) = 1$.
  2. Action Draw: a 2, 3, or 4 is drawn, each with probability $\frac{1}{3}$. If the new total $s' \ge 6$, the player busts and the state transitions to Done. (These tables are encoded compactly in the sketch after this list.)
    • From state 0:
      • $T(0, \text{Draw}, 2) = 1/3$
      • $T(0, \text{Draw}, 3) = 1/3$
      • $T(0, \text{Draw}, 4) = 1/3$
    • From state 2:
      • $T(2, \text{Draw}, 4) = 1/3$
      • $T(2, \text{Draw}, 5) = 1/3$
      • $T(2, \text{Draw}, \text{Done}) = 1/3$ (bust, since $2+4=6$)
    • From state 3:
      • $T(3, \text{Draw}, 5) = 1/3$
      • $T(3, \text{Draw}, \text{Done}) = 2/3$ (bust, since $3+3=6$ and $3+4=7$)
    • From state 4:
      • $T(4, \text{Draw}, \text{Done}) = 1$ (every card busts)
    • From state 5:
      • $T(5, \text{Draw}, \text{Done}) = 1$ (every card busts)
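
To make the formulation concrete, here is a minimal sketch of the same model as plain Python dictionaries. The names `T`, `R`, `DONE`, `STATES`, and `CARDS` are my own choices, not part of the exercise.

```python
# Micro-blackjack MDP from part (a): states 0, 2, 3, 4, 5 plus a terminal state.
DONE = 'Done'
STATES = [0, 2, 3, 4, 5]
CARDS = [2, 3, 4]  # each drawn with probability 1/3

T = {}  # T[(s, a)] maps successor state -> probability
R = {}  # R[(s, a)] is the immediate reward (deterministic here)
for s in STATES:
    # Stop: deterministic transition to Done, reward equal to the current total.
    T[(s, 'Stop')] = {DONE: 1.0}
    R[(s, 'Stop')] = s
    # Draw: each card with probability 1/3; totals >= 6 bust to Done. No reward.
    dist = {}
    for c in CARDS:
        nxt = s + c if s + c < 6 else DONE
        dist[nxt] = dist.get(nxt, 0.0) + 1 / 3
    T[(s, 'Draw')] = dist
    R[(s, 'Draw')] = 0

print(T[(2, 'Draw')])  # {4: 0.333..., 5: 0.333..., 'Done': 0.333...}
```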

b. Value Iteration

The goal of value iteration is to find the optimal value function $V^*$. We iterate the Bellman optimality equation, with $V(\text{Done}) = 0$:

$$V_{k+1}(s) = \max_{a \in A} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right]$$

  • $V_0$ (initialization): $V_0(s) = 0$ for all states $s$.
  • $V_1$ (iteration 1):
    • $V_1(0) = \max\{\text{Stop}: 0, \text{Draw}: \frac{1}{3}(0+0+0)\} = 0$
    • $V_1(2) = \max\{\text{Stop}: 2, \text{Draw}: \frac{1}{3}(0+0+0)\} = 2$
    • $V_1(3) = \max\{\text{Stop}: 3, \text{Draw}: \frac{1}{3}(0+0+0)\} = 3$
    • $V_1(4) = \max\{\text{Stop}: 4, \text{Draw}: 0\} = 4$
    • $V_1(5) = \max\{\text{Stop}: 5, \text{Draw}: 0\} = 5$
  • $V_2$ (iteration 2):
    • $V_2(0) = \max\{0, \frac{1}{3}(V_1(2)+V_1(3)+V_1(4))\} = \max\{0, \frac{1}{3}(2+3+4)\} = 3$
    • $V_2(2) = \max\{2, \frac{1}{3}(V_1(4)+V_1(5)+V_1(\text{Done}))\} = \max\{2, \frac{1}{3}(4+5+0)\} = 3$
    • $V_2(3) = \max\{3, \frac{1}{3}(V_1(5)+0+0)\} = \max\{3, \frac{5}{3}\} = 3$
    • $V_2(4) = \max\{4, 0\} = 4$
    • $V_2(5) = \max\{5, 0\} = 5$

The logic of each update is: from the current state $s$, executing one action leads to some successor $s'$. That successor is valued with the previous iteration's estimate, $V_k(s')$, when computing $V_{k+1}(s)$; the index is one lower because one action has been consumed, and that value was already computed in the previous sweep.

  • $V_3$ (iteration 3):
    • $V_3(0) = \max\{0, \frac{1}{3}(V_2(2)+V_2(3)+V_2(4))\} = \max\{0, \frac{1}{3}(3+3+4)\} = 10/3 \approx 3.33$
    • $V_3(2) = \max\{2, \frac{1}{3}(V_2(4)+V_2(5)+0)\} = \max\{2, \frac{1}{3}(4+5)\} = 3$
    • $V_3(3) = \max\{3, \frac{1}{3}(V_2(5)+0+0)\} = \max\{3, \frac{5}{3}\} = 3$
    • $V_3(4) = 4$
    • $V_3(5) = 5$
  • $V_4$ (iteration 4):
    • $V_4(0) = \max\{0, \frac{1}{3}(V_3(2)+V_3(3)+V_3(4))\} = \max\{0, \frac{1}{3}(3+3+4)\} = 10/3 \approx 3.33$
    • $V_4(2) = \max\{2, \frac{1}{3}(V_3(4)+V_3(5)+0)\} = 3$
    • $V_4(3) = \max\{3, \frac{1}{3}(V_3(5)+0+0)\} = 3$
    • $V_4(4) = 4$
    • $V_4(5) = 5$
| States | 0 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| $V_0$ | 0 | 0 | 0 | 0 | 0 |
| $V_1$ | 0 | 2 | 3 | 4 | 5 |
| $V_2$ | 3 | 3 | 3 | 4 | 5 |
| $V_3$ | 3.33 | 3 | 3 | 4 | 5 |
| $V_4$ | 3.33 | 3 | 3 | 4 | 5 |

At the 4th iteration the value function has converged ($V_4 = V_3$).
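
As a cross-check, here is a minimal value-iteration sketch for this MDP in plain Python (the helper name `draw_dist` is my own); it reproduces the table above.

```python
# Value iteration for micro-blackjack (gamma = 1, V(Done) = 0).
STATES = [0, 2, 3, 4, 5]

def draw_dist(s):
    """Successors of Draw: each card 2, 3, 4 with prob 1/3; totals >= 6 bust to 'Done'."""
    return [(1 / 3, s + c if s + c < 6 else 'Done') for c in (2, 3, 4)]

V = {s: 0.0 for s in STATES + ['Done']}  # V_0
for k in range(1, 5):
    V_new = {}
    for s in STATES:
        stop_value = s                                         # R(s, Stop, Done) = s, V(Done) = 0
        draw_value = sum(p * V[s2] for p, s2 in draw_dist(s))  # Draw has zero immediate reward
        V_new[s] = max(stop_value, draw_value)
    V.update(V_new)
    print(f"V_{k}:", {s: round(V[s], 2) for s in STATES})
# V_3 and V_4 agree: {0: 3.33, 2: 3, 3: 3, 4: 4, 5: 5}
```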

c. Optimal Policy

We extract the optimal policy $\pi^*$ from the converged value function $V^* = V_4$ (a short code sketch of this extraction step follows the rationale list below):

| States | 0 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| $\pi^*$ | Draw | Draw | Stop | Stop | Stop |

Rationale:

  • State 0: $\text{value(Draw)} \approx 3.33 > \text{value(Stop)} = 0$
  • State 2: $\text{value(Draw)} = 3 > \text{value(Stop)} = 2$
  • State 3: $\text{value(Stop)} = 3 > \text{value(Draw)} = \frac{1}{3}V^*(5) \approx 1.67$
  • State 4: $\text{value(Stop)} = 4 > \text{value(Draw)} = 0$
  • State 5: $\text{value(Stop)} = 5 > \text{value(Draw)} = 0$
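
A minimal sketch of this extraction step, assuming the converged values from the table in part (b); the names `V_star` and `q_star` are mine.

```python
# Greedy policy extraction: pi*(s) = argmax_a Q*(s, a), using the converged V*.
V_star = {0: 10 / 3, 2: 3, 3: 3, 4: 4, 5: 5, 'Done': 0}

def q_star(s, a):
    if a == 'Stop':
        return s  # Q*(s, Stop) = R(s, Stop, Done) + V*(Done) = s
    # Q*(s, Draw): cards 2, 3, 4 each with prob 1/3; totals >= 6 bust to Done.
    return sum(V_star[s + c if s + c < 6 else 'Done'] / 3 for c in (2, 3, 4))

pi_star = {s: max(['Stop', 'Draw'], key=lambda a: q_star(s, a)) for s in (0, 2, 3, 4, 5)}
print(pi_star)  # {0: 'Draw', 2: 'Draw', 3: 'Stop', 4: 'Stop', 5: 'Stop'}
```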

d. Policy Iteration

Starting from the given initial policy $\pi_i$, perform one iteration.

Initial policy $\pi_i$:

| States | 0 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| $\pi_i$ | Draw | Stop | Draw | Stop | Draw |

i. Policy Evaluation

Compute the value function $V^{\pi_i}$ under the fixed policy $\pi_i$. This requires solving a system of linear equations (a numerical solve is sketched after the table below):

$$V^{\pi}(s) = \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^{\pi}(s') \right]$$

  • $V(0) = \frac{1}{3}V(2) + \frac{1}{3}V(3) + \frac{1}{3}V(4)$
  • $V(2) = 2$
  • $V(3) = \frac{1}{3}V(5)$
  • $V(4) = 4$
  • $V(5) = 0$

Note that the $V^{\pi_i}$ of policy evaluation is not the same thing as the $V_i$ of value iteration! Policy evaluation computes the value of following the action already prescribed by the policy in each state; in value iteration the action is not fixed, so each update evaluates every action's expected value and takes the maximum.

Solving by back-substitution: $V(5) = 0 \implies V(3) = 0 \implies V(0) = \frac{1}{3}(2 + 0 + 4) = 2$.

| States | 0 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| $V^{\pi_i}$ | 2 | 2 | 0 | 4 | 0 |
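
One way to carry out this evaluation numerically (rather than by hand substitution) is to solve the linear system $(I - \gamma P_{\pi})V = r_{\pi}$ directly. A sketch with NumPy, using my own encoding of $\pi_i$:

```python
import numpy as np

# Policy evaluation for the fixed policy pi_i by solving (I - gamma * P_pi) V = r_pi, gamma = 1.
STATES = [0, 2, 3, 4, 5]
idx = {s: i for i, s in enumerate(STATES)}
pi_i = {0: 'Draw', 2: 'Stop', 3: 'Draw', 4: 'Stop', 5: 'Draw'}

P = np.zeros((5, 5))   # transitions among non-terminal states under pi_i
r = np.zeros(5)        # expected immediate reward under pi_i
for s in STATES:
    if pi_i[s] == 'Stop':
        r[idx[s]] = s                            # Stop: reward s, then Done (terminal, value 0)
    else:
        for c in (2, 3, 4):                      # Draw: each card with prob 1/3, zero reward
            if s + c < 6:
                P[idx[s], idx[s + c]] += 1 / 3   # busts go to Done and drop out of the system

V = np.linalg.solve(np.eye(5) - P, r)
print(dict(zip(STATES, V.round(2).tolist())))    # {0: 2.0, 2: 2.0, 3: 0.0, 4: 4.0, 5: 0.0}
```
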
ii. Policy Improvement

Based on $V^{\pi_i}$, find a possibly better action for each state, forming the new policy $\pi_{i+1}$:

$$\pi_{i+1}(s) = \underset{a \in A}{\operatorname{argmax}} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{\pi_i}(s') \right]$$

The $V^{\pi_i}$ used in policy improvement is exactly the one computed in policy evaluation; the improvement step simply performs a one-step lookahead with every candidate action against that fixed value function and swaps in the new action wherever it beats the old policy's choice (see the sketch after the table below).

  • State 0: $\max\{\text{Stop}: 0, \text{Draw}: \frac{1}{3}(2+0+4) = 2\} \implies \text{Draw}$
  • State 2: $\max\{\text{Stop}: 2, \text{Draw}: \frac{1}{3}(4+0+0) \approx 1.33\} \implies \text{Stop}$
  • State 3: $\max\{\text{Stop}: 3, \text{Draw}: \frac{1}{3}(0+0+0) = 0\} \implies \text{Stop}$
  • State 4: $\max\{\text{Stop}: 4, \text{Draw}: 0\} \implies \text{Stop}$
  • State 5: $\max\{\text{Stop}: 5, \text{Draw}: 0\} \implies \text{Stop}$
| States | 0 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| $\pi_{i+1}$ | Draw | Stop | Stop | Stop | Stop |
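
A minimal sketch of this improvement step, plugging the $V^{\pi_i}$ from part (i) into a one-step greedy lookahead; the names `V_pi` and `q_under_V` are mine.

```python
# Policy improvement: pi_{i+1}(s) = argmax_a sum_s' T(s,a,s') [R(s,a,s') + V^{pi_i}(s')].
V_pi = {0: 2, 2: 2, 3: 0, 4: 4, 5: 0, 'Done': 0}  # values from the policy-evaluation step

def q_under_V(s, a):
    if a == 'Stop':
        return s  # reward s, then the terminal Done state (value 0)
    # Draw: cards 2, 3, 4 each with prob 1/3; totals >= 6 bust to Done.
    return sum(V_pi[s + c if s + c < 6 else 'Done'] / 3 for c in (2, 3, 4))

pi_next = {s: max(['Stop', 'Draw'], key=lambda a: q_under_V(s, a)) for s in (0, 2, 3, 4, 5)}
print(pi_next)  # {0: 'Draw', 2: 'Stop', 3: 'Stop', 4: 'Stop', 5: 'Stop'}
```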