MDP-calculation-exercise

Tags: MDP Re-Le
Categories: Introduction to Artificial Intelligence

discussion3

exercise1

In micro-blackjack, you repeatedly draw a card (with replacement) that is equally likely to be a 2, 3, or 4. You can either Draw or Stop if the total score of the cards you have drawn is less than 6. If your total score is 6 or higher, the game ends, and you receive a utility of 0. When you Stop, your utility is equal to your total score (up to 5), and the game ends. When you Draw, you receive no utility. There is no discount ($\gamma = 1$). Let's formulate this problem as an MDP with the following states: 0, 2, 3, 4, 5 and a Done state, for when the game ends.

a. Transition and Reward Functions

First, we define the basic components of the MDP:

  • States: $S = \{0, 2, 3, 4, 5, \text{Done}\}$, where Done is the terminal state.
  • Actions: $A(s) = \{\text{Draw}, \text{Stop}\}$ for $s < 6$.
  • Discount factor: $\gamma = 1$.

Reward function $R(s, a, s')$

The reward function specifies the immediate reward received for taking action $a$ in state $s$ and landing in state $s'$.

  1. Action Draw: drawing a card yields no immediate reward, regardless of the outcome: $R(s, \text{Draw}, s') = 0$.
  2. Action Stop: the player receives a reward equal to the current total and the game ends: $R(s, \text{Stop}, \text{Done}) = s$.

Transition function $T(s, a, s')$

The transition function specifies the probability of landing in state $s'$ after taking action $a$ in state $s$.

  1. Action Stop: the game ends immediately, transitioning deterministically to Done: $T(s, \text{Stop}, \text{Done}) = 1$.
  2. Action Draw: a 2, 3, or 4 is drawn, each with probability $\frac{1}{3}$. If the new total $s' \ge 6$, the player busts and the state transitions to Done. (These tables are encoded compactly in the sketch after this list.)
    • From state 0:
      • $T(0, \text{Draw}, 2) = 1/3$
      • $T(0, \text{Draw}, 3) = 1/3$
      • $T(0, \text{Draw}, 4) = 1/3$
    • From state 2:
      • $T(2, \text{Draw}, 4) = 1/3$
      • $T(2, \text{Draw}, 5) = 1/3$
      • $T(2, \text{Draw}, \text{Done}) = 1/3$ (bust, since $2+4=6$)
    • From state 3:
      • $T(3, \text{Draw}, 5) = 1/3$
      • $T(3, \text{Draw}, \text{Done}) = 2/3$ (bust, since $3+3=6$ and $3+4=7$)
    • From state 4:
      • $T(4, \text{Draw}, \text{Done}) = 1$ (every card busts)
    • From state 5:
      • $T(5, \text{Draw}, \text{Done}) = 1$ (every card busts)
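
To make the formulation concrete, here is a minimal sketch of the same model as plain Python dictionaries. The names `T`, `R`, `DONE`, `STATES`, and `CARDS` are my own choices, not part of the exercise.

```python
# Micro-blackjack MDP from part (a): states 0, 2, 3, 4, 5 plus a terminal state.
DONE = 'Done'
STATES = [0, 2, 3, 4, 5]
CARDS = [2, 3, 4]  # each drawn with probability 1/3

T = {}  # T[(s, a)] maps successor state -> probability
R = {}  # R[(s, a)] is the immediate reward (deterministic here)
for s in STATES:
    # Stop: deterministic transition to Done, reward equal to the current total.
    T[(s, 'Stop')] = {DONE: 1.0}
    R[(s, 'Stop')] = s
    # Draw: each card with probability 1/3; totals >= 6 bust to Done. No reward.
    dist = {}
    for c in CARDS:
        nxt = s + c if s + c < 6 else DONE
        dist[nxt] = dist.get(nxt, 0.0) + 1 / 3
    T[(s, 'Draw')] = dist
    R[(s, 'Draw')] = 0

print(T[(2, 'Draw')])  # {4: 0.333..., 5: 0.333..., 'Done': 0.333...}
```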

b. Value Iteration

The goal of value iteration is to find the optimal value function $V^*$. We iterate the Bellman optimality equation, with $V(\text{Done}) = 0$:

$$V_{k+1}(s) = \max_{a \in A} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right]$$

  • $V_0$ (initialization): $V_0(s) = 0$ for all states $s$.
  • $V_1$ (iteration 1):
    • $V_1(0) = \max\{\text{Stop}: 0, \text{Draw}: \frac{1}{3}(0+0+0)\} = 0$
    • $V_1(2) = \max\{\text{Stop}: 2, \text{Draw}: \frac{1}{3}(0+0+0)\} = 2$
    • $V_1(3) = \max\{\text{Stop}: 3, \text{Draw}: \frac{1}{3}(0+0+0)\} = 3$
    • $V_1(4) = \max\{\text{Stop}: 4, \text{Draw}: 0\} = 4$
    • $V_1(5) = \max\{\text{Stop}: 5, \text{Draw}: 0\} = 5$
  • $V_2$ (iteration 2):
    • $V_2(0) = \max\{0, \frac{1}{3}(V_1(2)+V_1(3)+V_1(4))\} = \max\{0, \frac{1}{3}(2+3+4)\} = 3$
    • $V_2(2) = \max\{2, \frac{1}{3}(V_1(4)+V_1(5)+V_1(\text{Done}))\} = \max\{2, \frac{1}{3}(4+5+0)\} = 3$
    • $V_2(3) = \max\{3, \frac{1}{3}(V_1(5)+0+0)\} = \max\{3, \frac{5}{3}\} = 3$
    • $V_2(4) = \max\{4, 0\} = 4$
    • $V_2(5) = \max\{5, 0\} = 5$

The logic of each update is: from the current state $s$, executing one action leads to some successor $s'$. That successor is valued with the previous iteration's estimate, $V_k(s')$, when computing $V_{k+1}(s)$; the index is one lower because one action has been consumed, and that value was already computed in the previous sweep.

  • $V_3$ (iteration 3):
    • $V_3(0) = \max\{0, \frac{1}{3}(V_2(2)+V_2(3)+V_2(4))\} = \max\{0, \frac{1}{3}(3+3+4)\} = 10/3 \approx 3.33$
    • $V_3(2) = \max\{2, \frac{1}{3}(V_2(4)+V_2(5)+0)\} = \max\{2, \frac{1}{3}(4+5)\} = 3$
    • $V_3(3) = \max\{3, \frac{1}{3}(V_2(5)+0+0)\} = \max\{3, \frac{5}{3}\} = 3$
    • $V_3(4) = 4$
    • $V_3(5) = 5$
  • $V_4$ (iteration 4):
    • $V_4(0) = \max\{0, \frac{1}{3}(V_3(2)+V_3(3)+V_3(4))\} = \max\{0, \frac{1}{3}(3+3+4)\} = 10/3 \approx 3.33$
    • $V_4(2) = \max\{2, \frac{1}{3}(V_3(4)+V_3(5)+0)\} = 3$
    • $V_4(3) = \max\{3, \frac{1}{3}(V_3(5)+0+0)\} = 3$
    • $V_4(4) = 4$
    • $V_4(5) = 5$
| States | 0 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| $V_0$ | 0 | 0 | 0 | 0 | 0 |
| $V_1$ | 0 | 2 | 3 | 4 | 5 |
| $V_2$ | 3 | 3 | 3 | 4 | 5 |
| $V_3$ | 3.33 | 3 | 3 | 4 | 5 |
| $V_4$ | 3.33 | 3 | 3 | 4 | 5 |

At the 4th iteration the value function has converged ($V_4 = V_3$).
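
As a cross-check, here is a minimal value-iteration sketch for this MDP in plain Python (the helper name `draw_dist` is my own); it reproduces the table above.

```python
# Value iteration for micro-blackjack (gamma = 1, V(Done) = 0).
STATES = [0, 2, 3, 4, 5]

def draw_dist(s):
    """Successors of Draw: each card 2, 3, 4 with prob 1/3; totals >= 6 bust to 'Done'."""
    return [(1 / 3, s + c if s + c < 6 else 'Done') for c in (2, 3, 4)]

V = {s: 0.0 for s in STATES + ['Done']}  # V_0
for k in range(1, 5):
    V_new = {}
    for s in STATES:
        stop_value = s                                         # R(s, Stop, Done) = s, V(Done) = 0
        draw_value = sum(p * V[s2] for p, s2 in draw_dist(s))  # Draw has zero immediate reward
        V_new[s] = max(stop_value, draw_value)
    V.update(V_new)
    print(f"V_{k}:", {s: round(V[s], 2) for s in STATES})
# V_3 and V_4 agree: {0: 3.33, 2: 3, 3: 3, 4: 4, 5: 5}
```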

c. Optimal Policy

We extract the optimal policy $\pi^*$ from the converged value function $V^* = V_4$ (a short code sketch of this extraction step follows the rationale list below):

| States | 0 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| $\pi^*$ | Draw | Draw | Stop | Stop | Stop |

Rationale:

  • State 0: $\text{value(Draw)} \approx 3.33 > \text{value(Stop)} = 0$
  • State 2: $\text{value(Draw)} = 3 > \text{value(Stop)} = 2$
  • State 3: $\text{value(Stop)} = 3 > \text{value(Draw)} = \frac{1}{3}V^*(5) \approx 1.67$
  • State 4: $\text{value(Stop)} = 4 > \text{value(Draw)} = 0$
  • State 5: $\text{value(Stop)} = 5 > \text{value(Draw)} = 0$
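
A minimal sketch of this extraction step, assuming the converged values from the table in part (b); the names `V_star` and `q_star` are mine.

```python
# Greedy policy extraction: pi*(s) = argmax_a Q*(s, a), using the converged V*.
V_star = {0: 10 / 3, 2: 3, 3: 3, 4: 4, 5: 5, 'Done': 0}

def q_star(s, a):
    if a == 'Stop':
        return s  # Q*(s, Stop) = R(s, Stop, Done) + V*(Done) = s
    # Q*(s, Draw): cards 2, 3, 4 each with prob 1/3; totals >= 6 bust to Done.
    return sum(V_star[s + c if s + c < 6 else 'Done'] / 3 for c in (2, 3, 4))

pi_star = {s: max(['Stop', 'Draw'], key=lambda a: q_star(s, a)) for s in (0, 2, 3, 4, 5)}
print(pi_star)  # {0: 'Draw', 2: 'Draw', 3: 'Stop', 4: 'Stop', 5: 'Stop'}
```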

d. Policy Iteration

Starting from the given initial policy $\pi_i$, perform one iteration.

Initial policy $\pi_i$:

| States | 0 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| $\pi_i$ | Draw | Stop | Draw | Stop | Draw |

i. Policy Evaluation

Compute the value function $V^{\pi_i}$ under the fixed policy $\pi_i$. This requires solving a system of linear equations (a numerical solve is sketched after the table below):

$$V^{\pi}(s) = \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^{\pi}(s') \right]$$

  • $V(0) = \frac{1}{3}V(2) + \frac{1}{3}V(3) + \frac{1}{3}V(4)$
  • $V(2) = 2$
  • $V(3) = \frac{1}{3}V(5)$
  • $V(4) = 4$
  • $V(5) = 0$

Note that the $V^{\pi_i}$ of policy evaluation is not the same thing as the $V_i$ of value iteration! Policy evaluation computes the value of following the action already prescribed by the policy in each state; in value iteration the action is not fixed, so each update evaluates every action's expected value and takes the maximum.

Solving by back-substitution: $V(5) = 0 \implies V(3) = 0 \implies V(0) = \frac{1}{3}(2 + 0 + 4) = 2$.

| States | 0 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| $V^{\pi_i}$ | 2 | 2 | 0 | 4 | 0 |
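
One way to carry out this evaluation numerically (rather than by hand substitution) is to solve the linear system $(I - \gamma P_{\pi})V = r_{\pi}$ directly. A sketch with NumPy, using my own encoding of $\pi_i$:

```python
import numpy as np

# Policy evaluation for the fixed policy pi_i by solving (I - gamma * P_pi) V = r_pi, gamma = 1.
STATES = [0, 2, 3, 4, 5]
idx = {s: i for i, s in enumerate(STATES)}
pi_i = {0: 'Draw', 2: 'Stop', 3: 'Draw', 4: 'Stop', 5: 'Draw'}

P = np.zeros((5, 5))   # transitions among non-terminal states under pi_i
r = np.zeros(5)        # expected immediate reward under pi_i
for s in STATES:
    if pi_i[s] == 'Stop':
        r[idx[s]] = s                            # Stop: reward s, then Done (terminal, value 0)
    else:
        for c in (2, 3, 4):                      # Draw: each card with prob 1/3, zero reward
            if s + c < 6:
                P[idx[s], idx[s + c]] += 1 / 3   # busts go to Done and drop out of the system

V = np.linalg.solve(np.eye(5) - P, r)
print(dict(zip(STATES, V.round(2).tolist())))    # {0: 2.0, 2: 2.0, 3: 0.0, 4: 4.0, 5: 0.0}
```
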
ii. Policy Improvement

Based on $V^{\pi_i}$, find a possibly better action for each state, forming the new policy $\pi_{i+1}$:

$$\pi_{i+1}(s) = \underset{a \in A}{\operatorname{argmax}} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{\pi_i}(s') \right]$$

The $V^{\pi_i}$ used in policy improvement is exactly the one computed in policy evaluation; the improvement step simply performs a one-step lookahead with every candidate action against that fixed value function and swaps in the new action wherever it beats the old policy's choice (see the sketch after the table below).

  • State 0: $\max\{\text{Stop}: 0, \text{Draw}: \frac{1}{3}(2+0+4) = 2\} \implies \text{Draw}$
  • State 2: $\max\{\text{Stop}: 2, \text{Draw}: \frac{1}{3}(4+0+0) \approx 1.33\} \implies \text{Stop}$
  • State 3: $\max\{\text{Stop}: 3, \text{Draw}: \frac{1}{3}(0+0+0) = 0\} \implies \text{Stop}$
  • State 4: $\max\{\text{Stop}: 4, \text{Draw}: 0\} \implies \text{Stop}$
  • State 5: $\max\{\text{Stop}: 5, \text{Draw}: 0\} \implies \text{Stop}$
| States | 0 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| $\pi_{i+1}$ | Draw | Stop | Stop | Stop | Stop |
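
A minimal sketch of this improvement step, plugging the $V^{\pi_i}$ from part (i) into a one-step greedy lookahead; the names `V_pi` and `q_under_V` are mine.

```python
# Policy improvement: pi_{i+1}(s) = argmax_a sum_s' T(s,a,s') [R(s,a,s') + V^{pi_i}(s')].
V_pi = {0: 2, 2: 2, 3: 0, 4: 4, 5: 0, 'Done': 0}  # values from the policy-evaluation step

def q_under_V(s, a):
    if a == 'Stop':
        return s  # reward s, then the terminal Done state (value 0)
    # Draw: cards 2, 3, 4 each with prob 1/3; totals >= 6 bust to Done.
    return sum(V_pi[s + c if s + c < 6 else 'Done'] / 3 for c in (2, 3, 4))

pi_next = {s: max(['Stop', 'Draw'], key=lambda a: q_under_V(s, a)) for s in (0, 2, 3, 4, 5)}
print(pi_next)  # {0: 'Draw', 2: 'Stop', 3: 'Stop', 4: 'Stop', 5: 'Stop'}
```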