Reinforcement Learning | 01 Objective Functions

Two Important Functions in Reinforcement Learning

The action-value function $Q_\pi(s, a)$ measures the expected discounted return that an agent obtains under a given policy $\pi$ when it starts in state $s$ and takes a specific action $a$. (If I choose action $a$ in state $s$ and then strictly follow policy $\pi$ from the next step onward, how much total return can I expect?)

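Relationship between the state-value function and the action-value function:

The state-value function $V_\pi(s)$ is the weighted average of the $Q$ values of all possible actions in state $s$, where the weights are the probabilities with which policy $\pi$ selects those actions; that is, $V_\pi(s) = \sum_a \pi(a \mid s) Q_\pi(s, a)$, the expectation of $Q_\pi(s, a)$ under $\pi$. In other words, the value of a state equals the average value of the actions chosen there according to $\pi$.
$Q_\pi(s, a)$ can in turn be defined recursively in terms of $V_\pi$, which is the core of the Bellman equation for the $Q$ function: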

Q_\pi(s, a) = \mathbb{E} [R_{t+1} + \gamma V_\pi(S_{t+1}) | S_t=s, A_t=a]

That is, the value $Q_\pi(s, a)$ of taking action $a$ in state $s$ equals the expected immediate reward $R_{t+1}$ plus the expected discounted value $\gamma V_\pi(S_{t+1})$ of the next state $S_{t+1}$.
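The two relations above can be checked on a toy problem. The following is a minimal sketch, assuming a hypothetical two-state, two-action MDP whose transition probabilities, rewards, and policy are made up for illustration (none of them come from the text). It alternates between the Bellman relation for $Q_\pi$ and the policy-weighted average for $V_\pi$ until the values stop changing.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
GAMMA = 0.9
STATES, ACTIONS = (0, 1), (0, 1)
# P[s][a] = list of (next_state, probability); R[s][a] = expected immediate reward
P = {0: {0: [(0, 0.8), (1, 0.2)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)], 1: [(1, 0.5), (0, 0.5)]}}
R = {0: {0: 1.0, 1: 0.0}, 1: {0: 2.0, 1: -1.0}}
# PI[s][a] = probability that the policy picks action a in state s
PI = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.3, 1: 0.7}}


def q_from_v(v):
    """Q(s, a) = E[R + gamma * V(S')], the Bellman relation."""
    return {s: {a: R[s][a] + GAMMA * sum(p * v[s2] for s2, p in P[s][a])
                for a in ACTIONS} for s in STATES}


def v_from_q(q):
    """V(s) = sum_a pi(a|s) * Q(s, a), the policy-weighted average."""
    return {s: sum(PI[s][a] * q[s][a] for a in ACTIONS) for s in STATES}


def evaluate_policy(iterations=500):
    """Apply the two relations repeatedly until V has converged."""
    v = {s: 0.0 for s in STATES}
    for _ in range(iterations):
        v = v_from_q(q_from_v(v))
    return v, q_from_v(v)


V, Q = evaluate_policy()
print(V, Q)
```

At convergence, `V` and `Q` satisfy both relations at once, which is what makes the recursive definition well posed.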

Vanilla Policy Gradient

Policy gradient is among the most basic families of methods in reinforcement learning; it directly learns and optimizes the policy $\pi_\theta$.

Training

In the $i$-th training iteration, the algorithm uses the current policy network with parameters $\theta^{i-1}$ obtained from the previous iteration. This policy $\pi_{\theta^{i-1}}$ interacts with the environment, usually by sampling a complete trajectory (episode) or a fixed number of steps $N$.
Action selection at each step: At each environment step, the policy network $\pi_{\theta^{i-1}}$ takes the current state $s_t$ as input and samples or selects¹ an action $a_t$ according to $\pi_{\theta^{i-1}}(a \mid s_t)$.

Execution and recording: The selected action $a_t$ is sent to the environment, which returns a new state $s_{t+1}$ and a reward $r_t$. The state-action pair $(s_t, a_t)$ is recorded together with the reward needed for later computations (for example, the advantage function). The resulting dataset is $\{(s_1, a_1), (s_2, a_2), \ldots, (s_N, a_N)\}$.
Compute the objective function and maximize it by gradient ascent (maximizing $J(\theta)$ amounts to maximizing the surrogate $L^{\text{PG}}(\theta)$), adjusting the policy parameters $\theta$ so that better actions $a_t$ become more probable in state $s_t$, which raises the overall cumulative reward.

\theta \leftarrow \theta + \eta \nabla L^{\text{PG}}

Policy Gradient Theorem

The original goal of reinforcement learning is a policy that yields a larger total return on average, that is, to maximize the expected cumulative reward under policy $\pi_\theta$. This is equivalent to the state value $V_{\pi_\theta}(s_0)$ of the initial state $s_0$ under $\pi_\theta$:

J(\theta) = V_{\pi_\theta}(s_0) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^{T} \gamma^t r_t \right] \coloneqq \mathbb{E}_{\tau \sim P_\theta(\tau)} \left[ G(\tau) \right]

(Here $G(\tau)$ denotes the cumulative reward $\sum_{t=0}^{T} \gamma^t r_t$ along trajectory $\tau$.)

$\pi_\theta$ generates a vast number of trajectories $\tau$, each with its own probability (the distribution is denoted $P_\theta(\tau)$), and the cumulative reward $G(\tau)$ differs from trajectory to trajectory, so $J(\theta)$ is a very complicated expectation. Although it is hard to compute exactly, we can approximate it without bias using the Monte Carlo method, which the law of large numbers justifies.
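As a rough illustration of the Monte Carlo approximation, the sketch below averages $G(\tau)$ over sampled trajectories. The function `toy_rollout` is a hypothetical stand-in for running $\pi_\theta$ in an environment (it just flips a coin for each reward), and the names `discounted_return` and `estimate_j` are chosen here for illustration.

```python
import random


def discounted_return(rewards, gamma):
    """G(tau) = sum_t gamma^t * r_t for the reward list of one trajectory."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def estimate_j(sample_rewards, gamma, num_trajectories, rng):
    """Monte Carlo estimate of J: the average of G(tau) over sampled trajectories."""
    total = 0.0
    for _ in range(num_trajectories):
        total += discounted_return(sample_rewards(rng), gamma)
    return total / num_trajectories


def toy_rollout(rng, horizon=3, p_reward=0.5):
    # Hypothetical stand-in for "run pi_theta in the environment":
    # each step yields reward 1 with probability p_reward, otherwise 0.
    return [1.0 if rng.random() < p_reward else 0.0 for _ in range(horizon)]


rng = random.Random(0)
j_hat = estimate_j(toy_rollout, gamma=0.9, num_trajectories=20000, rng=rng)
# Exact expectation for this toy rollout: 0.5 * (1 + 0.9 + 0.81) = 1.355
print(j_hat)
```

The estimate fluctuates around the exact value, and the fluctuation shrinks as the number of trajectories grows.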

In principle, we can maximize this objective by computing $\nabla J(\theta)$ and applying gradient ascent:

\displaystyle{ \nabla J \left( \theta \right) = \nabla \sum _{ \tau } P _{ \theta } \left( \tau \right) G \left( \tau \right) = \sum _{ \tau } \nabla P _{ \theta } \left( \tau \right) G \left( \tau \right) }

The problem is that the probability $P_\theta(\tau)$ of a complete trajectory $\tau$ is a product, over all steps, of the environment transition probabilities $P(s_{t+1} \mid s_t, a_t)$ and the agent's action-selection probabilities $\pi_\theta(a_t \mid s_t)$:

P_\theta(\tau) = P(s_0) \prod_{t=0}^{T} P(s_{t+1} \mid s_t, a_t) \, \pi_\theta(a_t \mid s_t)

The transition probabilities are unknown. Even if we write out $P_\theta(\tau)$ in full, the product rule leaves them in the analytic derivative, so its value cannot be evaluated. In other words, $\nabla P_\theta(\tau)$ cannot be computed directly.

To sidestep the transition probabilities and their derivatives, note that rewriting $P_\theta(\tau)$ as $\log P_\theta(\tau)$ turns the product into a sum. Because the transition probabilities do not depend on $\theta$, they then vanish from the derivative.

Note also that $\nabla \log P = \frac{\partial \log P}{\partial \theta} = \frac{\partial \log P}{\partial P} \frac{\partial P}{\partial \theta} = \frac{\nabla P}{P}$, which gives

\nabla P_\theta(\tau) = P_\theta(\tau) \frac{\nabla P_\theta(\tau)}{P_\theta(\tau)} = P_\theta(\tau) \nabla \log P_\theta(\tau)

Substituting this into $\nabla J(\theta) = \sum_\tau \nabla P_\theta(\tau) G(\tau)$ conveniently turns $\nabla J(\theta)$ into an expectation:

\nabla J(\theta) = \mathbb{E}_{\tau \sim P_\theta(\tau)} \left[ \nabla \log P_\theta(\tau) \, G(\tau) \right]

The trajectory log-probability decomposes into per-step terms, and the terms that do not depend on $\theta$ (the initial-state and transition probabilities) have zero gradient:

\nabla J(\theta) = \mathbb{E}_{\tau \sim P_\theta(\tau)} \left[ \left( \sum_{t=0}^{T} \nabla \log \pi_\theta(a_t \mid s_t) \right) G(\tau) \right]

At the same time, consider causality: although $G(\tau)$ is the return of the whole trajectory, the action $a_t$ taken at time step $t$ can only affect rewards that come after it, not those before it. For the event $(s_t, a_t)$ at time step $t$, we therefore only need the future return from time $t$ onward, the reward-to-go $G_t$.

\nabla J(\theta) = \mathbb{E}_{\tau \sim P_\theta(\tau)} \left[ \sum_{t=0}^{T} \nabla \log \pi_\theta(a_t \mid s_t) \, G_t \right]

This avoids the transition probabilities and their derivatives altogether. We can now approximate this expectation without bias by Monte Carlo sampling and thus obtain an estimate of $\nabla J(\theta)$.
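To make the estimator concrete, here is a minimal sketch on a one-step problem (a bandit), assuming a softmax policy over three actions whose logits play the role of $\theta$. The logits and rewards are made up. In this setting the exact gradient has a closed form, so the Monte Carlo estimate built from $\nabla \log \pi_\theta(a) \, G$ can be compared against it.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_pi(probs, action):
    """Gradient of log pi(action) w.r.t. the logits: onehot(action) - pi."""
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def exact_gradient(logits, rewards):
    """Closed form for a one-step problem: dJ/dtheta_k = pi_k * (r_k - J)."""
    probs = softmax(logits)
    j = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - j) for p, r in zip(probs, rewards)]


def score_function_gradient(logits, rewards, num_samples, rng):
    """Monte Carlo estimate: mean of grad log pi(a) * G over sampled actions."""
    probs = softmax(logits)
    actions = list(range(len(probs)))
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        a = rng.choices(actions, weights=probs)[0]
        for k, g in enumerate(grad_log_pi(probs, a)):
            grad[k] += g * rewards[a] / num_samples
    return grad


logits, rewards = [0.0, 0.5, -0.5], [1.0, 2.0, 4.0]  # made-up values
estimate = score_function_gradient(logits, rewards, 50000, random.Random(0))
exact = exact_gradient(logits, rewards)
print(estimate, exact)
```

Note that the estimator only ever evaluates the policy's own log-probabilities; nothing about the environment's dynamics is differentiated.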

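This expression turns out to be exactly the gradient of another function, provided the returns $G_t$ are treated as constants:

L^{\text{PG}}(\theta) = \mathbb{E}_\tau \left[ \sum_{t=0}^{T} \log \pi_{\theta}(a_t \mid s_t) \, G_t \right]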

In practice, this is therefore the objective we optimize. Maximizing it indirectly maximizes the original expected cumulative reward $J(\theta)$.

The derivation above is known as the Policy Gradient Theorem.

Advantage Function

Even with $G_t$, the variance of the Monte Carlo estimate is still very large, because $G_t$ fluctuates strongly from one sampled trajectory to the next. High variance makes training unstable and convergence slow.

For any function $b(s_t)$ that does not depend on the action $a_t$, the following identity holds:

\mathbb{E}_{\pi_\theta} [\nabla \log \pi_\theta(a_t|s_t) b(s_t)] = 0

This shows that subtracting $b(s_t)$ from the weight in the gradient does not change its expectation (the estimator stays unbiased).
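The identity follows in one line for a discrete action space. The sketch below writes the expectation over $a_t$ explicitly for a fixed state $s_t$; for a continuous action space the sum would be replaced by an integral.

$$
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)} \left[ \nabla \log \pi_\theta(a_t \mid s_t) \, b(s_t) \right] = b(s_t) \sum_{a} \pi_\theta(a \mid s_t) \frac{\nabla \pi_\theta(a \mid s_t)}{\pi_\theta(a \mid s_t)} = b(s_t) \, \nabla \sum_{a} \pi_\theta(a \mid s_t) = b(s_t) \, \nabla 1 = 0
$$

The key step is that $b(s_t)$ can be pulled out of the sum precisely because it does not depend on the action.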

To reduce variance without changing the expected gradient $\nabla J(\theta)$, we can therefore introduce a baseline function $b(s_t)$ and replace the weight $G_t$ with $G_t - b(s_t)$. This removes the random fluctuations in $G_t$ that depend only on the state and not on the chosen action.

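A natural and widely used baseline is the state-value function $V_{\pi_\theta}(s_t)$. It is close to, though not exactly, the variance-minimizing baseline; the derivation of the exact optimum is involved.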

$V_{\pi_\theta}(s_t)$ is the long-term return the agent obtains on average from state $s_t$, so we replace $G_t$ with $(G_t - V_{\pi_\theta}(s_t))$.
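The effect of the baseline can be seen numerically. The sketch below assumes a one-step problem with a softmax policy and made-up rewards that share a large common offset. It enumerates all actions to compute the exact mean and total variance of the single-sample estimator $\nabla \log \pi_\theta(a) \, (r_a - b)$, once with $b = 0$ and once with $b$ set to the state value.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def estimator_mean_and_variance(logits, rewards, baseline):
    """Exact mean and total variance of grad log pi(a) * (r_a - baseline),
    obtained by enumerating every action with its probability."""
    probs = softmax(logits)
    n = len(probs)
    mean = [0.0] * n
    second_moment = 0.0
    for a, p in enumerate(probs):
        score = [(1.0 if k == a else 0.0) - q for k, q in enumerate(probs)]
        weight = rewards[a] - baseline
        for k in range(n):
            mean[k] += p * score[k] * weight
        second_moment += p * sum((s * weight) ** 2 for s in score)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance


logits = [0.0, 0.5, -0.5]
rewards = [11.0, 12.0, 14.0]  # made-up rewards with a large common offset
# State value of this one-step problem: the policy-weighted mean reward.
value = sum(p * r for p, r in zip(softmax(logits), rewards))

mean_plain, var_plain = estimator_mean_and_variance(logits, rewards, baseline=0.0)
mean_base, var_base = estimator_mean_and_variance(logits, rewards, baseline=value)
print(var_plain, var_base)
```

Both settings give the same mean gradient, while the variance with the baseline is far smaller, because the common offset in the rewards no longer multiplies the score.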

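Summary: Objective Function

The objective function is therefore

\text{maximize } L^{\text{PG}}(\theta) = \mathbb{E}_{t} [\log \pi_{\theta}(a_t|s_t) \hat{A}_t]

where

\hat{A}_t = G_t - b

G_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}

$\hat{A}_t$ is the advantage function (more precisely, an estimate of it); it tells us how much better taking action $a_t$ in state $s_t$ is than the average action.
In the advantage function, $G_t$ is the cumulative reward actually observed from the current time step until the end of the episode. It combines the per-step rewards $r_{t+k}$ collected at each future step with a discount factor $\gamma < 1$, and it estimates the total return still to come. Past rewards are unchangeable history and are irrelevant to the value of the current decision. The discount factor ensures that immediate rewards count for more than rewards far in the future: the present is worth more than the future.
$b$ is a baseline, usually the mean of $G_t$ or some other estimate. When the advantage is positive, the action was better than average, and maximizing the objective raises its probability; when the advantage is negative, maximizing the objective lowers its probability. The purpose of the baseline is to reduce the variance of the gradient estimate and so make training more stable.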
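The quantities in this summary are simple to compute from one recorded episode. The sketch below is a minimal illustration with made-up rewards; it uses the mean return as the baseline, which is only one of the choices mentioned above, and the function names are chosen here for illustration.

```python
def rewards_to_go(rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+k}, computed right to left in one pass."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantages_from_returns(returns):
    """A_t = G_t - b, with b taken as the mean return (one simple baseline)."""
    b = sum(returns) / len(returns)
    return [g - b for g in returns]


returns = rewards_to_go([1.0, 0.0, 2.0], gamma=0.5)
adv = advantages_from_returns(returns)
print(returns)  # [1.5, 1.0, 2.0]
print(adv)      # [0.0, -0.5, 0.5]
```

Steps with a positive entry in `adv` have their log-probability pushed up by gradient ascent on the objective, and steps with a negative entry have it pushed down.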

Importance Sampling

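On/off-policy and Importance Sampling

On-policy means that the agent being trained and the agent interacting with the environment are the same.
Off-policy means that the agent being trained and the agent interacting with the environment are different.

On-policy methods use data inefficiently, mainly because the data must be fresh and cannot be reused.

In on-policy learning, the data used to update policy $\pi$ must be collected by the current policy $\pi$ itself interacting with the environment. Once the policy is updated, the data becomes stale: after every update, the data collected by the old policy $\pi_{\text{old}}$ can no longer be used to train the new policy $\pi_{\text{new}}$. Reusing it would violate this requirement, so the distribution assumed by the training objective would no longer match the distribution of the actual data, which can introduce bias or high variance and even destabilize training.
Here, the data collected by policy $\pi$ usually means complete trajectories $(s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_N, a_N, r_N)$. In gradient-based methods such as policy gradient, this data is used to estimate the expectation that defines the policy gradient $\nabla J(\theta)$.
This environment interaction is usually the most time-consuming part of reinforcement learning; every update requires a large amount of fresh sampling, which makes total training time long.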

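To compute expectations under the new policy $\pi_\theta$ (the target policy) from data collected by the old policy $\pi_{\theta_{old}}$ (the behavior policy), we introduce importance sampling.

Importance sampling is a statistical tool that lets us estimate an expectation under a target distribution using samples drawn from a different distribution. In reinforcement learning, this means estimating an expectation under the target policy from data gathered by another policy.

Suppose we have two probability distributions $p(x)$ and $q(x)$, and we want the expectation $\mathbb{E}_{p}[f(x)]$ of a function $f(x)$ under $p(x)$. If sampling directly from $p(x)$ is difficult, we can sample from another distribution $q(x)$ that is easier to sample from (with $q(x) > 0$ wherever $f(x) p(x) \neq 0$) and reweight the samples to recover the correct expectation:

\mathbb{E}_{p}[f(x)] = \int f(x) p(x) dx = \int f(x) \frac{p(x)}{q(x)} q(x) dx = \mathbb{E}_{q}[f(x) \frac{p(x)}{q(x)}]

where $\rho(x) = \frac{p(x)}{q(x)}$ is called the importance weight.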
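A small numerical check of this identity, as a sketch: the discrete distributions `p` and `q` and the function `f` below are made up, and samples are drawn only from `q`. Each sample is reweighted by $p(x)/q(x)$, and the average is compared with the exact expectation under `p`.

```python
import random


def importance_sampling_estimate(f, p, q, num_samples, rng):
    """Estimate E_p[f(x)] from samples of q, each reweighted by p(x) / q(x)."""
    xs = list(q)
    q_weights = [q[x] for x in xs]
    total = 0.0
    for _ in range(num_samples):
        x = rng.choices(xs, weights=q_weights)[0]
        total += f(x) * p[x] / q[x]
    return total / num_samples


def f(x):
    return x * x


p = {0: 0.1, 1: 0.2, 2: 0.7}        # target distribution (made up)
q = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}  # sampling distribution (made up)

exact = sum(f(x) * p[x] for x in p)  # 0*0.1 + 1*0.2 + 4*0.7 = 3.0
estimate = importance_sampling_estimate(f, p, q, 50000, random.Random(0))
print(exact, estimate)
```

The further `q` is from `p`, the larger the spread of the weights and the noisier the estimate, which is one motivation for keeping the new policy close to the old one.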

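Off-policy application: We can accordingly introduce two distributions:

Target distribution $p$: the policy $\pi_{\theta}$ currently being optimized (target policy).
Sampling distribution $q$: the policy $b$ used to collect data (behavior policy; this $b$ is unrelated to the baseline $b(s_t)$).

With importance sampling, we can use data collected by the behavior policy $b$ (off-policy data) to estimate expectations under the target policy $\pi_{\theta}$, which makes off-policy learning possible.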

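Off-policy Policy Gradient Estimate

In off-policy learning, we sample trajectories from a policy other than the one being optimized. Popular off-policy variants of policy gradient, such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), use trajectories from $\pi_{\text{old}}$ to optimize the current policy. The off-policy policy gradient estimate is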

\hat{g}_{\text{off-policy}} = \frac1N \sum_{i=1}^N\sum_{t=0}^T \frac{\pi_\theta(a_t^{(i)}|s_t^{(i)})}{\pi_{\theta_{old}}(a_t^{(i)}|s_t^{(i)})}\nabla_\theta\log\pi_\theta(a_t^{(i)}|s_t^{(i)}) \, G(\tau^{(i)})

This looks like an importance-sampled version of vanilla PG.

TRPO, PPO

Starting from the off-policy policy gradient estimate, we can construct a new objective function whose gradient is this estimator.

TRPO gives one such construction:

\text{maximize}_{\theta} \quad \hat{\mathbb{E}}_{\tau} \left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t \right]

subject to $\hat{\mathbb{E}}_{\tau} \left[ \text{KL} \left( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \Vert \pi_\theta(\cdot \mid s_t) \right) \right] \leqslant \delta$.

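However, although TRPO comes with a theoretical guarantee of monotonic improvement, its hard-constrained optimization problem is expensive to solve (it requires a second-order approximation, conjugate gradient, line search, and so on).

PPO (Proximal Policy Optimization) aims to keep TRPO's benefit of limiting the size of each policy update while greatly simplifying the optimization.

PPO modifies the objective, replacing TRPO's hard KL constraint with either a soft penalty (PPO-KL) or a clipping mechanism (PPO-Clip, the most common), so that standard first-order optimizers such as SGD or Adam can be used.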

PPO-Clip: introduces a clipping function that restricts the probability ratio $\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}$ to the range $[1-\epsilon, 1+\epsilon]$.
$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_{\tau} \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$
where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$.
PPO-KL / PPO adaptive penalty: mimics TRPO's KL-divergence constraint more directly, but as a penalty term in the objective rather than a hard constraint.
$L^{KL}(\theta) = \mathbb{E}_{s, a \sim \pi_{\theta_{old}}} \left[ r_t(\theta) A_t - \beta D_{KL}(\pi_{\theta_{old}}(\cdot|s) || \pi_{\theta}(\cdot|s)) \right]$
where $\beta$ is an adaptive penalty coefficient. If the mean KL divergence $\bar{D}_{KL}$ between the new and old policies exceeds the target threshold $d_{target}$, $\beta$ is increased to penalize policy change more strongly; if it falls well below the target, $\beta$ is decreased.
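Both PPO variants can be sketched in a few lines. The code below is a minimal illustration, not a full PPO implementation: `ppo_clip_objective` evaluates the clipped surrogate from per-step log-probabilities and advantages, and `update_beta` adjusts the penalty coefficient. The factors 1.5 and 2 in `update_beta` follow the heuristic described in the PPO paper; the function names and the sample numbers are assumptions made here for illustration.

```python
import math


def clip(x, low, high):
    return max(low, min(x, high))


def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Sample average of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # r_t = pi_theta / pi_theta_old
        clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


def update_beta(beta, mean_kl, kl_target):
    """Adaptive KL penalty: raise beta when the policy moved too far,
    lower it when the policy barely moved, otherwise leave it unchanged."""
    if mean_kl > 1.5 * kl_target:
        return beta * 2.0
    if mean_kl < kl_target / 1.5:
        return beta / 2.0
    return beta


# Ratio 1.5 with a positive advantage is capped at 1 + epsilon = 1.2.
capped = ppo_clip_objective([math.log(1.5)], [0.0], [1.0])
# Ratio 1.5 with a negative advantage is not capped: the objective is -1.5.
uncapped = ppo_clip_objective([math.log(1.5)], [0.0], [-1.0])
print(capped, uncapped)
```

The asymmetry in the last two lines is the point of taking the minimum: clipping removes the incentive to move further in a direction that already looks good, but never hides a change that makes the objective worse.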

GRPO

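GRPO (Group Relative Policy Optimization) is a PPO-style method that replaces the learned value-function baseline with a baseline computed from a group of outputs sampled for the same question, while keeping PPO's clipped objective. For an introduction to GRPO, see the README of HW3.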

Advantage estimation. The core idea of GRPO is to sample many outputs for each question from the policy $\pi_\theta$ and use them to compute a baseline. This is convenient because we avoid the need to learn a neural value function $V_\phi(s)$, which can be hard to train and is cumbersome from the systems perspective. For a question $q$ and group outputs $\{o^{(i)}\}_{i=1}^G\sim\pi_\theta(\cdot|q)$, let $r^{(i)} = R(q, o^{(i)})$ be the reward for the $i$-th output. DeepSeekMath and DeepSeek R1 compute the group-normalized reward for the $i$-th output as

A^{(i)} = \frac{r^{(i)} - \text{mean}(r^{(1)}, r^{(2)}, \cdots, r^{(G)})}{\text{std}(r^{(1)}, r^{(2)}, \cdots, r^{(G)}) + \texttt{advantage\_eps}} \quad \text{(Eq. 28)}

where $\texttt{advantage\_eps}$ is a small constant to prevent division by zero. Note that this advantage $A^{(i)}$ is the same for each token in the response, i.e., $A_t^{(i)} = A^{(i)}, \forall t\in 1,\cdots, |o^{(i)}|$, so we drop the $t$ subscript in the following.
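Eq. 28 translates almost directly into code. The sketch below is one possible reading: it uses the population standard deviation (`statistics.pstdev`), which is an assumption, since the text does not say whether the population or the sample standard deviation is meant. The function name and the example rewards are also made up.

```python
import statistics


def group_normalized_advantages(rewards, advantage_eps=1e-6):
    """A^(i) = (r^(i) - mean(r)) / (std(r) + advantage_eps) for one group.

    Assumption: std is the population standard deviation; a sample
    standard deviation (statistics.stdev) would give slightly smaller values.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + advantage_eps) for r in rewards]


# Four outputs for one question, with binary rewards (made-up values).
group_adv = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])
# When every output gets the same reward, all advantages are zero and
# advantage_eps keeps the division well defined.
flat_adv = group_normalized_advantages([1.0, 1.0, 1.0])
print(group_adv, flat_adv)
```

A group in which all outputs receive the same reward therefore contributes no learning signal, since every advantage is zero.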

GRPO objective. The GRPO objective combines three ideas:

Off-policy policy gradient;
Computing advantage $A^{(i)}$ with group normalization;
A clipping mechanism, as in PPO.

The purpose of clipping is to maintain stability when taking many gradient steps on a single batch of rollouts. It works by keeping the policy $\pi_\theta$ from straying too far from the old policy.

The GRPO-Clip objective uses a min function to clip the probability ratio, preventing the policy from deviating too far from the old policy during training.

Let us first write out the full GRPO-Clip objective, and then we can build some intuition on what the clipping does (Eq.29):

\begin{align*} J_{\text{GRPO-Clip}}(\theta) &= \mathbb{E}_{q\sim \mathcal D, \{o^{(i)}\}_{i=1}^G \sim \pi_\theta(\cdot|q)}\\&\left[\frac1G\sum_{i=1}^G \frac1{|o^{(i)}|}\sum_{t=1}^{|o^{(i)}|}\min \left(\frac{\pi_\theta(o_t^{(i)}|q,o_{<t}^{(i)})}{\pi_{\theta_{old}}(o_t^{(i)} |q,o_{<t}^{(i)})}A^{(i)}, \text{clip}\left(\frac{\pi_\theta(o_t^{(i)}|q,o_{<t}^{(i)})}{\pi_{\theta_{old}}(o_t^{(i)} |q,o_{<t}^{(i)})},1-\epsilon, 1+\epsilon\right)A^{(i)}\right)\right] \end{align*}

The hyperparameter $\epsilon>0$ controls how much the policy can change. To see this, we can rewrite the per-token objective in a more intuitive way. Define the function

g(\epsilon, A^{(i)}) = \begin{cases} (1+\epsilon) A^{(i)} \quad \text{if }A^{(i)}\ge 0\\ (1-\epsilon) A^{(i)} \quad \text{if }A^{(i)} <0 \end{cases}

We can rewrite the per-token objective as

\text{per-token objective} = \min (\frac{\pi_\theta(o_t^{(i)}|q,o_{<t}^{(i)})}{\pi_{\theta_{old}}(o_t^{(i)} |q,o_{<t}^{(i)})}A^{(i)}, g(\epsilon, A^{(i)}))

We can now reason by cases. When the advantage $A^{(i)}$ is positive, the per-token objective simplifies to

\text{per-token objective} = \min (\frac{\pi_\theta(o_t^{(i)}|q,o_{<t}^{(i)})}{\pi_{\theta_{old}}(o_t^{(i)} |q,o_{<t}^{(i)})}, 1+\epsilon) A^{(i)}

Since $A^{(i)} > 0$, the objective goes up if the action $o_t^{(i)}$ becomes more likely under $\pi_\theta$, i.e., if $\pi_\theta (o_t^{(i)}|q, o_{<t}^{(i)})$ increases. The clipping with min limits how much the objective can increase. So the policy $\pi_\theta$ is not incentivized to go very far from the old policy $\pi_{\theta_{old}}$.

Analogously, when the advantage is negative, the model tries to drive down $\pi_\theta(o_t^{(i)}|q,o_{<t}^{(i)})$ , but is not incentivized to decrease it below $(1-\epsilon)\pi_{\theta_{old}}(o_t^{(i)}|q,o_{<t}^{(i)})$ .
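The case analysis can be verified numerically. The sketch below implements the per-token objective in both forms, the original min/clip form and the rewritten form using $g(\epsilon, A^{(i)})$, and also shows the token-then-group averaging of the full objective for given probability ratios. The function names and the sample ratios are assumptions made here for illustration.

```python
def clip(x, low, high):
    return max(low, min(x, high))


def per_token_min_clip(ratio, advantage, epsilon):
    """Form used in the objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    return min(ratio * advantage,
               clip(ratio, 1 - epsilon, 1 + epsilon) * advantage)


def g(epsilon, advantage):
    """Cap on the objective: (1 + eps) * A if A >= 0, else (1 - eps) * A."""
    if advantage >= 0:
        return (1 + epsilon) * advantage
    return (1 - epsilon) * advantage


def per_token_with_g(ratio, advantage, epsilon):
    """Rewritten form: min(r * A, g(eps, A))."""
    return min(ratio * advantage, g(epsilon, advantage))


def grpo_clip_objective(ratios, advantages, epsilon=0.2):
    """ratios[i][t] is the probability ratio of token t in response i, and
    advantages[i] is A^(i). Averages over tokens, then over the group."""
    per_response = [
        sum(per_token_min_clip(r, adv, epsilon) for r in resp) / len(resp)
        for resp, adv in zip(ratios, advantages)
    ]
    return sum(per_response) / len(per_response)


# Two responses: one with two tokens and A = 1, one with one token and A = -1.
objective = grpo_clip_objective([[1.5, 1.0], [0.5]], [1.0, -1.0], epsilon=0.2)
print(objective)  # ((1.2 + 1.0) / 2 + (-0.8)) / 2, approximately 0.15
```

In the example, the ratio 1.5 is capped at 1.2 for the positive-advantage response, and the ratio 0.5 is floored at 0.8 for the negative-advantage response, matching the two cases discussed above.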

¹ For a discrete action space, the policy network usually outputs a probability distribution $P(a \mid s_t)$, and the algorithm samples the executed action $a_t$ from it. For a continuous action space, the policy network usually outputs a mean $\mu(s_t)$ and a standard deviation $\sigma(s_t)$ that define a Gaussian distribution, and the executed action $a_t$ is sampled from $\mathcal{N}(\mu(s_t), \sigma(s_t)^2)$.