GPT-2 Implementation Notes (1)

These notes summarize the basic part of hw5 from Stanford CS224N and walk through a basic implementation of the GPT-2 model.

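1. Implementing the Attention Module

$a.$ Initialization

We first initialize the components of the attention module:

The $Q$, $K$, $V$ layers and the dropout layer
Configuration values such as the number of attention heads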

self.num_attention_heads = config.num_attention_heads
self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
self.all_head_size = self.num_attention_heads * self.attention_head_size

# Initialize the linear transformation layers for key, value, query.
self.query = nn.Linear(config.hidden_size, self.all_head_size)
self.key = nn.Linear(config.hidden_size, self.all_head_size)
self.value = nn.Linear(config.hidden_size, self.all_head_size)
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)

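$b.$ `transform` module

The `transform` module converts the input hidden states into the $Q$, $K$, $V$ tensors used in the attention computation. It consists of the following steps:

Project the hidden states into the corresponding vector space with the $Q$, $K$, or $V$ linear layer: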

proj = linear_layer(x)

Split into multiple heads. As in the paper, we split the projection of size $hd$ into $h$ attention heads of size $d$ each:

proj = rearrange(proj, 'b t (h d) -> b t h d', h=self.num_attention_heads)

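Convert to a batch-friendly layout. We move the head dimension ahead of the sequence dimension, so that the last two dimensions of each head are [seq_len, attention_head_size]: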

# By proper transpose, we have proj of size [bs, num_attention_heads, seq_len, attention_head_size].
proj = rearrange(proj, 'b t h d -> b h t d')
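
The same split can also be written without einops. The following is only a minimal sketch under assumed toy sizes (batch 2, 5 tokens, 4 heads of size 8); `split_heads` is a hypothetical helper name, not part of the assignment code. It uses `view` and `transpose` to reach the same [bs, num_attention_heads, seq_len, attention_head_size] layout.

```python
import torch
from torch import nn

def split_heads(proj: torch.Tensor, num_heads: int) -> torch.Tensor:
    """Hypothetical helper: [b, t, h*d] -> [b, h, t, d]."""
    b, t, hd = proj.shape
    d = hd // num_heads
    # Equivalent to 'b t (h d) -> b t h d'.
    proj = proj.view(b, t, num_heads, d)
    # Equivalent to 'b t h d -> b h t d'.
    return proj.transpose(1, 2)

torch.manual_seed(0)
linear_layer = nn.Linear(32, 32)   # hidden_size = 4 heads * 8 dims (toy values)
x = torch.randn(2, 5, 32)          # [bs, seq_len, hidden_size]
proj = linear_layer(x)
heads = split_heads(proj, num_heads=4)
print(heads.shape)                 # torch.Size([2, 4, 5, 8])
```

Head 1 of token 3 is simply columns 8 to 15 of that token's projection; the split only changes how the same numbers are indexed.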

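$c.$ `attention` module

The `attention` module implements the core computation inside each attention head. It consists of the following steps:

Compute the following part of the attention formula:

$$\frac{QK^T}{\sqrt{d_k}}$$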

# (B, nh, T, d_k) x (B, nh, d_k, T) -> (B, nh, T, T)
att = torch.matmul(query, key.transpose(-2, -1))
att = att / math.sqrt(query.size(-1))

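Apply the causal mask and the padding mask. We build a lower-triangular matrix with `torch.tril` for the causal mask, so that position $i$ can only attend to positions $j \le i$: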

seq_len = query.size(-2)
causal_mask = torch.tril(torch.ones(seq_len, seq_len, device=query.device))
att = att.masked_fill(causal_mask == 0, float('-inf'))

if attention_mask is not None:
    att = att + attention_mask

Complete the rest of the formula: apply softmax, apply dropout, and finally multiply by $V$:

# Normalize the scores to get attention weights.
att = F.softmax(att, dim=-1)

# Apply dropout.
att = self.dropout(att)

# (B, nh, T, T) x (B, nh, T, d_k) -> (B, nh, T, d_k)
att = torch.matmul(att, value)

return att

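In a tensor of shape [B, nh, T, T], dim = -1 is the key-sequence-length dimension. Taking the softmax along this dimension gives, independently for each query token i, an attention weight distribution over all key tokens j.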
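
To make the effect of the causal mask and of the softmax over dim = -1 concrete, here is a small worked example. It is a sketch with assumed toy values: the scores are all zeros, so each row of the result is easy to read off.

```python
import torch
import torch.nn.functional as F

seq_len = 4
# Toy scores of shape [B, nh, T, T]; all zeros so the result is easy to read.
att = torch.zeros(1, 1, seq_len, seq_len)

# Lower-triangular causal mask: query i may look at keys 0..i only.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
att = att.masked_fill(causal_mask == 0, float('-inf'))

# Softmax over the key dimension: every query row becomes a distribution.
weights = F.softmax(att, dim=-1)
print(weights[0, 0])
```

Row i is uniform over keys 0..i and exactly zero afterwards; for example, row 2 is [1/3, 1/3, 1/3, 0].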

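$d.$ `forward` module

The `forward` module is the entry point of the whole multi-head attention layer; it orchestrates the `transform` and `attention` modules. It consists of the following steps:

Generate $Q$, $K$, $V$: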

key_layer = self.transform(hidden_states, self.key)
value_layer = self.transform(hidden_states, self.value)
query_layer = self.transform(hidden_states, self.query)

Compute multi-head attention:

attn_value = self.attention(key_layer, query_layer, value_layer, attention_mask)

Merge the heads by concatenating the outputs of all independently computed heads:

attn_value = rearrange(attn_value, 'b h t d -> b t (h d)')
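
The merge step can be sketched in plain PyTorch as well. The example below uses assumed toy sizes and illustrates one detail that is easy to get wrong: the head axis has to be moved next to the per-head dimension before flattening, otherwise the shape is right but the contents are scrambled.

```python
import torch

torch.manual_seed(0)
b, h, t, d = 2, 4, 5, 8
attn_value = torch.randn(b, h, t, d)

# 'b h t d -> b t (h d)': move the head axis next to d, then flatten the two.
merged = attn_value.transpose(1, 2).reshape(b, t, h * d)

# Reshaping without the transpose keeps the shape but mixes tokens and heads.
naive = attn_value.reshape(b, t, h * d)

print(merged.shape)                                          # torch.Size([2, 5, 32])
print(torch.equal(merged[0, 3, 8:16], attn_value[0, 1, 3]))  # head 1 of token 3
print(torch.equal(merged, naive))
```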

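2. The GPT-2 Layer

The core idea of the Transformer is to process information in $N$ successive rounds. Every round has the same structure, although each has its own parameters. GPT2Layer implements this basic processing step, so GPT2Model only needs to assemble and call these layers to build the complete model.

The relationship between GPT2Layer and GPT2Model is like that between a building block and a castle built from blocks. GPT2Layer is a powerful, standardized "processing block". GPT2Model is the "chief designer" and the "building structure": it combines the blocks and handles the castle's "entrance" (input) and "exit" (output).

GPT2Layer contains a self-attention mechanism (to capture contextual relationships) and a feed-forward network (to refine the information). It takes the hidden states of a sequence and outputs a new, more deeply contextualized representation of that sequence.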

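$a.$ Initialization

First, we initialize all the modules (that is, nn.Module instances) needed to build a Transformer block:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$$

Attention part. Following the Transformer architecture, we need:
- A CausalSelfAttention instance, the core of the attention computation.
- A linear layer that applies a linear transformation to the concatenated result of multi-head attention (the $W^O$ in the formula).
- A layer normalization module that stabilizes the data before it enters the attention sub-layer.
- A Dropout layer for regularization.
Feed-forward network part:
- The first linear layer of the FFN, which expands the dimension from hidden_size to intermediate_size (usually 4 * hidden_size).
- The activation function of the FFN.
- The second linear layer of the FFN, which reduces the dimension from intermediate_size back to hidden_size.
- Another layer normalization module that stabilizes the data before it enters the FFN sub-layer.
- Another Dropout layer. These last two play the same role as their counterparts in the attention part.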

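The feed-forward network is the module that provides the Transformer's position-wise non-linear transformation.

The expansion step of the FFN can be viewed as "disentangling" mixed features. In the higher-dimensional intermediate_size space, the model can let certain neurons specialize in specific, finer-grained patterns. For example, one neuron might activate on the grammatical feature "plural noun" and another on the semantic feature "positive sentiment". The projection back down then learns how to recombine the useful fine-grained features that were "lit up" in the high-dimensional space into a more meaningful and informative hidden_size vector.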
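
The expand-then-project shape flow can be sketched on its own. The sizes below are an assumption for illustration (hidden_size = 768 as in the smallest GPT-2, intermediate_size = 4 * hidden_size); the variable names mirror the ones used in these notes.

```python
import torch
from torch import nn
import torch.nn.functional as F

hidden_size, intermediate_size = 768, 4 * 768   # assumed sizes for illustration

interm_dense = nn.Linear(hidden_size, intermediate_size)   # expand
out_dense = nn.Linear(intermediate_size, hidden_size)      # project back

x = torch.randn(2, 5, hidden_size)          # [bs, seq_len, hidden_size]
expanded = F.gelu(interm_dense(x))          # [bs, seq_len, intermediate_size]
y = out_dense(expanded)                     # [bs, seq_len, hidden_size]

# Weights and biases of both linear layers.
n_params = sum(p.numel() for m in (interm_dense, out_dense) for p in m.parameters())
print(expanded.shape, y.shape, n_params)
```

With these sizes the two linear layers hold 2 * 768 * 3072 weights plus 3072 + 768 biases, about 4.7 million parameters per layer.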

self.self_attention = CausalSelfAttention(config)
# Add-norm for multi-head attention.
self.attention_dense = nn.Linear(config.hidden_size, config.hidden_size)
self.attention_layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.attention_dropout = nn.Dropout(config.hidden_dropout_prob)

# Feed forward.
self.interm_dense = nn.Linear(config.hidden_size, config.intermediate_size)
self.interm_af = F.gelu

# Add-norm for feed forward.
self.out_dense = nn.Linear(config.intermediate_size, config.hidden_size)
self.out_layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.out_dropout = nn.Dropout(config.hidden_dropout_prob)

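$b.$ `forward` module

The `forward` module implements the forward pass of the Transformer block. The forward pass consists of the self-attention sub-layer and the feed-forward sub-layer, which both follow the same pattern:

Save the input for the residual connection.
Apply layer normalization.
Run the computation on the normalized input. This step may expand the dimension.
Apply a linear layer. This step may reduce the dimension.
Apply dropout and add the residual to obtain the output.

The concrete steps for each sub-layer are as follows:

Self-attention sub-layer:
1. Save the input hidden_states for the residual connection later.
2. Apply layer normalization to the input.
3. Feed the normalized data into the self-attention module.
4. Apply the final linear transformation to the attention output.
5. Apply dropout and add the residual.
Feed-forward sub-layer:
1. Save the input hidden_states for the residual connection later.
2. Apply layer normalization to the input.
3. Feed the normalized data into the FFN: a linear layer followed by the GELU activation maps the input into the higher-dimensional space.
4. Pass the GELU output through the second linear layer of the FFN, which maps it back to the original dimension.
5. Apply dropout and add the residual.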

residual = hidden_states
ln_output = self.attention_layer_norm(hidden_states)
att_output = self.self_attention(ln_output, attention_mask)
dense_output = self.attention_dense(att_output)
hidden_states = residual + self.attention_dropout(dense_output)

# Feed-forward sub-layer with pre-layer norm
residual = hidden_states
ln_output = self.out_layer_norm(hidden_states)
interm_output = self.interm_af(self.interm_dense(ln_output))
dense_output = self.out_dense(interm_output)
hidden_states = residual + self.out_dropout(dense_output)

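3. The Base GPT Model

The base GPT model is the base class GPTPreTrainedModel, the parent class of all subsequent models (here, GPT2Model). It contains no concrete model layers and only provides common functionality.

When training a model from scratch, we need to give the model weights (nn.Linear, nn.Embedding, and so on) reasonable initial values. Poor initialization can make training unstable or prevent convergence, so the base class initializes these parameters as follows:

The weights of linear layers (nn.Linear) and embedding layers (nn.Embedding) are sampled from a normal distribution with mean 0 and standard deviation config.initializer_range. Linear biases are set to 0.
The weight of layer normalization (nn.LayerNorm) is set to 1 and its bias to 0, so at the start of training its learnable affine transform is the identity and the layer performs plain normalization only, which helps stabilize training.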
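
What "weight 1, bias 0" means for LayerNorm can be checked directly. The sketch below uses assumed toy sizes and an `eps` picked for illustration; it compares the module output with a manual normalization over the last dimension.

```python
import torch
from torch import nn

torch.manual_seed(0)
ln = nn.LayerNorm(8, eps=1e-5)   # eps chosen for illustration
ln.weight.data.fill_(1.0)        # same values _init_weights assigns
ln.bias.data.zero_()

x = torch.randn(3, 8) * 5 + 2    # inputs with non-zero mean and large scale
out = ln(x)

# Manual normalization over the last dimension (biased variance, as LayerNorm uses).
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + 1e-5)

print(torch.allclose(out, manual, atol=1e-5))
print(out.mean(dim=-1))          # close to 0 for every row
```

The layer still rescales its input; only the learnable scale and shift start out as the identity.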

class GPTPreTrainedModel(nn.Module):

  def __init__(self, config: PretrainedConfig, *inputs, **kwargs):
    super().__init__()
    self.config = config
    self.name_or_path = config.name_or_path

  def init_weights(self):
    # Initialize weights
    self.apply(self._init_weights)

  def _init_weights(self, module):
    """ Initialize the weights """
    if isinstance(module, (nn.Linear, nn.Embedding)):
      # Slightly different from the TF version which uses truncated_normal for initialization
      # cf https://github.com/pytorch/pytorch/pull/5617
      module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
    elif isinstance(module, nn.LayerNorm):
      module.bias.data.zero_()
      module.weight.data.fill_(1.0)
    if isinstance(module, nn.Linear) and module.bias is not None:
      module.bias.data.zero_()

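4. The GPT-2 Model

The GPT-2 model builds the overall architecture: it combines the embedding layers, the stack of GPT-2 layers, and the output layer.

$a.$ Initialization

In the initializer we create the torch.nn components needed by the embedding layers, the GPT-2 layers, and the output layer: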

# Embedding layers.
self.word_embedding = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
self.pos_embedding = nn.Embedding(config.max_position_embeddings, config.hidden_size)
self.embed_dropout = nn.Dropout(config.hidden_dropout_prob)

# Register position_ids (1, len position emb) to buffer because it is a constant.
position_ids = torch.arange(config.max_position_embeddings).unsqueeze(0)
self.register_buffer('position_ids', position_ids)

# GPT-2 layers.
self.gpt_layers = nn.ModuleList([GPT2Layer(config) for _ in range(config.num_hidden_layers)])

# [CLS] token transformations.
# self.pooler_dense = nn.Linear(config.hidden_size, config.hidden_size)
# self.pooler_af = nn.Tanh()

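nn.ModuleList is the standard way in PyTorch to hold a variable number of submodules or repeated submodules. Unlike a plain Python list, nn.ModuleList registers every module in the list [GPT2Layer(...), GPT2Layer(...), ...] with the parent module GPT2Model, so their parameters appear in parameters() and state_dict().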

# Final layer norm.
self.final_layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.init_weights()
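
The registration behavior of nn.ModuleList is easy to verify. The following sketch uses two hypothetical toy classes (not from the assignment) that differ only in how they store their layers.

```python
from torch import nn

class WithPlainList(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain Python list: the parent module does not see these layers.
        self.layers = [nn.Linear(4, 4) for _ in range(3)]

class WithModuleList(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers each layer as a submodule.
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])

n_plain = len(list(WithPlainList().parameters()))
n_registered = len(list(WithModuleList().parameters()))
print(n_plain, n_registered)   # 0 6
```

With the plain list, an optimizer built from `parameters()` would receive nothing to train, and `state_dict()` would not save the layers.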

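$b.$ `embed` module

The embedding step works as follows:

Use word_embedding to turn token IDs into word vectors (what the token is).
Use pos_embedding to turn position IDs into position vectors (where the token is).
Add the two to get an initial vector that carries both "identity" and "position" information.
Apply Dropout, which completes the embedding module.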

input_shape = input_ids.size()
seq_length = input_shape[1]

inputs_embeds = self.word_embedding(input_ids)
pos_ids = self.position_ids[:, :seq_length]
pos_embeds = self.pos_embedding(pos_ids)

embeddings = inputs_embeds + pos_embeds
embeddings = self.embed_dropout(embeddings)
return embeddings

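$c.$ `encode` module

The `encode` module passes the sequence of vectors from the embedding layer through the stack of GPT-2 layers and encodes it into a final, context-rich representation. It consists of the following steps:

Build the extended self-attention mask, which is then passed to every layer.
Apply the initialized GPT-2 layers one after another (each contains a self-attention sub-layer and a feed-forward sub-layer).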

def encode(self, hidden_states, attention_mask):
    extended_attention_mask: torch.Tensor = get_extended_attention_mask(attention_mask, self.dtype)

    # Pass the hidden states through the encoder layers.
    for i, layer_module in enumerate(self.gpt_layers):
      # Feed the output of the previous GPT-2 layer into the next one.
      hidden_states = layer_module(hidden_states, extended_attention_mask)

    return hidden_states
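
The helper get_extended_attention_mask is not shown in these notes. Since the attention code adds the mask to the scores, it presumably turns the 0/1 padding mask into an additive mask that broadcasts over heads and query positions. The following is only a guess at that behavior: `make_additive_mask` is a hypothetical stand-in, and the fill value (the most negative finite number of the dtype) is an assumption.

```python
import torch

def make_additive_mask(attention_mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    """Hypothetical stand-in: [bs, seq_len] of 1/0 -> [bs, 1, 1, seq_len] of 0/very negative."""
    extended = attention_mask[:, None, None, :].to(dtype)
    # 1 -> 0 (keep the score), 0 -> most negative finite value (suppress the key).
    return (1.0 - extended) * torch.finfo(dtype).min

attention_mask = torch.tensor([[1, 1, 1, 0],
                               [1, 1, 1, 1]])
extended = make_additive_mask(attention_mask, torch.float32)

scores = torch.zeros(2, 1, 4, 4)                      # [bs, nh, T, T]
weights = torch.softmax(scores + extended, dim=-1)    # broadcast over heads and queries
print(extended.shape)                                 # torch.Size([2, 1, 1, 4])
print(weights[0, 0, 0])                               # padded key gets weight 0
```

The two singleton dimensions are what let one [bs, seq_len] mask apply to every head and every query row of the [B, nh, T, T] score tensor.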

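$d.$ `forward` module

The `forward` module defines the complete "pipeline" from input to output. It combines the following steps:

Generate the token embeddings.
Pass the embeddings through the stack of GPT-2 layers.
Apply one more layer normalization to the final output of the Transformer layers to stabilize it.
Extract the hidden state of the last valid token.

In many tasks (such as text classification) we do not need the output of every token in the sequence; we need a single vector that represents the meaning of the whole sentence. A common choice is the hidden state of the last non-padding token.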

# Get the embedding for each input token.
embedding_output = self.embed(input_ids=input_ids)

# Feed to a transformer (a stack of GPTLayers).
sequence_output = self.encode(embedding_output, attention_mask=attention_mask)
sequence_output = self.final_layer_norm(sequence_output)

# Get the hidden state of the final token.
# Ensure the index tensor is an integer type (long) for advanced indexing.
# Also clamp to >= 0 to avoid -1 when a sequence is fully padded.
last_non_pad_idx = attention_mask.long().sum(dim=1) - 1  # Subtract 1 to get last index
last_non_pad_idx = last_non_pad_idx.clamp(min=0)
last_token = sequence_output[torch.arange(sequence_output.shape[0]), last_non_pad_idx]

return {'last_hidden_state': sequence_output, 'last_token': last_token}

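Why not simply use sequence_output[:, -1]? Because the end of a sequence is often a meaningless [PAD] token, and taking the last position directly would return the wrong information.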
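
A small example makes the difference visible. It is a sketch with assumed toy data: sequences are right-padded, and every hidden state is set to its own position index so that the selected position can be read from the output.

```python
import torch

# Toy batch: 2 sequences, 4 positions, hidden size 3. The first sequence ends with one pad.
attention_mask = torch.tensor([[1, 1, 1, 0],
                               [1, 1, 1, 1]])
# Make each hidden state equal to its position index so the selection is visible.
sequence_output = torch.arange(4.0).view(1, 4, 1).expand(2, 4, 3)

last_non_pad_idx = (attention_mask.long().sum(dim=1) - 1).clamp(min=0)
last_token = sequence_output[torch.arange(sequence_output.shape[0]), last_non_pad_idx]

print(last_non_pad_idx.tolist())           # [2, 3]
print(last_token[:, 0].tolist())           # [2.0, 3.0]
print(sequence_output[:, -1, 0].tolist())  # [3.0, 3.0]: row 0 would be the pad position
```

Counting the ones in the mask only locates the last real token when padding is on the right; with left padding the last position would already be the correct one.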

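$e.$ Loading the pretrained model

We load a pretrained model through the from_pretrained method, which is decorated with @classmethod, rather than producing the weights ourselves. It transfers the parameters of a model already trained and published on Hugging Face into our model. The steps are as follows:

Load our model and the pretrained model from Hugging Face.
For the parts whose structure matches the pretrained model, copy the weights over directly: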

gpt_model = OpenAIGPT2Model.from_pretrained(model).eval()
our_model = GPT2Model(GPT2Config(hidden_size=d, num_hidden_layers=l,
  num_attention_heads=num_heads,
  intermediate_size=d*4)).eval()

# Load word and positional embeddings.
our_model.word_embedding.load_state_dict(gpt_model.wte.state_dict())
our_model.pos_embedding.load_state_dict(gpt_model.wpe.state_dict())

# Remap the final layer norm values.
our_model.final_layer_norm.weight.data = gpt_model.state_dict()['ln_f.weight']
our_model.final_layer_norm.bias.data = gpt_model.state_dict()['ln_f.bias']

For the parts whose structure differs, we remap the weights layer by layer:

sd = gpt_model.state_dict()  # Build the state dict once and reuse it.
for i in range(l):
  layer = our_model.gpt_layers[i]  # Separate name, so the layer count `l` is not shadowed.
  # Remap the Q,K,V weights from a conv1d to 3 linear projections
  layer.self_attention.query.weight.data = sd[f'h.{i}.attn.c_attn.weight'][:, :d].T
  layer.self_attention.query.bias.data = sd[f'h.{i}.attn.c_attn.bias'][:d]
  layer.self_attention.key.weight.data = sd[f'h.{i}.attn.c_attn.weight'][:, d:d*2].T
  layer.self_attention.key.bias.data = sd[f'h.{i}.attn.c_attn.bias'][d:d*2]
  layer.self_attention.value.weight.data = sd[f'h.{i}.attn.c_attn.weight'][:, d*2:].T
  layer.self_attention.value.bias.data = sd[f'h.{i}.attn.c_attn.bias'][d*2:]

  # Remap final dense layer in MHA.
  layer.attention_dense.weight.data = sd[f'h.{i}.attn.c_proj.weight'].T
  layer.attention_dense.bias.data = sd[f'h.{i}.attn.c_proj.bias']

  # Remap attention layer norm.
  layer.attention_layer_norm.weight.data = sd[f'h.{i}.ln_1.weight']
  layer.attention_layer_norm.bias.data = sd[f'h.{i}.ln_1.bias']

  # Remap post-attention MLP layers.
  layer.interm_dense.weight.data = sd[f'h.{i}.mlp.c_fc.weight'].T
  layer.interm_dense.bias.data = sd[f'h.{i}.mlp.c_fc.bias']
  layer.out_dense.weight.data = sd[f'h.{i}.mlp.c_proj.weight'].T
  layer.out_dense.bias.data = sd[f'h.{i}.mlp.c_proj.bias']

  # Remap second layer norm weights.
  layer.out_layer_norm.weight.data = sd[f'h.{i}.ln_2.weight']
  layer.out_layer_norm.bias.data = sd[f'h.{i}.ln_2.bias']
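
The slicing and the `.T` in the remapping come from a difference in weight layout. As far as I understand it, the Hugging Face GPT-2 checkpoint stores the fused Q/K/V projection in (in_features, out_features) layout and computes x @ W + b, while nn.Linear stores (out_features, in_features). The sketch below checks the idea with random stand-in tensors instead of a real checkpoint; all sizes and names are assumptions for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
d = 8
# Stand-ins for a fused c_attn layer stored in (in_features, out_features) layout.
c_attn_weight = torch.randn(d, 3 * d)
c_attn_bias = torch.randn(3 * d)

x = torch.randn(2, 5, d)
fused = x @ c_attn_weight + c_attn_bias      # [2, 5, 3d]: Q, K, V side by side
q_ref, k_ref, v_ref = fused.split(d, dim=-1)

# nn.Linear stores weight as (out_features, in_features), hence slice + transpose.
key = nn.Linear(d, d)
key.weight.data = c_attn_weight[:, d:2 * d].T.contiguous()
key.bias.data = c_attn_bias[d:2 * d].clone()

print(torch.allclose(key(x), k_ref, atol=1e-4))
```

The same pattern with the column ranges [:d] and [2*d:] recovers the query and value projections.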
