The RNN (Recurrent Neural Network) Family Explained: RNN, LSTM, and GRU

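Contents

- 1. RNN Basics: The Core Idea of Sequence Modeling
  - 1.1 The Essence and Core Mechanism of RNNs
  - 1.2 Application Scenarios and Structure Types
- 2. The Vanilla RNN: Where Sequence Models Began
  - 2.1 Internal Structure and Mathematical Formulation
  - 2.2 Worked Example
  - 2.3 The RNN API in PyTorch
  - 2.4 Code Example
  - 2.5 Strengths, Weaknesses, and the Gradient Problem
- 3. LSTM: Gating to Solve Long-Term Dependencies
  - 3.1 The Four Gating Components in Detail
  - 3.2 Complete LSTM Worked Example for "Bob"
  - 3.3 The LSTM API in PyTorch
  - 3.4 Code Example
  - 3.5 The Mathematical Essence of Gating
- 4. GRU: A Lightweight Evolution of LSTM
  - 4.1 The Simplified Two-Gate Structure
  - 4.2 Complete GRU Worked Example for "Bob"
  - 4.3 The GRU API in PyTorch
  - 4.4 Code Example
- 5. Comparing the Three Models and Choosing in Practice
  - 5.1 Core Metrics Compared
  - 5.2 Recommended Use Cases
  - 5.3 Key Takeaways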

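As a core model for sequence data, the recurrent neural network (RNN) plays a key role in natural language processing, time-series analysis, and related fields. This article walks through the internal mechanisms, mathematical formulation, and practical use of the vanilla RNN, LSTM, and GRU, and compares how the three compute on one shared example so that the evolution of sequence models becomes clear.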

1. RNN Basics: The Core Idea of Sequence Modeling


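1.1 The Essence and Core Mechanism of RNNs

The core innovation of the RNN (Recurrent Neural Network) is its "recurrent memory" mechanism: the hidden state from the previous time step is combined with the current input, which lets the network model dependencies within a sequence. In essence, it extracts features along the time dimension by reusing the same parameters at every step. Mathematically:

$h_t = f(W \cdot [x_t, h_{t-1}] + b)$

$x_t$: input vector at the current time step
$h_{t-1}$: hidden state from the previous time step
$W$, $b$: weight matrix and bias, shared across all time steps
$f$: activation function (typically tanh; sigmoid or ReLU are also possible)

This structure lets an RNN capture short-term dependencies in a sequence. For example, deciding the part of speech of "苹果" (apple) in "我爱吃苹果" (I love eating apples) depends on the preceding verb "吃" (eat).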

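1.2 Application Scenarios and Structure Types

RNNs fall into four categories by input/output structure:

N vs N: input and output have equal length; suited to frame-level labeling in speech recognition
N vs 1: sequence input, single output; text classification is the typical case
1 vs N: single input, sequence output; commonly used for image captioning
N vs M: the seq2seq architecture, with no constraint tying input length to output length; the basis of machine translation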

2. The Vanilla RNN: Where Sequence Models Began

[Figure: structure diagram of the vanilla RNN]

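2.1 Internal Structure and Mathematical Formulation

The computation of a vanilla RNN breaks down into three steps:

Concatenate the inputs: $[x_t, h_{t-1}]$
Linear transform: $W \cdot [x_t, h_{t-1}] + b$
Activation: $h_t = \tanh(W \cdot [x_t, h_{t-1}] + b)$

Taking an RNN with a 3-dimensional input and a 4-dimensional hidden state as an example, its parameters are:

Input weights $W_{ih} \in \mathbb{R}^{4 \times 3}$
Hidden weights $W_{hh} \in \mathbb{R}^{4 \times 4}$
Bias $b \in \mathbb{R}^4$

Here $W = [W_{ih}, W_{hh}]$, so $W \cdot [x_t, h_{t-1}] = W_{ih} \cdot x_t + W_{hh} \cdot h_{t-1}$.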

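2.2 Worked Example

Feature extraction for the name "Bob"
Assumptions:

Character encoding: 'B'=[0.1,0,0], 'o'=[0,0.1,0], 'b'=[0,0,0.1]
RNN parameters (simplified for illustration): $W_{ih} = [[1,0,0],[0,1,0],[0,0,1],[1,1,1]]$
$W_{hh} = [[1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1]]$
Bias $b = 0$ and initial state $h_0 = [0,0,0,0]$

Computation (values are approximate):

Time step 1: input 'B'
$h_1 = \tanh(W_{ih} \cdot 'B' + W_{hh} \cdot h_0) = \tanh([0.1, 0, 0, 0.1])$
$h_1 \approx [0.099, 0, 0, 0.099]$
Time step 2: input 'o'
$h_2 = \tanh(W_{ih} \cdot 'o' + W_{hh} \cdot h_1)$
$= \tanh([0.099, 0.1, 0, 0.1+0.099]) \approx [0.099, 0.099, 0, 0.197]$
Time step 3: input 'b'
$h_3 = \tanh(W_{ih} \cdot 'b' + W_{hh} \cdot h_2) = \tanh([0.099, 0.099, 0.1, 0.1+0.197]) \approx [0.099, 0.099, 0.099, 0.289]$

Because $W_{hh}$ is the identity matrix, every component of the previous hidden state carries forward into the next pre-activation. The final hidden state $h_3$ therefore depends on all three characters and serves as the sequence feature representation of "Bob".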
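The three time steps above can be reproduced with a few lines of plain Python. This is only a sketch under the example's assumptions (zero bias, zero initial state); the helper names `matvec`, `rnn_step`, and `encode` are made up for illustration and are not part of any library.

```python
import math

# Toy parameters from the "Bob" example; the bias is assumed to be zero.
W_IH = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]              # 4x3
W_HH = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # 4x4 identity
CHARS = {"B": [0.1, 0, 0], "o": [0, 0.1, 0], "b": [0, 0, 0.1]}


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_step(x, h_prev):
    """One vanilla RNN step: h_t = tanh(W_ih x_t + W_hh h_{t-1})."""
    pre = [a + b for a, b in zip(matvec(W_IH, x), matvec(W_HH, h_prev))]
    return [math.tanh(p) for p in pre]


def encode(word):
    """Run the RNN over a word and return the hidden state of every step."""
    h = [0.0] * 4  # h_0 is all zeros
    states = []
    for ch in word:
        h = rnn_step(CHARS[ch], h)
        states.append(h)
    return states


h1, h2, h3 = encode("Bob")
for t, h in enumerate((h1, h2, h3), start=1):
    print(f"h_{t} =", [round(v, 3) for v in h])
```

Printing every intermediate state makes it easy to see that the first three components are each set once by their own character, while the fourth grows at every step.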

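2.3 The RNN API in PyTorch

Class definition and core parameters

torch.nn.RNN(
    input_size,           # number of input features
    hidden_size,          # number of hidden-state features
    num_layers=1,         # number of stacked RNN layers
    nonlinearity='tanh',  # activation function, 'tanh' or 'relu'
    bias=True,            # whether to use bias terms
    batch_first=False,    # whether the input layout is (batch, seq, feature)
    dropout=0,            # dropout probability between stacked layers
    bidirectional=False   # whether the RNN is bidirectional
)

Input and output formats

Inputs:
- input: the input sequence, with shape (seq_len, batch, input_size) by default or (batch, seq_len, input_size) when batch_first=True
- h_0: the initial hidden state, with shape (num_layers * num_directions, batch, hidden_size); batch_first does not affect it
Outputs:
- output: the hidden states of the last layer at every time step, with shape (seq_len, batch, hidden_size * num_directions), or (batch, seq_len, hidden_size * num_directions) when batch_first=True
- h_n: the hidden state at the final time step, with the same shape as h_0

Key attributes and methods

Weight matrices:
- weight_ih_l[k]: input-to-hidden weights of layer k
- weight_hh_l[k]: hidden-to-hidden weights of layer k
- bias_ih_l[k] and bias_hh_l[k]: the corresponding biases
Forward pass:

output, h_n = rnn(input, h_0)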
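The shape rules above are easy to mix up, so here is a small pure-Python sketch that encodes them. `rnn_shapes` is a hypothetical helper written for this article, not a PyTorch function; it only restates the documented shape conventions.

```python
def rnn_shapes(seq_len, batch, hidden_size, num_layers=1,
               bidirectional=False, batch_first=False):
    """Return the expected (output_shape, h_n_shape) of an nn.RNN call."""
    num_directions = 2 if bidirectional else 1
    if batch_first:
        output_shape = (batch, seq_len, hidden_size * num_directions)
    else:
        output_shape = (seq_len, batch, hidden_size * num_directions)
    # The hidden state keeps the same layout regardless of batch_first.
    h_n_shape = (num_layers * num_directions, batch, hidden_size)
    return output_shape, h_n_shape


out_shape, hn_shape = rnn_shapes(seq_len=5, batch=3, hidden_size=20,
                                 num_layers=2, bidirectional=True,
                                 batch_first=True)
print("output:", out_shape)
print("h_n:   ", hn_shape)
```

A helper like this can be handy in unit tests, where the shapes a model actually returns can be compared against the expected ones.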

2.4 Code Example

Basic usage

import torch
import torch.nn as nn

# Create the RNN model
rnn = nn.RNN(
    input_size=10,       # input feature dimension
    hidden_size=20,      # hidden state dimension
    num_layers=2,        # two stacked RNN layers
    batch_first=True,    # use the (batch, seq, feature) layout
    bidirectional=True,  # bidirectional RNN
)

# Prepare the input
batch_size = 3
seq_len = 5
x = torch.randn(batch_size, seq_len, 10)  # input sequence

# Initialize the hidden state (optional; defaults to zeros)
h0 = torch.zeros(2 * 2, batch_size, 20)  # num_layers * num_directions

# Forward pass
output, hn = rnn(x, h0)

# Inspect the output shapes
print("Output shape:", output.shape)      # (3, 5, 40)  [batch, seq, hidden*2]
print("Final hidden shape:", hn.shape)    # (4, 3, 20)  [layers*directions, batch, hidden]

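Getting the hidden state of the last time step

# Method 1: slice the last time step out of output
last_output = output[:, -1, :]  # (batch, hidden * num_directions)

# Method 2: read it from hn (the last layer sits at the end of dim 0)
if rnn.bidirectional:
    # hn[-2] is the last layer's forward state, hn[-1] its backward state
    last_hidden = torch.cat([hn[-2], hn[-1]], dim=1)  # (batch, hidden * 2)
else:
    last_hidden = hn[-1]  # (batch, hidden)

# Note: for a bidirectional model the two methods differ in the backward half.
# output[:, -1, hidden:] has only seen the last token, while hn[-1] has seen
# the whole sequence (it equals output[:, 0, hidden:]).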

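2.5 Strengths, Weaknesses, and the Gradient Problem

Strengths: simple structure, few parameters, and efficient computation on short sequences
Critical weakness: gradients vanish severely on long sequences. For example:
Gradient as a product over time steps: $\nabla = \prod_{i=1}^n \sigma'(z_i) \cdot w_i$
When $|\sigma'(z_i) \cdot w_i| < 1$ at most steps, the product tends to 0 and the parameters stop learning from distant time steps; when the factors exceed 1, the product can explode instead.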
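To get a feel for how quickly such a product shrinks or grows, the sketch below evaluates it numerically. It assumes a scalar recurrence with tanh activation and, as a deliberate simplification, the same pre-activation and weight at every step; `gradient_factor` is an illustrative name, not a library function.

```python
import math


def tanh_grad(z):
    """Derivative of tanh at z."""
    return 1.0 - math.tanh(z) ** 2


def gradient_factor(weight, pre_activation, steps):
    """Product of `steps` identical per-step factors tanh'(z) * w."""
    return (tanh_grad(pre_activation) * weight) ** steps


for steps in (5, 20, 50):
    shrinking = gradient_factor(weight=0.9, pre_activation=0.5, steps=steps)
    growing = gradient_factor(weight=1.5, pre_activation=0.0, steps=steps)
    print(f"{steps:>2} steps: factor<1 -> {shrinking:.3e}   factor>1 -> {growing:.3e}")
```

In a real network the factors differ from step to step and are matrices rather than scalars, but the same geometric behavior is what makes long sequences hard for a vanilla RNN.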

3. LSTM: Gating to Solve Long-Term Dependencies


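3.1 The Four Gating Components in Detail

By introducing a gating system, LSTM splits the single hidden state of the vanilla RNN into:

Cell state C: the carrier of long-term memory
Hidden state h: the short-term feature representation

Core equations:

Forget gate: decides which history to discard
- Role: decides which historical information in the cell state is dropped.
- Computation:
  - The current input $x_t$ and the previous hidden state $h_{t-1}$ are concatenated and passed through a fully connected layer with sigmoid activation: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
  - $f_t$ is a gate value between 0 and 1: 1 means "keep everything", 0 means "forget everything".
Input gate: filters new information
- Role: decides which parts of the new information from the current input are stored in the cell state.
- Computation:
  - Compute the input gate value $i_t$ (like the forget gate, with sigmoid activation): $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
  - Compute the candidate cell state $\tilde{C}_t$ (with tanh activation): $\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$
Cell state update:
- Role: stores long-term memory and updates it through the gates: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
- Update process:
  - $f_t \odot C_{t-1}$: the forget gate acts on the old cell state and drops part of the history;
  - $i_t \odot \tilde{C}_t$: the input gate selects which parts of the candidate state are written.
Output gate: produces the current hidden state
- Role: decides which information in the cell state is exposed as the current hidden state.
- Computation:
  - Compute the output gate value: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
  - Pass the cell state through tanh and multiply by the output gate to get the hidden state: $h_t = o_t \odot \tanh(C_t)$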
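The six equations above fit into one short function. The following is a minimal pure-Python sketch, not PyTorch's implementation: `lstm_step`, `affine`, and the `params` dictionary layout are assumptions made for illustration. The demo uses a one-dimensional cell with hand-picked biases so that the forget gate is almost 1 and the input gate almost 0.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W v + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; `params` maps 'f', 'i', 'c', 'o' to (W, b) pairs."""
    hx = list(h_prev) + list(x)  # concatenation [h_{t-1}, x_t]

    def layer(name, activation):
        W, b = params[name]
        return [activation(z) for z in affine(W, hx, b)]

    f = layer("f", sigmoid)          # forget gate
    i = layer("i", sigmoid)          # input gate
    c_tilde = layer("c", math.tanh)  # candidate cell state
    o = layer("o", sigmoid)          # output gate
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


# Hidden size 1, input size 1. Zero weights, so the biases alone set the gates.
zero_w = [[0.0, 0.0]]
params = {
    "f": (zero_w, [20.0]),   # forget gate close to 1: keep the old memory
    "i": (zero_w, [-20.0]),  # input gate close to 0: write nothing new
    "c": (zero_w, [0.0]),
    "o": (zero_w, [0.0]),    # output gate = 0.5
}
h, c = lstm_step(x=[1.0], h_prev=[0.0], c_prev=[0.8], params=params)
print("cell state:", c, "hidden state:", h)
```

With these settings the cell state passes through the step essentially unchanged, which is the behavior the forget gate is designed to make possible.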

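3.2 Complete LSTM Worked Example for "Bob"

To make the comparison with the RNN easy, we keep the same input and hidden dimensions and use similar parameter settings:

Assumptions:

Input dimension = 3, hidden dimension = 4
Character encoding: 'B'=[0.1,0,0], 'o'=[0,0.1,0], 'b'=[0,0,0.1]
All biases are zero, and $h_0 = C_0 = [0,0,0,0]$

LSTM parameters (simplified):

W_f = [[1,0,0,0,1,0,0], [0,1,0,0,0,1,0], [0,0,1,0,0,0,1], [0,0,0,1,0,0,0]]
W_i = [[1,0,0,0,1,0,0], [0,1,0,0,0,1,0], [0,0,1,0,0,0,1], [0,0,0,1,0,0,0]]
W_c = [[1,0,0,0,1,0,0], [0,1,0,0,0,1,0], [0,0,1,0,0,0,1], [0,0,0,1,0,0,0]]
W_o = [[1,0,0,0,1,0,0], [0,1,0,0,0,1,0], [0,0,1,0,0,0,1], [0,0,0,1,0,0,0]]

(Note: each weight matrix is 4x7 because it acts on the concatenation of $h_{t-1}$ (4 dimensions) and $x_t$ (3 dimensions). With these entries, $W \cdot [h_{t-1}, x_t]$ adds $x_t$ to the first three components of $h_{t-1}$, which is what the calculations below use.)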

Detailed computation (values are approximate; each step uses the rounded results of the previous one):

Time step 1: input 'B' = [0.1, 0, 0]

Forget gate:
$f_1 = \sigma(W_f \cdot [h_0, 'B'] + b_f)$
With $b_f=[0,0,0,0]$:
$W_f \cdot [h_0, 'B'] = W_f \cdot [0,0,0,0,0.1,0,0]^T = [0.1, 0, 0, 0]$
$f_1 = \sigma([0.1, 0, 0, 0]) = [0.525, 0.5, 0.5, 0.5]$
Input gate:
$i_1 = \sigma(W_i \cdot [h_0, 'B'] + b_i) = [0.525, 0.5, 0.5, 0.5]$
$\tilde{C}_1 = \tanh(W_c \cdot [h_0, 'B'] + b_c) = \tanh([0.1, 0, 0, 0]) = [0.099, 0, 0, 0]$
Cell state update:
$C_1 = f_1 \odot C_0 + i_1 \odot \tilde{C}_1$
$= [0.525, 0.5, 0.5, 0.5] \odot [0,0,0,0] + [0.525, 0.5, 0.5, 0.5] \odot [0.099, 0, 0, 0]$
$= [0.052, 0, 0, 0]$
Output gate and hidden state:
$o_1 = \sigma(W_o \cdot [h_0, 'B'] + b_o) = [0.525, 0.5, 0.5, 0.5]$
$h_1 = o_1 \odot \tanh(C_1) = [0.525, 0.5, 0.5, 0.5] \odot [0.052, 0, 0, 0] = [0.027, 0, 0, 0]$

Time step 2: input 'o' = [0, 0.1, 0]

Forget gate:
$W_f \cdot [h_1, 'o'] = W_f \cdot [0.027,0,0,0,0,0.1,0]^T = [0.027, 0.1, 0, 0]$
$f_2 = \sigma([0.027, 0.1, 0, 0]) = [0.507, 0.525, 0.5, 0.5]$
Input gate:
$i_2 = \sigma([0.027, 0.1, 0, 0]) = [0.507, 0.525, 0.5, 0.5]$
$\tilde{C}_2 = \tanh([0.027, 0.1, 0, 0]) = [0.027, 0.099, 0, 0]$
Cell state update:
$C_2 = f_2 \odot C_1 + i_2 \odot \tilde{C}_2$
$= [0.026, 0, 0, 0] + [0.014, 0.052, 0, 0] = [0.04, 0.052, 0, 0]$
Output gate and hidden state:
$o_2 = \sigma([0.027, 0.1, 0, 0]) = [0.507, 0.525, 0.5, 0.5]$
$h_2 = o_2 \odot \tanh(C_2) = [0.507, 0.525, 0.5, 0.5] \odot [0.04, 0.052, 0, 0] = [0.02, 0.027, 0, 0]$

Time step 3: input 'b' = [0, 0, 0.1]

Forget gate:
$W_f \cdot [h_2, 'b'] = [0.02, 0.027, 0.1, 0]$
$f_3 = \sigma([0.02, 0.027, 0.1, 0]) = [0.505, 0.507, 0.525, 0.5]$
Input gate:
$i_3 = \sigma([0.02, 0.027, 0.1, 0]) = [0.505, 0.507, 0.525, 0.5]$
$\tilde{C}_3 = \tanh([0.02, 0.027, 0.1, 0]) = [0.02, 0.027, 0.099, 0]$
Cell state update:
$C_3 = f_3 \odot C_2 + i_3 \odot \tilde{C}_3$
$= [0.02, 0.026, 0, 0] + [0.01, 0.014, 0.052, 0] = [0.03, 0.04, 0.052, 0]$
Output gate and hidden state:
$o_3 = \sigma([0.02, 0.027, 0.1, 0]) = [0.505, 0.507, 0.525, 0.5]$
$h_3 = o_3 \odot \tanh(C_3) = [0.015, 0.02, 0.027, 0]$

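Final result comparison

Model	Feature representation of "Bob" (final hidden state)
RNN	[0.099, 0.099, 0.099, 0.289]
LSTM	[0.015, 0.02, 0.027, 0]

(The RNN values are computed with $W_{hh} = I$, so every component of the previous hidden state carries forward to the next step.)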
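The LSTM row can be checked with a short script. This is a sketch under stated assumptions: all four layers share one 4x7 matrix `W` that adds $x_t$ onto the first three components of $h_{t-1}$ (the behavior the walkthrough's pre-activations rely on), all biases are zero, and $h_0 = C_0 = 0$. Because it uses exact arithmetic rather than rounded intermediates, the results can differ from the walkthrough in the third decimal.

```python
import math

# Assumed shared weight matrix (4x7), acting on [h_{t-1}, x_t].
W = [
    [1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 0, 0],
]
CHARS = {"B": [0.1, 0.0, 0.0], "o": [0.0, 0.1, 0.0], "b": [0.0, 0.0, 0.1]}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev):
    """One LSTM step in which every layer shares the matrix W."""
    hx = h_prev + x  # list concatenation [h_{t-1}, x_t]
    pre = [sum(w * v for w, v in zip(row, hx)) for row in W]
    gate = [sigmoid(p) for p in pre]       # f_t = i_t = o_t in this toy setup
    c_tilde = [math.tanh(p) for p in pre]  # candidate cell state
    c = [g * cp + g * ct for g, cp, ct in zip(gate, c_prev, c_tilde)]
    h = [g * math.tanh(ct) for g, ct in zip(gate, c)]
    return h, c


h, c = [0.0] * 4, [0.0] * 4
for ch in "Bob":
    h, c = lstm_step(CHARS[ch], h, c)
    print(ch, "h =", [round(v, 3) for v in h], "C =", [round(v, 3) for v in c])
```

Sharing one matrix across all layers is purely a convenience of this toy example; a trained LSTM learns separate weights for each gate.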

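Key differences:

How information is retained:
- The RNN adds each new input straight onto the previous state; nothing controls how much history is kept, and the last component grows at every step
- The LSTM passes every contribution through its gates. In this toy example the untrained gates sit near 0.5, so all values are smaller and older characters are attenuated a little more at each step; a trained LSTM can learn gate values that keep or drop information selectively
Gradient flow:
- The RNN gradient is a repeated product of $\tanh$ derivatives (at most 1) and $W_{hh}$, so it decays easily
- The LSTM cell state passes gradients through $f_t$; when $f_t$ is close to 1, they do not vanish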

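3.3 The LSTM API in PyTorch

Class definition and core parameters

torch.nn.LSTM(
    input_size,           # number of input features
    hidden_size,          # number of hidden-state features
    num_layers=1,         # number of stacked LSTM layers
    bias=True,            # whether to use bias terms
    batch_first=False,    # whether the input layout is (batch, seq, feature)
    dropout=0,            # dropout probability between stacked layers
    bidirectional=False   # whether the LSTM is bidirectional
)

Input and output formats

Inputs:
- input: the input sequence, with shape (seq_len, batch, input_size) by default or (batch, seq_len, input_size) when batch_first=True
- h_0: the initial hidden state, with shape (num_layers * num_directions, batch, hidden_size)
- c_0: the initial cell state, with shape (num_layers * num_directions, batch, hidden_size)
Outputs:
- output: the hidden states of the last layer at every time step, with shape (seq_len, batch, hidden_size * num_directions), or (batch, seq_len, hidden_size * num_directions) when batch_first=True
- h_n: the hidden state at the final time step, with the same shape as h_0
- c_n: the cell state at the final time step, with the same shape as c_0

Key attributes and methods

Weight matrices:
- weight_ih_l[k]: input-to-hidden weights of layer k (the four blocks for the input gate, forget gate, candidate, and output gate stacked together)
- weight_hh_l[k]: hidden-to-hidden weights of layer k (the same four blocks stacked together)
- bias_ih_l[k] and bias_hh_l[k]: the corresponding biases
Forward pass:

output, (h_n, c_n) = lstm(input, (h_0, c_0))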

3.4 Code Example

Basic usage

import torch
import torch.nn as nn

# Create the LSTM model
lstm = nn.LSTM(
    input_size=10,       # input feature dimension
    hidden_size=20,      # hidden state dimension
    num_layers=2,        # two stacked LSTM layers
    batch_first=True,    # use the (batch, seq, feature) layout
    bidirectional=True,  # bidirectional LSTM
)

# Prepare the input
batch_size = 3
seq_len = 5
x = torch.randn(batch_size, seq_len, 10)  # input sequence

# Initialize the hidden and cell states (optional; both default to zeros)
h0 = torch.zeros(2 * 2, batch_size, 20)  # num_layers * num_directions
c0 = torch.zeros(2 * 2, batch_size, 20)

# Forward pass
output, (hn, cn) = lstm(x, (h0, c0))

# Inspect the output shapes
print("Output shape:", output.shape)      # (3, 5, 40)  [batch, seq, hidden*2]
print("Final hidden shape:", hn.shape)    # (4, 3, 20)  [layers*directions, batch, hidden]
print("Final cell shape:", cn.shape)      # (4, 3, 20)

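Getting the hidden state of the last time step

# Method 1: slice the last time step out of output
last_output = output[:, -1, :]  # (batch, hidden * num_directions)

# Method 2: read it from hn (the last layer sits at the end of dim 0)
if lstm.bidirectional:
    # hn[-2] is the last layer's forward state, hn[-1] its backward state
    last_hidden = torch.cat([hn[-2], hn[-1]], dim=1)  # (batch, hidden * 2)
else:
    last_hidden = hn[-1]  # (batch, hidden)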

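3.5 The Mathematical Essence of Gating

Through its additive update ($C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$), LSTM gives gradients a "direct path" and avoids the repeated-multiplication decay of the vanilla RNN. Along this path, treating the gate values as constants:
$\frac{\partial C_t}{\partial C_{t-1}} = f_t$
When $f_t$ is close to 1, the gradient reaches distant time steps almost unchanged; this is the core reason LSTM can handle long-term dependencies.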
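Unrolling this relation over several steps shows why the size of $f_t$ matters so much. The expression below is a sketch that follows only the direct cell-state path and ignores the indirect dependence of the gates on $h_{t-1}$; the two numeric cases assume, purely for illustration, a constant gate value over 100 steps.

```latex
\frac{\partial C_T}{\partial C_t} \approx \prod_{k=t+1}^{T} f_k ,
\qquad
0.99^{100} \approx 0.37 ,
\qquad
0.5^{100} \approx 7.9 \times 10^{-31}
```

A forget gate that stays near 1 keeps about a third of the gradient after 100 steps, while a gate stuck at 0.5 erases it completely.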

4. GRU: A Lightweight Evolution of LSTM


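4.1 The Simplified Two-Gate Structure

GRU simplifies the gate structure of LSTM down to two gates and drops the separate cell state:

Update gate: controls how the previous state and the new candidate are mixed
$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$
Reset gate: controls how much of the previous state enters the candidate
$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$

Core equations:

Candidate state: $\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t] + b)$
State update: $h_t = (1-z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$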
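The four equations above translate directly into a small function. This is a minimal pure-Python sketch using this article's update convention, $h_t = (1-z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$; `gru_step`, `affine`, and the `params` layout are illustrative assumptions, not a library API. The demo uses a one-dimensional state and hand-picked biases to push the update gate to its two extremes.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W v + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def gru_step(x, h_prev, params):
    """One GRU step; `params` maps 'z', 'r', 'h' to (W, b) pairs."""
    hx = list(h_prev) + list(x)                        # [h_{t-1}, x_t]
    z = [sigmoid(v) for v in affine(*params["z"][:1], hx, params["z"][1])]
    r = [sigmoid(v) for v in affine(*params["r"][:1], hx, params["r"][1])]
    reset_hx = [ri * hi for ri, hi in zip(r, h_prev)] + list(x)
    h_tilde = [math.tanh(v)
               for v in affine(params["h"][0], reset_hx, params["h"][1])]
    return [(1 - zi) * hp + zi * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]


def make_params(update_bias):
    """Hidden size 1, input size 1; only the update-gate bias varies."""
    return {
        "z": ([[0.0, 0.0]], [update_bias]),
        "r": ([[0.0, 0.0]], [0.0]),
        "h": ([[0.0, 1.0]], [0.0]),  # candidate looks only at x_t
    }


kept = gru_step([0.3], [0.7], make_params(-20.0))     # z close to 0
replaced = gru_step([0.3], [0.7], make_params(20.0))  # z close to 1
print("z near 0 keeps the old state:   ", kept)
print("z near 1 takes the candidate:   ", replaced)
```

With z near 0 the old state is copied forward untouched; with z near 1 it is overwritten by the candidate. Some libraries define the update gate the other way round, so it is worth checking the convention before comparing numbers.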

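4.2 Complete GRU Worked Example for "Bob"

To make the comparison with the RNN and LSTM easy, we keep the same input and hidden dimensions and use similar parameter settings:

Assumptions:

Input dimension = 3, hidden dimension = 4
Character encoding: 'B'=[0.1,0,0], 'o'=[0,0.1,0], 'b'=[0,0,0.1]
All biases are zero, and $h_0 = [0,0,0,0]$

GRU parameters (simplified):

W_z = [[1,0,0,0,1,0,0], [0,1,0,0,0,1,0], [0,0,1,0,0,0,1], [0,0,0,1,0,0,0]]
W_r = [[1,0,0,0,1,0,0], [0,1,0,0,0,1,0], [0,0,1,0,0,0,1], [0,0,0,1,0,0,0]]
W_h = [[1,0,0,0,1,0,0], [0,1,0,0,0,1,0], [0,0,1,0,0,0,1], [0,0,0,1,0,0,0]]

(Note: each weight matrix is 4x7 because it acts on the concatenation of $h_{t-1}$ (4 dimensions) and $x_t$ (3 dimensions). With these entries, $W \cdot [h_{t-1}, x_t]$ adds $x_t$ to the first three components of $h_{t-1}$, which is what the calculations below use.)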

Detailed computation (values are approximate):

Time step 1: input 'B' = [0.1, 0, 0]

Update gate:
$z_1 = \sigma(W_z \cdot [h_0, 'B'] + b_z)$
With $b_z=[0,0,0,0]$:
$W_z \cdot [h_0, 'B'] = W_z \cdot [0,0,0,0,0.1,0,0]^T = [0.1, 0, 0, 0]$
$z_1 = \sigma([0.1, 0, 0, 0]) = [0.525, 0.5, 0.5, 0.5]$
Reset gate:
$r_1 = \sigma(W_r \cdot [h_0, 'B'] + b_r) = [0.525, 0.5, 0.5, 0.5]$
Candidate hidden state:
$\tilde{h}_1 = \tanh(W_h \cdot [r_1 \odot h_0, 'B'] + b_h)$
$= \tanh([0.1, 0, 0, 0]) = [0.099, 0, 0, 0]$
Final hidden state:
$h_1 = (1-z_1) \odot h_0 + z_1 \odot \tilde{h}_1$
$= [0.475, 0.5, 0.5, 0.5] \odot [0,0,0,0] + [0.525, 0.5, 0.5, 0.5] \odot [0.099, 0, 0, 0]$
$= [0.052, 0, 0, 0]$

Time step 2: input 'o' = [0, 0.1, 0]

Update gate:
$W_z \cdot [h_1, 'o'] = W_z \cdot [0.052,0,0,0,0,0.1,0]^T = [0.052, 0.1, 0, 0]$
$z_2 = \sigma([0.052, 0.1, 0, 0]) = [0.513, 0.525, 0.5, 0.5]$
Reset gate:
$r_2 = \sigma([0.052, 0.1, 0, 0]) = [0.513, 0.525, 0.5, 0.5]$
Candidate hidden state:
$\tilde{h}_2 = \tanh(W_h \cdot [r_2 \odot h_1, 'o'] + b_h)$
$= \tanh([0.027, 0.1, 0, 0]) = [0.027, 0.099, 0, 0]$
Final hidden state:
$h_2 = (1-z_2) \odot h_1 + z_2 \odot \tilde{h}_2$
$= [0.025, 0, 0, 0] + [0.014, 0.052, 0, 0] = [0.039, 0.052, 0, 0]$

Time step 3: input 'b' = [0, 0, 0.1]

Update gate:
$W_z \cdot [h_2, 'b'] = [0.039, 0.052, 0.1, 0]$
$z_3 = \sigma([0.039, 0.052, 0.1, 0]) = [0.51, 0.513, 0.525, 0.5]$
Reset gate:
$r_3 = \sigma([0.039, 0.052, 0.1, 0]) = [0.51, 0.513, 0.525, 0.5]$
Candidate hidden state:
$\tilde{h}_3 = \tanh(W_h \cdot [r_3 \odot h_2, 'b'] + b_h)$
$= \tanh([0.02, 0.027, 0.1, 0]) = [0.02, 0.027, 0.099, 0]$
Final hidden state:
$h_3 = (1-z_3) \odot h_2 + z_3 \odot \tilde{h}_3$
$= [0.019, 0.025, 0, 0] + [0.01, 0.014, 0.052, 0] = [0.029, 0.039, 0.052, 0]$

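Final feature representations of the three models

Model	Feature representation of "Bob" (final hidden state)
RNN	[0.099, 0.099, 0.099, 0.289]
LSTM	[0.015, 0.02, 0.027, 0]
GRU	[0.029, 0.039, 0.052, 0]

(The RNN values are computed with $W_{hh} = I$, so every component of the previous hidden state carries forward to the next step. All values are approximate.)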
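The GRU row can be checked the same way. The script below is a sketch under stated assumptions: the update gate, reset gate, and candidate all share one 4x7 matrix `W` that adds $x_t$ onto the first three components of $h_{t-1}$, biases are zero, and $h_0 = 0$. It uses exact arithmetic, so its output can differ from a hand calculation with rounded intermediates by about 0.001.

```python
import math

# Assumed shared weight matrix (4x7), acting on [h_{t-1}, x_t].
W = [
    [1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 0, 0],
]
CHARS = {"B": [0.1, 0.0, 0.0], "o": [0.0, 0.1, 0.0], "b": [0.0, 0.0, 0.1]}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def apply_w(vector):
    """Multiply the shared matrix W by a 7-dimensional vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in W]


def gru_step(x, h_prev):
    """One GRU step: h_t = (1 - z) * h_{t-1} + z * h_tilde."""
    gate = [sigmoid(p) for p in apply_w(h_prev + x)]  # z_t = r_t here
    reset_h = [g * hp for g, hp in zip(gate, h_prev)]
    h_tilde = [math.tanh(p) for p in apply_w(reset_h + x)]
    return [(1 - g) * hp + g * ht for g, hp, ht in zip(gate, h_prev, h_tilde)]


h = [0.0] * 4
for ch in "Bob":
    h = gru_step(CHARS[ch], h)
    print(ch, "h =", [round(v, 3) for v in h])
```

Compared with an LSTM version of the same script, there is no cell state to carry and one fewer gated layer to evaluate per step.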

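Key differences:

How information is fused:
- The RNN adds every input straight onto the previous state, with no gating to control how much history is kept
- The LSTM keeps long-term memory in a separate cell state, at the cost of more computation
- The GRU uses its update gate to balance history against the current input dynamically, with less computation than the LSTM
Parameter efficiency:
- A GRU has about 3/4 of the parameters of an LSTM with the same sizes (three weight blocks instead of four) and usually trains faster
- On short-sequence tasks, a GRU often reaches performance close to an LSTM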
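The parameter ratio can be worked out by counting. The sketch below assumes a single unidirectional layer laid out the way PyTorch stores it (an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors, each stacked once per gate block); `recurrent_param_count` is a hypothetical helper for illustration.

```python
def recurrent_param_count(input_size, hidden_size, num_blocks):
    """Parameters of one unidirectional recurrent layer.

    num_blocks is 1 for a vanilla RNN, 3 for a GRU, and 4 for an LSTM.
    """
    weight_ih = num_blocks * hidden_size * input_size
    weight_hh = num_blocks * hidden_size * hidden_size
    biases = 2 * num_blocks * hidden_size  # bias_ih and bias_hh
    return weight_ih + weight_hh + biases


BLOCKS = {"RNN": 1, "GRU": 3, "LSTM": 4}
counts = {name: recurrent_param_count(10, 20, n) for name, n in BLOCKS.items()}
for name, count in counts.items():
    print(f"{name:<4} {count:>5} parameters")
print("GRU / LSTM =", counts["GRU"] / counts["LSTM"])
```

Because every term scales with the number of blocks, the GRU-to-LSTM ratio is 3/4 regardless of the input and hidden sizes chosen.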

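4.3 The GRU API in PyTorch

Class definition and core parameters

torch.nn.GRU(
    input_size,           # number of input features
    hidden_size,          # number of hidden-state features
    num_layers=1,         # number of stacked GRU layers
    bias=True,            # whether to use bias terms
    batch_first=False,    # whether the input layout is (batch, seq, feature)
    dropout=0,            # dropout probability between stacked layers
    bidirectional=False   # whether the GRU is bidirectional
)

Input and output formats

Inputs:
- input: the input sequence, with shape (seq_len, batch, input_size) by default or (batch, seq_len, input_size) when batch_first=True
- h_0: the initial hidden state, with shape (num_layers * num_directions, batch, hidden_size)
Outputs:
- output: the hidden states of the last layer at every time step, with shape (seq_len, batch, hidden_size * num_directions), or (batch, seq_len, hidden_size * num_directions) when batch_first=True
- h_n: the hidden state at the final time step, with the same shape as h_0

Key attributes and methods

Weight matrices:
- weight_ih_l[k]: input-to-hidden weights of layer k (the reset, update, and candidate blocks stacked together)
- weight_hh_l[k]: hidden-to-hidden weights of layer k (the same three blocks stacked together)
- bias_ih_l[k] and bias_hh_l[k]: the corresponding biases
Forward pass:

output, h_n = gru(input, h_0)  # unlike LSTM, there is no cell state c_n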

4.4 Code Example

Basic usage

import torch
import torch.nn as nn

# Create the GRU model
gru = nn.GRU(
    input_size=10,       # input feature dimension
    hidden_size=20,      # hidden state dimension
    num_layers=2,        # two stacked GRU layers
    batch_first=True,    # use the (batch, seq, feature) layout
    bidirectional=True,  # bidirectional GRU
)

# Prepare the input
batch_size = 3
seq_len = 5
x = torch.randn(batch_size, seq_len, 10)  # input sequence

# Initialize the hidden state (optional; defaults to zeros)
h0 = torch.zeros(2 * 2, batch_size, 20)  # num_layers * num_directions

# Forward pass
output, hn = gru(x, h0)

# Inspect the output shapes
print("Output shape:", output.shape)      # (3, 5, 40)  [batch, seq, hidden*2]
print("Final hidden shape:", hn.shape)    # (4, 3, 20)  [layers*directions, batch, hidden]

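Getting the hidden state of the last time step

# Method 1: slice the last time step out of output
last_output = output[:, -1, :]  # (batch, hidden * num_directions)

# Method 2: read it from hn (the last layer sits at the end of dim 0)
if gru.bidirectional:
    # hn[-2] is the last layer's forward state, hn[-1] its backward state
    last_hidden = torch.cat([hn[-2], hn[-1]], dim=1)  # (batch, hidden * 2)
else:
    last_hidden = hn[-1]  # (batch, hidden)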