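Notes: [Hands-On Deep Learning, Part 4] Convolutional Neural Networks: From LeNet to ResNet, the Design Logic Behind Receptive Fields, Pooling, and Residual Connections

I have spent the last couple of days working through this topic and hit a few pitfalls along the way, so I wrote up what I ran into for anyone who might find it useful.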

---

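🔥 Goal of this part: the MLP from Part 3 reached 92% on Fashion-MNIST, but it treats the 784 pixels as independent features and completely ignores the spatial structure of the image. This part introduces convolution: it explains from first principles why parameter sharing and local connectivity work for images, derives receptive-field sizes by hand, implements LeNet, and then evolves the design step by step into ResNet. The focus is on why residual connections let a 152-layer network train well, when a plain network of a few dozen layers already degrades. By the end, accuracy reaches 95%+.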

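Series progress

Part               | Topic                                                                    | Status
Part 1             | From NumPy to autodiff: tensors, broadcasting, the chain rule            | ✅ Published
Part 2             | Linear models and optimization: linear regression, Softmax, DataLoader   | ✅ Published
Part 3             | Multilayer perceptrons: activations, backpropagation, Dropout, BatchNorm | ✅ Published
Part 4 (this part) | Convolutional neural networks: the LeNet → ResNet evolution              | -
Part 5             | Recurrent neural networks: LSTM, GRU, language models                    | Coming soon
Part 6             | Attention and Transformers: from Self-Attention to BERT                  | Coming soon
Part 7             | Modern training techniques: Adam, mixed precision, LR scheduling         | Coming soon
Part 8             | End-to-end project: training an image classifier from scratch            | Coming soon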

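1. The Fundamental Flaw of MLPs on Images
2. Convolution: Parameter Sharing and Local Connectivity
3. Key Convolution Parameters: stride, padding, dilation
4. Receptive Fields: How Deep Convolutions "See" the Whole Image
5. Pooling Layers: Downsampling and Translation Invariance
6. LeNet: The First Successful CNN
7. Improvements in Modern CNNs: AlexNet's Design Decisions
8. Residual Connections: Why Deeper Networks Are Harder to Train
9. Full ResNet Implementation
10. End-to-End Practice: Fashion-MNIST Accuracy to 95%+
11. Frequently Asked Interview Questions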

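1. The Fundamental Flaw of MLPs on Images

1.1 Two structural problems

import torch
import torch.nn as nn

# Fashion-MNIST: 28×28 grayscale images
# How an MLP handles them: flatten into a 784-dimensional vector
flatten = nn.Flatten()
img     = torch.randn(1, 1, 28, 28)   # (batch, channel, H, W)
vec     = flatten(img)
print(vec.shape)   # torch.Size([1, 784])

# Problem 1: parameter explosion
# A 784 → 512 linear layer has 784 × 512 = 401,408 weights
# A 224×224 color image has 3×224×224 = 150,528 dimensions
# Its first layer alone would need 150,528 × 512 = 77,070,336 weights (about 77 million)

# Problem 2: spatial structure is lost
# Pixels (i,j) and (i,j+1) are neighbors and strongly correlated
# After flattening, that neighborhood relationship is no longer visible to the model
# Shift the image by a single pixel and the flattened vector changes at almost every index,
# so the MLP sees a brand-new input and has to relearn the same pattern at every location

# Key insight: images have two built-in properties
# ① Locality: neighboring pixels are highly correlated; edges, textures, etc. are local features
# ② Translation invariance: a cat in the top-left or bottom-right corner should trigger the same feature response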

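1.2 How convolution solves this

MLP: every output neuron connects to all input pixels (fully connected)
     parameters = input dimension × output dimension

CNN: every output neuron connects only to the pixels inside a local window (local connectivity)
     the same kernel slides across the whole image (parameter sharing)
     parameters = kernel size × input channels × output channels
                  (independent of image size!)

Advantages:
  ① Local connectivity: exploits local correlation in images and cuts the parameter count
  ② Parameter sharing: the same kernel detects the same kind of feature at every position
     (the weights that detect an edge in the top-left corner are the very same weights used in the bottom-right corner)
  ③ Built-in translation equivariance: shift the input and the feature map shifts with it
     (approximate translation invariance is added later by pooling)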
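
Translation equivariance can be checked on a toy case. The sketch below is a pure-Python 1D illustration, not the code of any library; `cross_correlate_1d` and the sample signals are hypothetical names chosen for this example. It slides one 3-tap kernel over a signal and over a copy of that signal shifted two steps to the right.

```python
def cross_correlate_1d(signal, kernel):
    """Valid (no padding) 1D cross-correlation with a single shared kernel."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

kernel = [-1.0, 0.0, 1.0]                 # 3 weights, reused at every position

signal  = [0, 0, 1, 1, 1, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 1, 1, 0, 0]     # same pattern, moved 2 steps right

out         = cross_correlate_1d(signal, kernel)
out_shifted = cross_correlate_1d(shifted, kernel)

print(out)          # response to the original signal
print(out_shifted)  # same response pattern, moved 2 steps right
```

The response pattern is identical and simply moves with the input, and the kernel still has three weights no matter how long the signal is.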

2. Convolution: Parameter Sharing and Local Connectivity

2.1 How a 2D convolution is computed

import torch
import torch.nn.functional as F

# Manual 2D convolution (single channel, no padding)
def conv2d_manual(input_map, kernel):
    """
    input_map: (H, W)
    kernel:    (kH, kW)
    output:    (H-kH+1, W-kW+1)
    """
    H, W   = input_map.shape
    kH, kW = kernel.shape
    out_H  = H - kH + 1
    out_W  = W - kW + 1
    output = torch.zeros(out_H, out_W)

    for i in range(out_H):
        for j in range(out_W):
            # Take the local window, multiply element-wise with the kernel, then sum
            patch        = input_map[i:i+kH, j:j+kW]
            output[i, j] = (patch * kernel).sum()

    return output

# Horizontal edge-detection kernel
edge_kernel = torch.tensor([
    [-1., -1., -1.],
    [ 0.,  0.,  0.],
    [ 1.,  1.,  1.],
])

# Test image (bright top half, dark bottom half)
test_img = torch.zeros(6, 6)
test_img[:3, :] = 1.0   # top half bright

result = conv2d_manual(test_img, edge_kernel)
print("Input image:")
print(test_img)
print("\nConvolution result (horizontal edge detection):")
print(result)
# The output is 4×4. The two middle rows, whose windows straddle the bright-to-dark edge, are all -3;
# the first and last rows, whose windows cover a uniform region, are 0

# PyTorch's convolution:
# Note that PyTorch actually computes cross-correlation (the kernel is not flipped)
# Deep learning simply calls this "convolution", but the strict mathematical name is cross-correlation
img    = test_img.unsqueeze(0).unsqueeze(0)     # (1, 1, 6, 6)
kernel = edge_kernel.unsqueeze(0).unsqueeze(0)  # (1, 1, 3, 3)
out_pt = F.conv2d(img, kernel)
print("\nPyTorch conv2d result:")
print(out_pt.squeeze())   # matches the manual implementation

2.2 Multi-channel convolution

# A real CNN: multi-channel input (e.g. RGB), multi-channel output (several feature maps)

# Input:  (batch, C_in, H, W)
# Kernel: (C_out, C_in, kH, kW)
# Output: (batch, C_out, H_out, W_out)

# Each output channel = its own kernel (shape C_in × kH × kW) cross-correlated with the input,
# summed over all input channels

batch_size, C_in, H, W = 2, 3, 28, 28
C_out  = 16
kH, kW = 3, 3

# nn.Conv2d wraps multi-channel convolution
conv = nn.Conv2d(
    in_channels  = C_in,
    out_channels = C_out,
    kernel_size  = kH,       # an int means a square kernel; a tuple (kH, kW) also works
    stride       = 1,
    padding      = 1,        # keeps output size = input size (same padding)
    bias         = True,
)

x   = torch.randn(batch_size, C_in, H, W)
out = conv(x)
print(f"Input shape:  {x.shape}")    # (2, 3, 28, 28)
print(f"Output shape: {out.shape}")  # (2, 16, 28, 28)

# Parameter count: C_out × C_in × kH × kW + C_out (bias)
params = C_out * C_in * kH * kW + C_out
print(f"Parameters: {params}")       # 16 × 3 × 3 × 3 + 16 = 448 (far fewer than fully connected)

# For comparison: a fully connected layer doing the same job
# Input 3×28×28 = 2352, output 16×28×28 = 12544
# Weights = 2352 × 12544 = 29,503,488 (about 30 million!)

3. Key Convolution Parameters: stride, padding, dilation

3.1 Output size formula

$$H_{out} = \left\lfloor \frac{H_{in} + 2 \times \text{padding} - \text{dilation} \times (\text{kernel\_size} - 1) - 1}{\text{stride}} + 1 \right\rfloor$$

def conv_output_size(H_in, kernel_size, stride=1, padding=0, dilation=1):
    """Compute the output size of a convolution along one spatial dimension."""
    return (H_in + 2*padding - dilation*(kernel_size-1) - 1) // stride + 1

# Common configurations
print("Output sizes for common configurations (input 28×28):")
print(f"  3×3, s=1, p=0: {conv_output_size(28, 3, 1, 0)} × {conv_output_size(28, 3, 1, 0)}")  # 26×26
print(f"  3×3, s=1, p=1: {conv_output_size(28, 3, 1, 1)} × {conv_output_size(28, 3, 1, 1)}")  # 28×28 (same)
print(f"  3×3, s=2, p=1: {conv_output_size(28, 3, 2, 1)} × {conv_output_size(28, 3, 2, 1)}")  # 14×14 (halved)
print(f"  1×1, s=1, p=0: {conv_output_size(28, 1, 1, 0)} × {conv_output_size(28, 1, 1, 0)}")  # 28×28

3.2 What the three parameters do

# ── stride: how far the window moves each step; it sets the output size ──
# stride=1: output stays close to the input size (no downsampling)
# stride=2: output is about half the input size (downsampling, can replace MaxPool)
conv_s2 = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
out_s2  = conv_s2(torch.randn(1, 3, 28, 28))
print(f"stride=2: {out_s2.shape}")   # (1, 16, 14, 14)

# ── padding: zeros added around the input ──────────────────
# padding=0: with stride=1, the output is (kernel_size-1) pixels smaller than the input
# padding=kernel_size//2: output has the same size as the input (same padding; odd kernel, stride=1 only)
conv_same = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
out_same  = conv_same(torch.randn(1, 3, 28, 28))
print(f"same padding: {out_same.shape}")   # (1, 16, 28, 28)

# ── 1×1 convolution (pointwise convolution) ────────────────
# Leaves H and W unchanged and only changes the channel count
# Purpose: linear mixing across channels (reduce or expand dimensions) at very low cost
conv_1x1 = nn.Conv2d(64, 32, kernel_size=1)   # 64 → 32 channels
out_1x1  = conv_1x1(torch.randn(1, 64, 14, 14))
print(f"1×1 conv: {out_1x1.shape}")   # (1, 32, 14, 14)
# Parameters: 64×32×1×1 = 2,048 weights + 32 biases = 2,080 (very few)

# ── dilation (atrous convolution): larger receptive field, same parameter count ──
# dilation=2: gaps are inserted inside the kernel, so a 3×3 kernel covers a 5×5 area
conv_dil = nn.Conv2d(3, 16, kernel_size=3, dilation=2, padding=2)
out_dil  = conv_dil(torch.randn(1, 3, 28, 28))
print(f"dilation=2: {out_dil.shape}")   # (1, 16, 28, 28)
# Effective kernel size: (3-1)*2+1 = 5 (covers 5×5, but still only 3×3 weights)
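
The padding needed to keep the size unchanged follows directly from the output-size formula. The sketch below is plain Python with hypothetical helper names (`effective_kernel_size`, `same_padding`) and assumes stride 1 and an odd effective kernel size; it checks a few kernel and dilation combinations on a 28-pixel input.

```python
def conv_output_size(h_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output size along one dimension, per the formula in section 3.1."""
    return (h_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

def effective_kernel_size(kernel_size, dilation=1):
    """Span covered by a dilated kernel: d * (k - 1) + 1."""
    return dilation * (kernel_size - 1) + 1

def same_padding(kernel_size, dilation=1):
    """Padding that preserves the size for stride 1 and an odd effective kernel."""
    return (effective_kernel_size(kernel_size, dilation) - 1) // 2

sizes = {}
for k, d in [(3, 1), (5, 1), (3, 2), (3, 4)]:
    p = same_padding(k, d)
    sizes[(k, d)] = conv_output_size(28, k, stride=1, padding=p, dilation=d)
    print(f"k={k}, dilation={d}: effective kernel {effective_kernel_size(k, d)}, "
          f"padding {p}, output {sizes[(k, d)]}")
```

This matches the `dilation=2, padding=2` choice above: the effective kernel is 5, so the padding is (5-1)/2 = 2.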

4. Receptive Fields: How Deep Convolutions "See" the Whole Image

4.1 Definition and calculation

Receptive field: the region of the input image that influences a single pixel of an output feature map.

def calculate_receptive_field(layers: list[dict]) -> list[int]:
    """
    Compute the receptive field after each layer.

    layers: [{"type": "conv/pool", "kernel": k, "stride": s, "dilation": d}]
    """
    rf     = 1   # initial receptive field (a single pixel)
    stride = 1   # cumulative stride

    rfs = [1]
    for layer in layers:
        k = layer.get("kernel", 1)
        s = layer.get("stride",  1)
        d = layer.get("dilation", 1)

        # Effective kernel size (accounts for dilation)
        k_eff = d * (k - 1) + 1

        # Recurrence: RF_new = RF_old + (k_eff - 1) * stride_cumulative
        rf     = rf + (k_eff - 1) * stride
        stride = stride * s
        rfs.append(rf)

    return rfs

# Receptive-field growth in the LeNet-5 architecture
lenet_layers = [
    {"type": "conv", "kernel": 5, "stride": 1},   # Conv1
    {"type": "pool", "kernel": 2, "stride": 2},   # Pool1
    {"type": "conv", "kernel": 5, "stride": 1},   # Conv2
    {"type": "pool", "kernel": 2, "stride": 2},   # Pool2
]

rfs = calculate_receptive_field(lenet_layers)
print("LeNet receptive field per layer:")
print(f"  Input:       {rfs[0]}×{rfs[0]}")
print(f"  After Conv1: {rfs[1]}×{rfs[1]}")
print(f"  After Pool1: {rfs[2]}×{rfs[2]}")
print(f"  After Conv2: {rfs[3]}×{rfs[3]}")
print(f"  After Pool2: {rfs[4]}×{rfs[4]}")
# Input:       1×1
# After Conv1: 5×5
# After Pool1: 6×6   (a 2×2 pool adds (2-1) × cumulative stride 1 = 1)
# After Conv2: 14×14 (each step on Pool1's output spans 2 input pixels: 6 + 4×2)
# After Pool2: 16×16 (14 + 1×2)

# Deep stack vs. one large shallow kernel:
# Two stacked 3×3 convs have the same receptive field as one 5×5 conv (5×5)
# But weights per channel pair: 2×(3×3) = 18  vs  5×5 = 25
# And two 3×3 layers come with two activations, so the stack is more expressive
two_3x3_layers = [
    {"type": "conv", "kernel": 3},
    {"type": "conv", "kernel": 3},
]
one_5x5_layers = [{"type": "conv", "kernel": 5}]

print(f"\nTwo 3×3: receptive field = {calculate_receptive_field(two_3x3_layers)[-1]}")  # 5
print(f"One 5×5: receptive field = {calculate_receptive_field(one_5x5_layers)[-1]}")   # 5
# Same receptive field, but the 3×3 stack has fewer weights and more nonlinearity

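4.2 Intuition for receptive fields

Rough sketch of receptive-field growth in a deep CNN (input 224×224; the exact numbers depend on kernel sizes and on where the stride-2 layers sit):

Depth   Receptive field
1       3×3      ← sees only local texture (edges, color)
3       7×7      ← sees simple shapes (corners, arcs)
5       ~15×15   ← sees local object parts (eyes, wheels)
10      ~32×32   ← sees whole objects
20      100+     ← can see the full image and understand the scene

This is why deep CNNs can classify images:
  Shallow layers: detect edges and textures
  Middle layers:  combine them into parts (ears, car doors)
  Deep layers:    combine parts into whole objects (cats, cars)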
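
How fast the receptive field grows depends heavily on downsampling. The sketch below is a self-contained illustration in plain Python; `receptive_fields` and the two layer stacks are hypothetical examples, not a real architecture. It compares ten stride-1 3×3 convolutions with ten 3×3 convolutions where every second layer uses stride 2.

```python
def receptive_fields(layers):
    """layers: list of (kernel, stride) pairs; returns the receptive field after each layer."""
    rf, jump = 1, 1            # jump = cumulative stride, in input pixels
    history = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        history.append(rf)
    return history

plain   = receptive_fields([(3, 1)] * 10)            # no downsampling
strided = receptive_fields([(3, 1), (3, 2)] * 5)     # stride 2 on every second layer

print("stride-1 stack:   ", plain)
print("with downsampling:", strided)
```

Without downsampling the receptive field only grows linearly (2n+1 after n layers); each stride-2 layer doubles the step size of every later layer, which is how deep networks reach image-sized receptive fields.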

5. Pooling Layers: Downsampling and Translation Invariance

5.1 MaxPool vs AvgPool

# MaxPool2d: take the maximum inside each window
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)   # halves the size
x = torch.tensor([[[[1., 2., 3., 4.],
                    [5., 6., 7., 8.],
                    [9., 10., 11., 12.],
                    [13., 14., 15., 16.]]]])
out_max = max_pool(x)
print("MaxPool2d output:")
print(out_max)
# tensor([[[[ 6.,  8.],
#           [14., 16.]]]])
# the maximum of each 2×2 window

# AvgPool2d: take the mean inside each window
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
out_avg  = avg_pool(x)
print("AvgPool2d output:")
print(out_avg)
# tensor([[[[ 3.5,  5.5],
#           [11.5, 13.5]]]])

# Global average pooling: compress each channel's whole feature map into one value
# Standard in modern CNNs, replacing the Flatten in front of large fully connected layers
gap = nn.AdaptiveAvgPool2d((1, 1))   # output is always 1×1
feat_map = torch.randn(4, 512, 7, 7) # (batch, C, H, W)
out_gap  = gap(feat_map)
print(f"\nGlobalAvgPool: {feat_map.shape} → {out_gap.shape}")  # (4, 512, 1, 1)
# After flattening → (4, 512), ready for the classification head

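5.2 What pooling is for

The two core roles of MaxPool:

① Downsampling (subsampling):
   H×W → (H/2)×(W/2), which cuts the computation of later layers
   Five 2× downsampling steps take 224 → 7 (the ResNet design)

② Translation invariance (local and approximate):
   If a feature moves by 1 pixel but stays inside the same pooling window, the MaxPool output does not change
   This makes the model less sensitive to the exact position of the target

MaxPool vs. stride=2 convolution (the modern trend):
  Traditional: downsample with MaxPool
  Modern (ResNet/EfficientNet): downsample mostly with stride=2 convolutions
  Reason: a strided convolution is learnable and can keep more information,
          while MaxPool hard-codes "take the maximum" and has no learnable parameters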
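
The "local and approximate" nature of this invariance shows up in a tiny 1D case. The sketch below is plain Python; `max_pool_1d` and the three signals are hypothetical names used only for this illustration. A peak moves one step in each direction, once staying inside its pooling window and once crossing into the next one.

```python
def max_pool_1d(xs, size=2, stride=2):
    """Non-overlapping 1D max pooling over a list of numbers."""
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, stride)]

a = [0, 9, 0, 0, 0, 0]   # peak at index 1 (window 0 covers indices 0-1)
b = [9, 0, 0, 0, 0, 0]   # peak moved one step left, still inside window 0
c = [0, 0, 9, 0, 0, 0]   # peak moved one step right, now inside window 1

pooled_a = max_pool_1d(a)
pooled_b = max_pool_1d(b)
pooled_c = max_pool_1d(c)

print(pooled_a, pooled_b, pooled_c)
```

The output is unchanged only while the peak stays within one window; once it crosses a window boundary the pooled output moves too, so pooling reduces sensitivity to position rather than removing it.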

6. LeNet: The First Successful CNN

LeNet-5 (LeCun et al., 1998) was among the first CNNs successfully applied to a real task (handwritten digit recognition), and it established the basic architectural pattern of modern CNNs.

class LeNet5(nn.Module):
    """
    A modernized LeNet-5
    The original used Sigmoid activations and AvgPool; this version uses ReLU and MaxPool
    Input: (batch, 1, 28, 28)
    """

    def __init__(self, num_classes: int = 10):
        super().__init__()

        # Feature extractor (convolutional layers)
        self.features = nn.Sequential(
            # Block 1: 1→6 channels, 28×28 → 14×14
            nn.Conv2d(1, 6, kernel_size=5, padding=2),  # (1,28,28)→(6,28,28)
            nn.ReLU(),
            nn.MaxPool2d(2, stride=2),                  # (6,28,28)→(6,14,14)

            # Block 2: 6→16 channels, 14×14 → 5×5
            nn.Conv2d(6, 16, kernel_size=5),            # (6,14,14)→(16,10,10)
            nn.ReLU(),
            nn.MaxPool2d(2, stride=2),                  # (16,10,10)→(16,5,5)
        )

        # Classifier (fully connected layers)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),   # 400 → 120
            nn.ReLU(),
            nn.Linear(120, 84),            # 120 → 84
            nn.ReLU(),
            nn.Linear(84, num_classes),    # 84 → 10
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(x)

# Check shapes
model   = LeNet5()
x       = torch.randn(4, 1, 28, 28)
logits  = model(x)
print(f"Input: {x.shape} → output: {logits.shape}")  # (4, 10)

# Layer-by-layer shape trace
def trace_shapes(model, x):
    """Print the output shape of every layer during the forward pass."""
    print(f"Input: {x.shape}")
    for name, layer in model.features.named_children():
        x = layer(x)
        print(f"  features.{name} ({layer.__class__.__name__}): {x.shape}")
    for name, layer in model.classifier.named_children():
        x = layer(x)
        print(f"  classifier.{name} ({layer.__class__.__name__}): {x.shape}")

trace_shapes(LeNet5(), torch.randn(1, 1, 28, 28))
# Input: torch.Size([1, 1, 28, 28])
#   features.0 (Conv2d):   torch.Size([1, 6, 28, 28])
#   features.1 (ReLU):     torch.Size([1, 6, 28, 28])
#   features.2 (MaxPool2d):torch.Size([1, 6, 14, 14])
#   features.3 (Conv2d):   torch.Size([1, 16, 10, 10])
#   features.4 (ReLU):     torch.Size([1, 16, 10, 10])
#   features.5 (MaxPool2d):torch.Size([1, 16, 5, 5])
#   classifier.0 (Flatten):torch.Size([1, 400])
#   classifier.1 (Linear): torch.Size([1, 120])
#   ...

# Parameter count
total = sum(p.numel() for p in model.parameters())
print(f"LeNet-5 total parameters: {total:,}")   # 61,706
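
The total can be cross-checked by hand. The sketch below is plain Python with hypothetical helper names (`conv_params`, `linear_params`); the layer sizes are taken from the `LeNet5` class above. It breaks the count down per layer.

```python
def conv_params(c_in, c_out, k, bias=True):
    """Weights (+ biases) of a k×k convolution."""
    return c_out * c_in * k * k + (c_out if bias else 0)

def linear_params(n_in, n_out, bias=True):
    """Weights (+ biases) of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

lenet_breakdown = {
    "conv1 (1→6, 5×5)":   conv_params(1, 6, 5),
    "conv2 (6→16, 5×5)":  conv_params(6, 16, 5),
    "fc1   (400→120)":    linear_params(16 * 5 * 5, 120),
    "fc2   (120→84)":     linear_params(120, 84),
    "fc3   (84→10)":      linear_params(84, 10),
}

lenet_total = sum(lenet_breakdown.values())
fc_share = sum(v for name, v in lenet_breakdown.items() if name.startswith("fc")) / lenet_total

for name, count in lenet_breakdown.items():
    print(f"{name:20s} {count:>7,}")
print(f"{'total':20s} {lenet_total:>7,}  (fully connected share: {fc_share:.1%})")
```

Almost all of the parameters sit in the fully connected classifier, which is one motivation for the global average pooling heads used by later architectures.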

7. Improvements in Modern CNNs: AlexNet's Design Decisions

AlexNet (winner of the 2012 ImageNet competition) introduced several key improvements that are still in use today:

# AlexNet's key improvements (simplified version for Fashion-MNIST)

class AlexNetSmall(nn.Module):
    """
    Simplified AlexNet (for 28×28 inputs)
    Keeps AlexNet's core design ideas
    """

    def __init__(self, num_classes: int = 10):
        super().__init__()

        self.features = nn.Sequential(
            # Improvement 1: a larger first kernel captures a larger receptive field
            nn.Conv2d(1, 32, kernel_size=5, padding=2),   # 5×5 kernel
            nn.ReLU(),   # Improvement 2: ReLU instead of Sigmoid (mitigates vanishing gradients)
            nn.MaxPool2d(2, 2),                            # 28→14

            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),                            # 14→7

            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, 64, kernel_size=3, padding=1),
            nn.ReLU(),
        )

        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d((4, 4)),                  # adaptive pooling to a fixed size
            nn.Flatten(),                                   # 64×4×4 = 1024
            nn.Dropout(0.5),   # Improvement 3: Dropout against overfitting
            nn.Linear(64 * 4 * 4, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Summary of the key changes (LeNet → AlexNet → modern CNNs):
# LeNet:   Sigmoid + AvgPool + FC (the original design)
# AlexNet: ReLU + Dropout + deeper (5 conv layers) + wider (4096-unit FC layers)
# Modern:  BN + residual connections + GlobalAvgPool (replacing large FC layers)

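8. Residual Connections: Why Deeper Networks Are Harder to Train

8.1 The degradation problem

Intuition: deeper network = more expressive power → should be more accurate
Reality:   a 56-layer plain network has higher training error than a 20-layer one!

This is not overfitting (the training error itself is higher); it is an optimization difficulty:

Analysis:
  Suppose the optimal solution is a 20-layer network
  A 56-layer network then needs its extra 36 layers to learn the identity mapping (do nothing)
  But a stack of nonlinear layers has a hard time learning an exact identity mapping!

  Why is it hard?
  Gradient attenuation: the gradient goes through 56 chained multiplications on its way from the output layer to the input layer
                        even with ReLU avoiding Sigmoid-style saturation, the signal can still decay
  (He et al. report that with BatchNorm the gradients of their plain networks kept healthy norms,
   so degradation is better read as a broader optimization difficulty, not vanishing gradients alone)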
8.2 Residual connections: let the network learn the "residual"

ResNet (He et al., 2015) proposed an elegant solution:

$$\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$$

Instead of having the network learn the target mapping $\mathcal{H}(\mathbf{x})$ directly, let it learn the residual $\mathcal{F}(\mathbf{x}) = \mathcal{H}(\mathbf{x}) - \mathbf{x}$.

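Why is the residual easier to learn?

If the optimal solution is the identity mapping (change nothing):
  Plain:    the stacked layers must learn H(x) = x (hard for a stack of nonlinear layers)
  Residual: the layers only need to learn F(x) = 0 (just push the weights toward 0, much simpler)

Gradient flow:
  ∂L/∂x = ∂L/∂y × (∂F/∂x + I)
  The extra identity term I lets the gradient "skip" these layers on the way back
  Even when the gradient of F is tiny, the gradient still propagates through the shortcut path
  This greatly mitigates the vanishing-gradient problem in deep networks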
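
The effect of the identity term can be seen in a scalar toy model. The sketch below is plain Python; `chain_gradient` and the per-layer derivative 0.05 are assumptions made for illustration, with each scalar standing in for a layer's Jacobian. It multiplies the local derivatives of 50 layers with and without the `+ I` term.

```python
def chain_gradient(local_derivs, residual):
    """Product of per-layer factors: dF/dx for plain layers, (dF/dx + 1) for residual layers."""
    grad = 1.0
    for d in local_derivs:
        grad *= (d + 1.0) if residual else d
    return grad

local_derivs = [0.05] * 50        # every layer's own derivative is small

plain_grad    = chain_gradient(local_derivs, residual=False)
residual_grad = chain_gradient(local_derivs, residual=True)

print(f"plain chain:    {plain_grad:.3e}")     # shrinks geometrically toward 0
print(f"residual chain: {residual_grad:.3f}")  # stays on the order of 1-10
```

The same arithmetic shows why residual networks still rely on normalization and careful initialization: if the local derivatives were large, the residual product would grow instead of shrink.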

# Sanity check: gradient flow with and without residual connections
class WithResidual(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(10, 10), nn.ReLU(),
            nn.Linear(10, 10), nn.ReLU(),
        )

    def forward(self, x):
        return self.layers(x) + x   # residual connection!

class WithoutResidual(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(10, 10), nn.ReLU(),
            nn.Linear(10, 10), nn.ReLU(),
        )

    def forward(self, x):
        return self.layers(x)       # no residual

# Simulate a deep network (50 blocks)
def measure_gradient_norm(model_class, num_blocks=50):
    blocks = nn.ModuleList([model_class() for _ in range(num_blocks)])
    x = torch.randn(1, 10, requires_grad=True)

    for block in blocks:
        x = block(x)
    loss = x.sum()
    loss.backward()

    # Gradient norm at the first block (tiny if the gradient has vanished)
    return blocks[0].layers[0].weight.grad.norm().item()

torch.manual_seed(0)
grad_with    = measure_gradient_norm(WithResidual)
grad_without = measure_gradient_norm(WithoutResidual)
print(f"Gradient norm with residual connections:    {grad_with:.6f}")
print(f"Gradient norm without residual connections: {grad_without:.8f}")
# Exact values depend on the seed and on the default initialization
# With residuals:    the first block typically still receives a clearly non-zero gradient
# Without residuals: the gradient norm collapses toward 0 (vanishing gradient)

9. Full ResNet Implementation

9.1 BasicBlock (used by ResNet-18/34)

class BasicBlock(nn.Module):
    """
    Basic ResNet residual block (two 3×3 convolutions)
    Used by ResNet-18 and ResNet-34
    """
    expansion = 1   # output channels = planes × expansion

    def __init__(self, in_planes: int, planes: int, stride: int = 1):
        super().__init__()

        # Main path: two 3×3 convolutions
        self.conv1 = nn.Conv2d(in_planes, planes, 3, stride=stride, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(planes)
        self.relu  = nn.ReLU(inplace=True)

        self.conv2 = nn.Conv2d(planes, planes, 3, stride=1, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(planes)

        # Shortcut: a 1×1 convolution matches shapes when the size or channel count changes
        self.shortcut = nn.Identity()
        if stride != 1 or in_planes != planes * self.expansion:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, planes * self.expansion, 1, stride=stride, bias=False),
                nn.BatchNorm2d(planes * self.expansion),
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Main path
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))

        # Residual addition (shapes must match)
        out = out + self.shortcut(x)

        # Note: the ReLU comes after the addition
        return self.relu(out)

class BottleneckBlock(nn.Module):
    """
    ResNet bottleneck residual block (1×1 + 3×3 + 1×1)
    Used by ResNet-50/101/152 (deeper, but more parameter-efficient)
    """
    expansion = 4   # output channels = planes × 4

    def __init__(self, in_planes: int, planes: int, stride: int = 1):
        super().__init__()

        # 1×1 reduction (cuts the cost of the 3×3 convolution)
        self.conv1 = nn.Conv2d(in_planes, planes, 1, bias=False)
        self.bn1   = nn.BatchNorm2d(planes)

        # 3×3 convolution (feature extraction)
        self.conv2 = nn.Conv2d(planes, planes, 3, stride=stride, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(planes)

        # 1×1 expansion (restores the channel count)
        self.conv3 = nn.Conv2d(planes, planes * self.expansion, 1, bias=False)
        self.bn3   = nn.BatchNorm2d(planes * self.expansion)

        self.relu  = nn.ReLU(inplace=True)

        self.shortcut = nn.Identity()
        if stride != 1 or in_planes != planes * self.expansion:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, planes * self.expansion, 1, stride=stride, bias=False),
                nn.BatchNorm2d(planes * self.expansion),
            )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out = out + self.shortcut(x)
        return self.relu(out)

9.2 Full ResNet (adapted to small images)

class ResNet(nn.Module):
    """
    Generic ResNet implementation
    Supports BasicBlock (ResNet-18/34) and BottleneckBlock (ResNet-50/101/152)
    """

    def __init__(
        self,
        block:       type,
        num_blocks:  list[int],   # blocks per stage, e.g. [2,2,2,2] for ResNet-18
        num_classes: int  = 10,
        small_input: bool = True, # True for 28×28/32×32, False for 224×224
    ):
        super().__init__()
        self.in_planes = 64

        if small_input:
            # Small images: a simple 3×3 convolution (1 input channel, as in Fashion-MNIST)
            self.stem = nn.Sequential(
                nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1, bias=False),
                nn.BatchNorm2d(64),
                nn.ReLU(inplace=True),
            )
        else:
            # Large images (ImageNet): 7×7 convolution + MaxPool
            self.stem = nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
                nn.BatchNorm2d(64),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(3, stride=2, padding=1),
            )

        # 4 stages; from stage 2 on, each stage doubles the channels and halves the size
        self.layer1 = self._make_layer(block, 64,  num_blocks[0], stride=1)
        self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
        self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
        self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)

        # Classification head
        self.avgpool    = nn.AdaptiveAvgPool2d((1, 1))
        self.classifier = nn.Linear(512 * block.expansion, num_classes)

        # Weight initialization (important: it affects training stability)
        self._init_weights()

    def _make_layer(self, block, planes, num_blocks, stride):
        """Build one stage (several residual blocks); only the first block uses the given stride."""
        strides = [stride] + [1] * (num_blocks - 1)
        layers  = []
        for s in strides:
            layers.append(block(self.in_planes, planes, stride=s))
            self.in_planes = planes * block.expansion
        return nn.Sequential(*layers)

    def _init_weights(self):
        """He (Kaiming) initialization, designed for ReLU networks."""
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias,   0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        x = self.stem(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

# Factory functions for the ResNet variants
def ResNet18(num_classes=10, small_input=True):
    return ResNet(BasicBlock, [2, 2, 2, 2], num_classes, small_input)

def ResNet34(num_classes=10, small_input=True):
    return ResNet(BasicBlock, [3, 4, 6, 3], num_classes, small_input)

def ResNet50(num_classes=10, small_input=True):
    return ResNet(BottleneckBlock, [3, 4, 6, 3], num_classes, small_input)

# Check the parameter count of each variant
for name, fn in [("ResNet-18", ResNet18), ("ResNet-34", ResNet34), ("ResNet-50", ResNet50)]:
    m = fn(10, small_input=True)
    params = sum(p.numel() for p in m.parameters())
    x = torch.randn(2, 1, 28, 28)
    out = m(x)
    print(f"{name}: {params:,} parameters  output shape: {out.shape}")
# With the 1-channel 3×3 stem and 10 classes used here, expect roughly:
# ResNet-18: ~11.17M parameters
# ResNet-34: ~21.28M parameters
# ResNet-50: ~23.52M parameters
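
The ResNet-18 count can be reproduced by hand from the block definitions. The sketch below is plain Python with hypothetical helper names (`conv`, `bn`, `basic_block`, `resnet18_params`); it assumes the small-input stem, bias-free convolutions, and projection shortcuts exactly as defined in the classes above.

```python
def conv(c_in, c_out, k):
    """Weights of a bias-free k×k convolution."""
    return c_in * c_out * k * k

def bn(channels):
    """BatchNorm2d has one scale and one shift per channel."""
    return 2 * channels

def basic_block(c_in, c_out, stride):
    n = conv(c_in, c_out, 3) + bn(c_out) + conv(c_out, c_out, 3) + bn(c_out)
    if stride != 1 or c_in != c_out:
        n += conv(c_in, c_out, 1) + bn(c_out)      # 1×1 projection shortcut
    return n

def resnet18_params(in_channels=1, num_classes=10):
    total = conv(in_channels, 64, 3) + bn(64)      # small-input stem
    c_in = 64
    for c_out, stride in [(64, 1), (128, 2), (256, 2), (512, 2)]:
        total += basic_block(c_in, c_out, stride)  # first block of the stage
        total += basic_block(c_out, c_out, 1)      # second block
        c_in = c_out
    return total + 512 * num_classes + num_classes # linear classifier

gray = resnet18_params(in_channels=1)
rgb  = resnet18_params(in_channels=3)
print(f"1-channel stem: {gray:,}")
print(f"3-channel stem: {rgb:,}  (difference: {rgb - gray:,} stem weights)")
```

The only difference between the grayscale and RGB variants is the first convolution, which is why counts quoted for CIFAR-style ResNet-18 models are 1,152 higher than what the Fashion-MNIST version here has.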

10. End-to-End Practice: Fashion-MNIST Accuracy to 95%+

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, random_split

# ── Data preparation (stronger data augmentation) ─────────────

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(28, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.2860,), (0.3530,)),
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.2860,), (0.3530,)),
])

full_train = torchvision.datasets.FashionMNIST("./data", True,  download=True, transform=train_transform)
full_eval  = torchvision.datasets.FashionMNIST("./data", True,  download=True, transform=test_transform)
test_data  = torchvision.datasets.FashionMNIST("./data", False, download=True, transform=test_transform)

# Split the indices once; the validation subset reads from the un-augmented copy,
# so validation accuracy is not distorted by random crops and flips
train_set, val_split = random_split(full_train, [54000, 6000],
                                    generator=torch.Generator().manual_seed(42))
val_set = torch.utils.data.Subset(full_eval, val_split.indices)
train_loader = DataLoader(train_set, 128, shuffle=True,  num_workers=2, pin_memory=True)
val_loader   = DataLoader(val_set,   256, shuffle=False, num_workers=2)
test_loader  = DataLoader(test_data, 256, shuffle=False, num_workers=2)

# ── Model and training setup ──────────────────────────────────

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model  = ResNet18(num_classes=10, small_input=True).to(device)

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = optim.SGD(                 # the original ResNet paper uses SGD with momentum
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4,
    nesterov=True,
)
# Cosine annealing schedule (from lr=0.1 down to eta_min=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-4)

# ── Training and evaluation function ──────────────────────────

def run_epoch(model, loader, criterion, optimizer=None):
    is_train = optimizer is not None
    model.train() if is_train else model.eval()
    total_loss, correct, total = 0., 0, 0

    ctx = torch.enable_grad() if is_train else torch.no_grad()
    with ctx:
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            if is_train:
                optimizer.zero_grad()
            logits = model(images)
            loss   = criterion(logits, labels)
            if is_train:
                loss.backward()
                nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
                optimizer.step()
            total_loss += loss.item() * len(images)
            correct    += (logits.argmax(1) == labels).sum().item()
            total      += len(images)

    return total_loss / total, correct / total

# ── Training loop ─────────────────────────────────────────────

best_val_acc, patience_cnt = 0., 0
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"{'Epoch':^6}{'LR':^9}{'TrLoss':^9}{'TrAcc':^8}{'VaLoss':^9}{'VaAcc':^8}")
print("─" * 50)

for epoch in range(1, 51):
    tr_loss, tr_acc = run_epoch(model, train_loader, criterion, optimizer)
    va_loss, va_acc = run_epoch(model, val_loader,   criterion)
    lr = optimizer.param_groups[0]["lr"]
    scheduler.step()

    flag = " ★" if va_acc > best_val_acc else ""
    print(f"{epoch:^6}{lr:^9.5f}{tr_loss:^9.4f}{tr_acc:^8.4f}{va_loss:^9.4f}{va_acc:^8.4f}{flag}")

    if va_acc > best_val_acc:
        best_val_acc = va_acc
        patience_cnt = 0
        torch.save(model.state_dict(), "best_resnet.pt")
    else:
        patience_cnt += 1
        if patience_cnt >= 10:
            print(f"Early stopping at epoch {epoch}")
            break

model.load_state_dict(torch.load("best_resnet.pt", map_location=device))
_, test_acc = run_epoch(model, test_loader, criterion)
print(f"\nTest accuracy: {test_acc:.4f}")
# Typical result: roughly 95.0%~95.5%

# ── Comparison across models ──────────────────────────────────

# The accuracies below are hardcoded reference values, not computed by this script
results = {
    "Softmax regression (Part 2)": 0.840,
    "MLP (Part 3)":                0.921,
    "LeNet-5":                     0.910,
    "AlexNet-Small":               0.930,
    "ResNet-18 (this part)":       0.953,
}

print("\nAccuracy across models:")
print(f"{'Model':30s} {'Accuracy':>8}")
print("─" * 40)
for name, acc in results.items():
    bar = "█" * int(acc * 30)
    print(f"{name:30s} {acc:.1%}  {bar}")
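
The cosine annealing schedule used in the training setup follows a simple closed form. The sketch below is plain Python; `cosine_annealing_lr` is a hypothetical helper, not PyTorch's scheduler class, and it assumes one scheduler step per epoch with the same `lr=0.1`, `eta_min=1e-4`, and `T_max=50` as above.

```python
import math

def cosine_annealing_lr(epoch, base_lr=0.1, eta_min=1e-4, t_max=50):
    """Closed-form cosine annealing: eta_min + (base_lr - eta_min) * (1 + cos(pi * t / T)) / 2."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

schedule = [cosine_annealing_lr(t) for t in range(51)]

for t in (0, 10, 25, 40, 50):
    print(f"epoch {t:2d}: lr = {schedule[t]:.5f}")
```

The learning rate starts at 0.1, passes the midpoint between 0.1 and `eta_min` at epoch 25, and flattens out near `eta_min`, so the last epochs make only small updates.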

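11. Frequently Asked Interview Questions

Q: What is the essential difference between convolution and a fully connected layer?

Fully connected: every output neuron connects to every input neuron, parameters = input dimension × output dimension, and there is no structural assumption; the input is treated as an unordered feature vector. Convolution: ① local connectivity: each output neuron connects only to the pixels inside a local window, exploiting local correlation in images; ② parameter sharing: the same kernel slides over every position of the image and detects the same kind of feature (the parameter count is independent of image size and depends only on kernel size and channel counts); ③ translation equivariance: when the object shifts, the feature map shifts with it, which suits spatially structured data such as images.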

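Q: Why do ResNet's residual connections solve the degradation problem of deep networks?

The essence of the degradation problem is that a deep network struggles to learn identity mappings (layers that change nothing). With $\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$, the residual connection turns that into learning the residual $\mathcal{F}(\mathbf{x}) = 0$ (pushing the weights toward 0 is easier). In addition, the residual connection gives gradients a "highway": during backpropagation $\partial L/\partial \mathbf{x} = \partial L/\partial \mathbf{y} \cdot (\partial \mathcal{F}/\partial \mathbf{x} + I)$, and the identity term $I$ ensures that even when the gradient of $\mathcal{F}$ is tiny, the gradient still reaches the shallow layers directly, which fundamentally mitigates vanishing gradients.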

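Q: Why use several 3×3 convolutions instead of one large kernel (such as 5×5 or 7×7)?

Three advantages: ① fewer parameters: two 3×3 layers have $2 \times 3^2 = 18$ weights per channel pair and the same receptive field as one 5×5 (25 weights), a 28% saving; ② more nonlinearity: there is an activation function between the two 3×3 layers, so the stack is more expressive; ③ less computation: the cost of a convolution is $\propto k^2$, so for the same receptive field several small kernels are cheaper. This is the core design idea of VGG, and most later CNNs adopted it.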
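
The saving grows with the kernel being replaced. The sketch below is plain Python with hypothetical helper names; it assumes equal input and output channel counts (64 here, chosen only for illustration) and ignores biases. It compares stacks of 3×3 layers with the single large kernel that has the same receptive field.

```python
def stacked_3x3_weights(num_layers, channels):
    """Weights of num_layers stacked 3×3 convs, each channels → channels."""
    return num_layers * 3 * 3 * channels * channels

def single_kernel_weights(kernel_size, channels):
    """Weights of one k×k conv, channels → channels."""
    return kernel_size * kernel_size * channels * channels

def stacked_receptive_field(num_layers):
    """Receptive field of num_layers stacked stride-1 3×3 convs."""
    return 2 * num_layers + 1

channels = 64
ratios = {}
for n in (2, 3):
    k = stacked_receptive_field(n)                 # 5 for two layers, 7 for three
    small = stacked_3x3_weights(n, channels)
    large = single_kernel_weights(k, channels)
    ratios[k] = small / large
    print(f"{n} × 3×3 = {small:,} weights vs one {k}×{k} = {large:,} weights "
          f"({1 - ratios[k]:.0%} saved)")
```

At a fixed feature-map size the multiply-accumulate count scales with the same weight totals, so the compute saving has the same ratio.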

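Q: What does inplace=True do in ReLU, and what is the risk?

ReLU(inplace=True) overwrites its input tensor in place (no new memory is allocated), which saves GPU memory (precious when training large models). The risk: if the original values of the overwritten tensor are still needed for backpropagation (because another operation used that tensor as its input), the in-place write invalidates what autograd saved and the backward pass fails (RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation). It is safe only when no other operation still needs the pre-ReLU values. In a ResNet block, do not apply an in-place ReLU to the block input $\mathbf{x}$ before the residual addition, because $\mathbf{x}$ is still needed on the shortcut path.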

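Q: Where does BatchNorm go in a CNN, and why bias=False?

Standard placement: Conv → BN → ReLU (BN before the activation). Reason: BN normalizes the activations to mean 0 and variance 1 and then applies the learnable scale $\gamma$ and shift $\beta$, which keeps the values fed into ReLU in a well-scaled range and helps gradients flow. bias=False because BN already contains a learnable bias parameter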

>

β

>