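Information Maximization

In domain adaptation tasks where the target domain has no labels, information maximization pushes the model to make predictions on unlabeled data that are both high-confidence and balanced across classes, without access to ground-truth labels. These predictions can also serve as pseudo-labels for self-training.

For example, when the target domain has no labels, the information maximization loss can be applied to target-domain data so that the model adapts to the target domain and produces meaningful predictions, mitigating the distribution shift between the source and target domains. In self-training or generative models, requiring the overall prediction distribution to be balanced helps prevent the model from assigning all samples to just a few classes.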

Information Maximization Loss

The information maximization loss can be written as [1]:
$\begin{align} \mathcal{L}_{IM} &= \mathcal{L}_{ent}+\mathcal{L}_{div} \\ &= H(Y|X)-H(Y) \end{align}$

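Here, $H(Y \mid X)$ is the conditional entropy of the labels predicted by the model; minimizing it makes each prediction $P(Y \mid X)$ more confident. $H(Y)$ is the marginal entropy of the predicted class labels. Because it enters the loss with a negative sign, minimizing the loss maximizes the marginal entropy, which pushes the predictions to spread evenly over the classes rather than favoring any one of them.

In essence, information maximization maximizes the mutual information between the predicted label $Y$ and the input $X$, $I(X; Y) = H(Y) - H(Y \mid X)$, so $\mathcal{L}_{IM} = -I(X;Y)$. Minimizing this loss is equivalent to increasing the mutual information between the input and the predicted label.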
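
The identity $\mathcal{L}_{IM} = -I(X;Y)$ can be checked numerically. The sketch below is only an illustration in plain Python: the batch `probs` is made up, the helper names (`entropy`, `marginal`, `mutual_information`) are hypothetical, and every sample is assumed to carry weight $1/N$.

```python
import math


def entropy(dist):
    """Shannon entropy in nats; terms with zero probability contribute 0."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def marginal(probs):
    """Batch-mean prediction: p_hat[c] = (1/N) * sum_i probs[i][c]."""
    n = len(probs)
    return [sum(row[c] for row in probs) / n for c in range(len(probs[0]))]


def mutual_information(probs):
    """Direct definition of I(X;Y), assuming P(X = x_i) = 1/N."""
    n = len(probs)
    p_hat = marginal(probs)
    total = 0.0
    for row in probs:
        for c, p in enumerate(row):
            if p > 0:
                # p(x_i, c) / (p(x_i) * p(c)) simplifies to p / p_hat[c]
                total += (p / n) * math.log(p / p_hat[c])
    return total


# Hypothetical softmax outputs for a batch of 2 samples and 2 classes
probs = [[0.9, 0.1], [0.2, 0.8]]

h_y_given_x = sum(entropy(row) for row in probs) / len(probs)  # H(Y|X)
h_y = entropy(marginal(probs))                                 # H(Y)
loss_im = h_y_given_x - h_y                                    # L_IM

# The two printed values agree up to floating-point rounding
print(loss_im, -mutual_information(probs))
```

Computing the mutual information from its definition and from the two entropies gives the same number, which is why minimizing the loss raises $I(X;Y)$.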

① Entropy minimization loss:
$\begin{align} \mathcal{L}_{ent} = -\mathbb{E}_x \sum_{c=1}^C \delta_c(f(x)) \log \delta_c(f(x)) \end{align}$

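Here, $f(x)$ denotes the model's output (the logits) and $\delta_c$ is the $c$-th component of the softmax function, i.e. the predicted probability that sample $x$ belongs to class $c$.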
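
As a small illustration of the entropy term, the sketch below evaluates it on two hand-picked logit vectors in plain Python. The helper names (`softmax`, `sample_entropy`, `entropy_loss`) and the logit values are assumptions made for this example only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def sample_entropy(logits):
    """Entropy (in nats) of the softmax prediction for one sample."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0)


def entropy_loss(batch_logits):
    """L_ent: mean per-sample entropy over the batch."""
    return sum(sample_entropy(z) for z in batch_logits) / len(batch_logits)


uncertain = [0.0, 0.0, 0.0]   # uniform prediction over 3 classes
confident = [10.0, 0.0, 0.0]  # almost all mass on class 0

print(sample_entropy(uncertain))  # equals log(3), the maximum for 3 classes
print(sample_entropy(confident))  # close to 0
print(entropy_loss([uncertain, confident]))
```

The uniform prediction reaches the maximum entropy $\log C$, while the peaked prediction is close to zero, so minimizing $\mathcal{L}_{ent}$ favors confident outputs.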

② Diversity maximization loss (diversity regularization):
$\begin{align} \mathcal{L}_{div} &=-(- \sum_{c=1}^C \hat{p}_c \log \hat{p}_c)\\ &= \sum_{c=1}^C \hat{p}_c \log \hat{p}_c \end{align}$

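Here, $\hat{p}_c=\frac{1}{N}\sum_{i=1}^{N}p_{ic}$ is the mean predicted probability of class $c$ over the whole batch, with $p_{ic} = \delta_c(f(x_i))$; it plays the role of the marginal probability of class $c$. Note: mathematically, the marginal probability is $P (Y = c) = \sum_{i=1}^N P(X = x_i, Y = c)$. Each of the $N$ samples in the batch carries weight $P(X = x_i) = \frac{1}{N}$, so the joint probability is $P(X = x_i, Y = c) = \frac{1}{N} p_{ic}$, and summing it over $i$ gives the mean of the $p_{ic}$ rather than their plain sum. The batch mean $\hat{p}_c$ is then used as an unbiased estimate of the marginal over the whole target domain.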
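
To see how the diversity term reacts to class collapse, the following sketch evaluates $\mathcal{L}_{div}$ on two made-up batches in plain Python. The function name `diversity_loss` and both batches are hypothetical and serve only as an illustration.

```python
import math


def diversity_loss(probs):
    """L_div = sum_c p_hat_c * log(p_hat_c), i.e. the negative marginal entropy."""
    n = len(probs)
    num_classes = len(probs[0])
    # p_hat[c]: batch mean of the predicted probability of class c
    p_hat = [sum(row[c] for row in probs) / n for c in range(num_classes)]
    return sum(p * math.log(p) for p in p_hat if p > 0)


# Every sample is assigned to class 0: the prediction has collapsed
collapsed = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
# Each class receives exactly one sample: the batch is balanced
balanced = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(diversity_loss(collapsed))  # 0, the largest possible value
print(diversity_loss(balanced))   # -log(3), the smallest possible value for 3 classes
```

Because $\mathcal{L}_{div}$ is lowest when $\hat{p}$ is uniform, minimizing it penalizes batches whose predictions concentrate on a single class.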

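Summary:

Minimizing the conditional entropy $H(Y \mid X)$: pushes the model to make a confident prediction for each target-domain sample.
Maximizing the marginal entropy $H(Y)$: keeps the predicted class distribution balanced over the whole target domain and prevents collapse onto a few classes.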
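
The interplay of the two goals can be sketched on three hand-made batches. This is an illustrative plain-Python example with hypothetical names (`entropy`, `im_loss`) and made-up probabilities; it uses the unweighted form $H(Y \mid X) - H(Y)$.

```python
import math


def entropy(dist):
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def im_loss(probs):
    """Unweighted L_IM = H(Y|X) - H(Y) for a batch of probability vectors."""
    n = len(probs)
    num_classes = len(probs[0])
    cond = sum(entropy(row) for row in probs) / n  # H(Y|X)
    p_hat = [sum(row[c] for row in probs) / n for c in range(num_classes)]
    return cond - entropy(p_hat)                   # subtract H(Y)


third = 1.0 / 3.0
confident_balanced = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
confident_collapsed = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
uncertain = [[third, third, third]] * 3

for name, batch in [("confident + balanced", confident_balanced),
                    ("confident + collapsed", confident_collapsed),
                    ("uncertain", uncertain)]:
    print(name, im_loss(batch))
```

Only the batch that is both confident and balanced reaches the minimum $-\log C$. The collapsed batch and the uniformly uncertain batch both give a loss of about 0, which shows that neither term alone is enough.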

Code

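import torch
import torch.nn as nn
import torch.nn.functional as F


class IMLoss(nn.Module):
    def __init__(self, lambda_div=0.1, eps=1e-8):
        """Information maximization loss (unsupervised learning).

        Args:
            lambda_div (float): weight of the diversity term, default 0.1
                (lambda_div=1.0 recovers L_IM = H(Y|X) - H(Y) as written above)
            eps (float): numerical-stability term that prevents log(0), default 1e-8
        """
        super().__init__()
        self.lambda_div = lambda_div
        self.eps = eps

    def forward(self, logits):
        """Compute the information maximization loss.

        Args:
            logits (torch.Tensor): model output logits, shape (batch_size, num_classes)

        Returns:
            torch.Tensor: the scalar total loss
        """
        # Softmax probabilities
        probs = F.softmax(logits, dim=1)

        # 1. L_ent: entropy minimization loss, makes each prediction more confident
        entropy_per_sample = -torch.sum(probs * torch.log(probs + self.eps), dim=1)
        entropy_loss = torch.mean(entropy_per_sample)

        # 2. L_div: diversity maximization loss, makes the class distribution balanced
        # Marginal distribution: each sample has weight 1/N, so use the mean, not the sum
        mean_probs = torch.mean(probs, dim=0)
        # Marginal entropy H(Y); note that L_div = -H(Y)
        marginal_entropy = -torch.sum(mean_probs * torch.log(mean_probs + self.eps))

        # Total loss: L_ent + lambda_div * L_div = H(Y|X) - lambda_div * H(Y)
        total_loss = entropy_loss - self.lambda_div * marginal_entropy
        return total_loss


num_classes = 3
bs = 2
logits = torch.randn(bs, num_classes)  # model output logits: model(x)
loss_fn = IMLoss()
loss = loss_fn(logits)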

Background

Marginal distribution, definition [2]: Given a known joint distribution of two discrete random variables, say, $X$ and $Y$, the marginal distribution of either variable – $X$ for example – is the probability distribution of $X$ when the values of $Y$ are not taken into consideration.
$p_X(x_i)=\sum_jp(x_i,y_j),\\ p_Y(y_j)=\sum_i p(x_i,y_j)$

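Example:

In the table below, a batch has 3 samples ($x_1$ to $x_3$) and 4 classes ($y_1$ to $y_4$). The marginal distribution of the random variable $Y$ is the last row, and the marginal distribution of $X$ is the last column.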

	$y_1$	$y_2$	$y_3$	$y_4$	$p_X(x)$
$x_1$	$\frac{4}{32}$	$\frac{2}{32}$	$\frac{1}{32}$	$\frac{1}{32}$	$\frac{8}{32}$
$x_2$	$\frac{3}{32}$	$\frac{6}{32}$	$\frac{3}{32}$	$\frac{3}{32}$	$\frac{15}{32}$
$x_3$	$\frac{9}{32}$	$0$	$0$	$0$	$\frac{9}{32}$
$p_Y(y)$	$\frac{16}{32}$	$\frac{8}{32}$	$\frac{4}{32}$	$\frac{4}{32}$	$\frac{32}{32}$
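
The marginals in the table can be reproduced by summing the joint probabilities along rows and columns. The sketch below uses Python's `fractions` module so the results stay exact; the variable names (`joint`, `p_x`, `p_y`) are chosen for this illustration only.

```python
from fractions import Fraction

# Joint distribution p(x_i, y_j) from the table: rows are x_1..x_3, columns are y_1..y_4
joint = [
    [Fraction(4, 32), Fraction(2, 32), Fraction(1, 32), Fraction(1, 32)],
    [Fraction(3, 32), Fraction(6, 32), Fraction(3, 32), Fraction(3, 32)],
    [Fraction(9, 32), Fraction(0), Fraction(0), Fraction(0)],
]

# p_X(x_i) = sum_j p(x_i, y_j): sum each row
p_x = [sum(row) for row in joint]
# p_Y(y_j) = sum_i p(x_i, y_j): sum each column
p_y = [sum(col) for col in zip(*joint)]

print(p_x)  # fractions are shown in lowest terms, e.g. 8/32 appears as 1/4
print(p_y)
print(sum(p_x), sum(p_y))  # both marginals sum to 1
```

Note that `Fraction` reduces each value, so $\frac{16}{32}$ in the last row of the table is printed as $\frac{1}{2}$.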