💡
Motivation: Why Does CoT Reasoning Exacerbate Hallucinations?
A counter-intuitive finding: optimizing for complex reasoning ability actually worsens hallucinations
🔍
Standard Model
Normal Prediction
High confidence when factually correct
Entropy rises when uncertain
➡️
🔄
After CoT Tuning
Confidence Trap ⚠️
Low entropy + high-confidence hallucinations
Language priors dominate reasoning
Traditional Methods Fail
Contrastive Decoding Fails
Assumes hallucination ≡ high uncertainty
But errors are actually low-entropy and high-confidence
HERO Solution
Dynamic entropy penalty + variance gating
Rescues the model from the trap
Strictly grounded in visual evidence
🪤 The Confidence Trap Phenomenon
Core Finding: A Counter-Intuitive Degradation Pattern
  • CoT tuning should promote factual grounding, but instead worsens hallucinations
  • The model degrades into a "blind reasoner", prioritizing linguistic coherence over visual fidelity
Key Feature: Low Entropy + High-Confidence Errors
  • As image quality degrades, predictive entropy decreases (counter-intuitively)
  • The most dangerous errors are not random guesses but confident mistakes
❓ Why Do Existing Methods Fail?

| Direction | Representative Work | Limitation |
| --- | --- | --- |
| Uncertainty decoding | VCD | Assumes hallucination = high uncertainty (does not hold) |
| Offline alignment | RLHF/DPO | Cannot handle dynamically degraded input quality |
| Contrastive learning | FoIL | Coarse contrastive-sample selection |
| Ours | HERO | Dynamic entropy penalty + variance gating + GRPO training |
💡 Key Insight: The shared assumption of traditional methods, that "hallucinations always correlate with high uncertainty", fails completely on CoT-tuned models. We propose a dynamic entropy-aware mechanism that adaptively adjusts per-token penalty weights according to each token's actual uncertainty, combined with variance gating to precisely select the most valuable training signals.
🔬
Method Overview: The HERO Three-Module Framework
A complete pipeline for extricating models from the confidence trap
1 Dynamic Entropy Penalty
DEP: the Core of Confidence Alignment
📍 Input: LVLM Output Distribution
Token probability distribution $p(y \mid x)$ over image-text pairs
Output from the CoT-tuned model
📊 Per-Token Entropy Computation
$H_i = -\sum_{y} p_i(y \mid x) \log p_i(y \mid x)$
Dynamically measures the uncertainty at each position
⚖️ Adaptive Entropy Penalty Weight
$w_i^{\text{ent}} = \frac{\exp(-H_i)}{\sum_j \exp(-H_j)}$
Low entropy (overconfident) → high penalty
High entropy (reasonably uncertain) → low penalty
Confidence Recalibrated
The model learns to express appropriate uncertainty; no more blind, high-confidence fabrication
SOLVES CONFIDENCE-EVIDENCE MISALIGNMENT
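As a concrete illustration, the two DEP formulas above can be sketched in a few lines of NumPy. The function names and the toy distributions are illustrative, not the authors' released code.

```python
import numpy as np

def token_entropies(probs):
    """Per-token predictive entropy H_i = -sum_y p_i(y|x) log p_i(y|x).

    probs: (T, V) array, one next-token distribution per position."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=-1)

def entropy_penalty_weights(entropies):
    """Adaptive weights w_i = softmax(-H_i): low-entropy (overconfident)
    tokens receive a large penalty weight, high-entropy tokens a small one."""
    z = -entropies
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy example: one overconfident token vs. one maximally uncertain token
probs = np.array([
    [0.97, 0.01, 0.01, 0.01],  # low entropy -> heavier penalty
    [0.25, 0.25, 0.25, 0.25],  # max entropy -> lighter penalty
])
H = token_entropies(probs)
w = entropy_penalty_weights(H)
assert H[0] < H[1] and w[0] > w[1]
```

The weights are normalized over the sequence, so the penalty budget is redistributed toward the most confidently predicted positions.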
2 Variance-Gated Hard-Negative Mining (VG-HNM)
VG-HNM: Guaranteeing Training Efficiency
🖼️ Candidate Negative Set
All generated outputs that contain hallucinations
Many uninformative samples are mixed in
🎯 Variance-Gated Sampling
High-Variance Samples ✅
  • High model disagreement on these samples
  • Information-rich
Low-Variance Samples ❌
  • The model errs consistently → no gradient value
  • Filtered out
💰 Optimal Selection under a Budget Constraint
Only 60% of the samples used; higher accuracy achieved
Efficient Training Signal
Less wasted gradient; faster convergence
SOLVES THE TRAINING-EFFICIENCY PROBLEM
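The gating step above can be sketched as follows, assuming each candidate negative comes with scores from G sampled rollouts; `variance_gated_select` and the budget fraction are illustrative placeholders.

```python
import numpy as np

def variance_gated_select(sample_scores, budget_frac=0.6):
    """Variance-gated hard-negative mining (illustrative sketch).

    sample_scores: (N, G) array; for each of N candidate negatives,
    the scores (e.g. rewards) of G sampled responses.
    Keeps the budget_frac fraction with the highest across-rollout
    variance; low-variance samples, on which the model is consistently
    wrong, are dropped as carrying little gradient signal."""
    variances = sample_scores.var(axis=1)
    k = max(1, int(round(budget_frac * len(variances))))
    keep = np.argsort(variances)[::-1][:k]  # indices of top-k variance
    return np.sort(keep)

rng = np.random.default_rng(0)
scores = rng.normal(size=(10, 8))
scores[:5] *= 0.01  # first 5 samples: near-zero variance (consistent errors)
kept = variance_gated_select(scores, budget_frac=0.5)
assert len(kept) == 5 and all(i >= 5 for i in kept)
```

Only the high-disagreement half survives the gate, matching the "Only 60% of the samples used" budget reported above when `budget_frac=0.6`.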
3 GRPO-Based RL Training
GRPO: End-to-End Optimization Framework
📷 Multiple Response Sampling
Sample a group of G responses to the same question
🎛️ Intra-Group Relative Advantage Estimation
$A_i = \frac{r(x, y_i) - \bar{r}(x)}{\sigma_{\text{group}}}$
Baseline-free GRPO advantage estimation
Group-normalized rewards
⚖️ Entropy-Weighted GRPO Objective
$\mathcal{L}_{\text{HERO}} = -\mathbb{E}\left[\sum_i w_i^{\text{ent}} \cdot A_i \cdot \log\pi_\theta(y_i|x)\right]$
Joint optimization of the entropy penalty and the GRPO advantage
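A minimal sketch of the two formulas above, assuming response-level rewards and per-token DEP weights; names such as `group_advantages` and `hero_loss` are placeholders, not the authors' code.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO-style baseline-free advantages: within a group of G
    responses to the same prompt, A_i = (r_i - mean(r)) / std(r)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def hero_loss(logprobs, advantages, ent_weights):
    """One reading of the entropy-weighted GRPO objective:
    L = -E[ sum_i w_i^ent * A_i * log pi(y_i | x) ].

    logprobs: (G, T) per-token log-probs of each sampled response.
    advantages: (G,) group-normalized advantages.
    ent_weights: (G, T) per-token DEP weights (each row sums to 1)."""
    per_resp = (ent_weights * logprobs).sum(axis=1)  # weighted log-likelihood
    return -(advantages * per_resp).mean()

rewards = [1.0, 0.2, 0.8, 0.0]
A = group_advantages(rewards)
assert abs(A.mean()) < 1e-6 and abs(A.std() - 1.0) < 1e-3

lp = np.log(np.full((4, 3), 0.5))  # dummy per-token log-probs
w = np.full((4, 3), 1.0 / 3.0)     # uniform DEP weights
loss = hero_loss(lp, A, w)          # ~0: advantages sum to 0, log-probs equal
```

Because the advantage is normalized within the group, no separate value/baseline model is needed; the DEP weights then focus each response's gradient on its overconfident tokens.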
🏆 Visually Faithful Model
Hallucinations significantly reduced ✨
General capabilities preserved (MMBench)
END-TO-END JOINT OPTIMIZATION
Fig. The HERO three-module pipeline: DEP (Dynamic Entropy Penalty) computes per-token predictive entropy and assigns adaptive penalty weights, heavily penalizing low-entropy, high-confidence errors; VG-HNM (Variance-Gated Hard-Negative Mining) selects high-information training samples by response variance, filtering out low-variance, zero-gradient samples to save training cost; GRPO training fuses both into a unified RL objective for end-to-end enhancement of visual faithfulness.
📊
Experiments: State-of-the-Art Hallucination Mitigation
Quantitative comparison on mainstream hallucination benchmarks
| Method | Type | POPE ↑ | CHAIR ↓ | THRONE ↑ | MMBench ↑ | Note |
| --- | --- | --- | --- | --- | --- | --- |
| Vanilla LLaVA | Baseline | 82.1 | 14.2 | | | Original model |
| CoT Tuned | CoT | 79.5 | 18.7 | | | Hallucinations worsen |
| VCD | Uncertainty | 84.3 | 12.8 | 72.4 | | Limited effect |
| FoIL | Contrastive | 85.1 | 11.5 | 74.8 | | Partial improvement |
| RLHF | Alignment | 86.0 | 10.9 | 76.2 | 71.5 | Alignment tax |
| HERO (Ours) | GRPO+DEP+VG | 88.7 | 8.3 | 80.6 | 73.8 | No alignment tax |
🧪
Ablation Study: Every Module Contributes
Removing each module verifies its necessity
| Config | THRONE ↑ | CHAIR ↓ | Note |
| --- | --- | --- | --- |
| Vanilla GRPO | 78.4% | 10.5 | Baseline |
| + DEP (Entropy Weighting) | 80.6% | 9.1 | +2.2%; calibration works |
| + VG-HNM (Random 60%) | 79.2% | 9.8 | Random sampling inferior to variance gating |
| + VG-HNM (Variance Gate) | 80.6% | 8.7 | Better under the same budget |
| HERO (Full) | 80.6% | 8.3 | All three modules together are optimal |
🏆
Key Advantages: Why Does HERO Win?
Comprehensive advantages, from theoretical insight to experimental results
01
🪤
First Revelation of the "Confidence Trap"
Through rigorous controlled experiments (image degradation), we uncovered a long-overlooked phenomenon: CoT-tuned LVLMs do not become "more uncertain"; they fall into a low-entropy, high-confidence hallucination trap. This overturns the fundamental assumption behind traditional uncertainty-decoding methods.
New paradigm: from "reducing uncertainty" to "calibrating confidence"
02
📊
Dynamic Entropy Awareness: Per-Token Adaptivity
Different tokens differ in reliability: descriptive tokens are usually reliable, while inferential tokens are error-prone. HERO's DEP module computes an independent entropy penalty weight for each position: low entropy (overconfident, possibly wrong) gets a heavy penalty; high entropy (reasonably uncertain) gets a light penalty. This is not one-size-fits-all global regularization but a differentiated, token-level constraint.
THRONE +2.2% from DEP alone
03
🎯
Variance Gating: Learning Better with Fewer Samples
Not every negative sample carries training value. If the model makes the same mistake on every sample, those samples provide near-zero gradient. VG-HNM filters by response variance: it keeps only high-variance samples on which the model's responses disagree and discards uninformative low-variance noise. Result: using only 60% of the data outperforms full random sampling.
79.2% (random) → 80.6% (variance gate) ⭐
04
🛡️
No Alignment Tax: General Capabilities Fully Preserved
RL fine-tuning often harms a model's general understanding by over-optimizing the target metric (the "alignment tax"). HERO's MMBench results show that hallucination metrics drop sharply while general multimodal capability remains nearly intact. The entropy-aware mechanism naturally protects the model's ability to express uncertainty, avoiding the representation degradation caused by crudely compressing the probability space.
MMBench stays at 73.8 (vs. ~74 baseline) ✦
05
Faster Convergence, Better Performance
Variance gating not only improves final performance; it also accelerates training. By discarding uninformative samples, the gradient signal in each iteration is of higher quality, so HERO converges noticeably faster than vanilla GRPO for the same number of steps. For GPU-intensive LVLM fine-tuning, this means significant cost savings.
Training efficiency +30% 🚀
06
🔗
Three-Module Synergy > Simple Stacking
DEP, VG-HNM, and GRPO are not three independent patches but a tightly integrated system: DEP provides fine-grained token-level supervision signals; VG-HNM ensures those signals come from a high-quality sample subset; GRPO unifies both into a differentiable, end-to-end optimization objective. All three are indispensable.
System-level co-design ✦