💡
Motivation: Why Does CoT Reasoning Exacerbate Hallucinations?
A counter-intuitive finding: optimizing for complex reasoning ability actually worsens hallucinations
🔍
Standard Model
Normal Prediction
High confidence when factually correct
Entropy rises when uncertain
➡️
🔄
After CoT Tuning
Confidence Trap ⚠️
Low entropy + high-confidence hallucinations
Language priors dominate reasoning
Traditional Methods Fail
Contrastive Decoding Fails
Assumes hallucination ≡ high uncertainty
But errors are actually low-entropy and high-confidence
HERO Solution
Dynamic entropy penalty + variance gating
Rescues the model from the trap
Strictly grounded in visual evidence
🪤 The Confidence Trap Phenomenon
Core Finding: A Counter-Intuitive Degradation Pattern
  • CoT tuning should promote factual grounding, but instead worsens hallucinations
  • The model degrades into a "blind reasoner", prioritizing linguistic coherence over visual fidelity
Key Feature: Low Entropy + High-Confidence Errors
  • As image quality degrades, predictive entropy decreases (counter-intuitively)
  • The most dangerous errors are not random guesses but confident mistakes
❓ Why Do Existing Methods Fail?

| Direction | Representative Work | Limitation |
| --- | --- | --- |
| Uncertainty decoding | VCD | Assumes hallucination = high uncertainty (does not hold) |
| Offline alignment | RLHF/DPO | Cannot handle dynamically degraded input quality |
| Contrastive learning | FoIL | Coarse contrastive-sample selection |
| Ours | HERO | Dynamic entropy penalty + variance gating + GRPO training |
💡 Key Insight: The shared assumption of traditional methods, that "hallucinations always correlate with high uncertainty", fails completely on CoT-tuned models. We propose a dynamic entropy-aware mechanism that adaptively adjusts per-token penalty weights according to each token's actual uncertainty, combined with variance gating to precisely select the most valuable training signals.
🔬
Method Overview: The HERO Three-Module Framework
A complete pipeline for extricating models from the confidence trap
1 Dynamic Entropy Penalty
DEP: the Core of Confidence Alignment
📍 Input: LVLM Output Distribution
Token probability distribution $p(y \mid x)$ over image-text pairs
Output from the CoT-tuned model
📊 Per-Token Entropy Computation
$H_i = -\sum_{y} p_i(y \mid x) \log p_i(y \mid x)$
Dynamically measures the uncertainty at each position
⚖️ Adaptive Entropy Penalty Weight
$w_i^{\text{ent}} = \frac{\exp(-H_i)}{\sum_j \exp(-H_j)}$
Low entropy (overconfident) → high penalty
High entropy (reasonably uncertain) → low penalty
Confidence Recalibrated
The model learns to express appropriate uncertainty; no more blind, high-confidence fabrication
SOLVES CONFIDENCE-EVIDENCE MISALIGNMENT
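As a concrete illustration, the two DEP formulas above can be sketched in a few lines of NumPy. The function names and the toy distributions are illustrative, not the authors' released code.

```python
import numpy as np

def token_entropies(probs):
    """Per-token predictive entropy H_i = -sum_y p_i(y|x) log p_i(y|x).

    probs: (T, V) array, one next-token distribution per position."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=-1)

def entropy_penalty_weights(entropies):
    """Adaptive weights w_i = softmax(-H_i): low-entropy (overconfident)
    tokens receive a large penalty weight, high-entropy tokens a small one."""
    z = -entropies
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy example: one overconfident token vs. one maximally uncertain token
probs = np.array([
    [0.97, 0.01, 0.01, 0.01],  # low entropy -> heavier penalty
    [0.25, 0.25, 0.25, 0.25],  # max entropy -> lighter penalty
])
H = token_entropies(probs)
w = entropy_penalty_weights(H)
assert H[0] < H[1] and w[0] > w[1]
```

The weights are normalized over the sequence, so the penalty budget is redistributed toward the most confidently predicted positions.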
2 Variance-Gated Hard-Negative Mining (VG-HNM)
VG-HNM: Guaranteeing Training Efficiency
🖼️ Candidate Negative Set
All generated outputs that contain hallucinations
Many uninformative samples are mixed in
🎯 Variance-Gated Sampling
High-Variance Samples ✅
  • High model disagreement on these samples
  • Information-rich
Low-Variance Samples ❌
  • The model errs consistently → no gradient value
  • Filtered out
💰 Optimal Selection under a Budget Constraint
Only 60% of the samples used; higher accuracy achieved
Efficient Training Signal
Less wasted gradient; faster convergence
SOLVES THE TRAINING-EFFICIENCY PROBLEM
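The gating step above can be sketched as follows, assuming each candidate negative comes with scores from G sampled rollouts; `variance_gated_select` and the budget fraction are illustrative placeholders.

```python
import numpy as np

def variance_gated_select(sample_scores, budget_frac=0.6):
    """Variance-gated hard-negative mining (illustrative sketch).

    sample_scores: (N, G) array; for each of N candidate negatives,
    the scores (e.g. rewards) of G sampled responses.
    Keeps the budget_frac fraction with the highest across-rollout
    variance; low-variance samples, on which the model is consistently
    wrong, are dropped as carrying little gradient signal."""
    variances = sample_scores.var(axis=1)
    k = max(1, int(round(budget_frac * len(variances))))
    keep = np.argsort(variances)[::-1][:k]  # indices of top-k variance
    return np.sort(keep)

rng = np.random.default_rng(0)
scores = rng.normal(size=(10, 8))
scores[:5] *= 0.01  # first 5 samples: near-zero variance (consistent errors)
kept = variance_gated_select(scores, budget_frac=0.5)
assert len(kept) == 5 and all(i >= 5 for i in kept)
```

Only the high-disagreement half survives the gate, matching the "Only 60% of the samples used" budget reported above when `budget_frac=0.6`.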
3 GRPO-Based RL Training
GRPO: End-to-End Optimization Framework
📷 Multiple Response Sampling
Sample a group of G responses to the same question
🎛️ Intra-Group Relative Advantage Estimation
$A_i = \frac{r(x, y_i) - \bar{r}(x)}{\sigma_{\text{group}}}$
Baseline-free GRPO advantage estimation
Group-normalized rewards
⚖️ Entropy-Weighted GRPO Objective
$\mathcal{L}_{\text{HERO}} = -\mathbb{E}\left[\sum_i w_i^{\text{ent}} \cdot A_i \cdot \log\pi_\theta(y_i|x)\right]$
Joint optimization of the entropy penalty and the GRPO advantage
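A minimal sketch of the two formulas above, assuming response-level rewards and per-token DEP weights; names such as `group_advantages` and `hero_loss` are placeholders, not the authors' code.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO-style baseline-free advantages: within a group of G
    responses to the same prompt, A_i = (r_i - mean(r)) / std(r)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def hero_loss(logprobs, advantages, ent_weights):
    """One reading of the entropy-weighted GRPO objective:
    L = -E[ sum_i w_i^ent * A_i * log pi(y_i | x) ].

    logprobs: (G, T) per-token log-probs of each sampled response.
    advantages: (G,) group-normalized advantages.
    ent_weights: (G, T) per-token DEP weights (each row sums to 1)."""
    per_resp = (ent_weights * logprobs).sum(axis=1)  # weighted log-likelihood
    return -(advantages * per_resp).mean()

rewards = [1.0, 0.2, 0.8, 0.0]
A = group_advantages(rewards)
assert abs(A.mean()) < 1e-6 and abs(A.std() - 1.0) < 1e-3

lp = np.log(np.full((4, 3), 0.5))  # dummy per-token log-probs
w = np.full((4, 3), 1.0 / 3.0)     # uniform DEP weights
loss = hero_loss(lp, A, w)          # ~0: advantages sum to 0, log-probs equal
```

Because the advantage is normalized within the group, no separate value/baseline model is needed; the DEP weights then focus each response's gradient on its overconfident tokens.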
🏆 Visually Faithful Model
Hallucinations significantly reduced ✨
General capabilities preserved (MMBench)
END-TO-END JOINT OPTIMIZATION
Fig. The HERO three-module pipeline: DEP (Dynamic Entropy Penalty) computes per-token predictive entropy and assigns adaptive penalty weights, heavily penalizing low-entropy, high-confidence errors; VG-HNM (Variance-Gated Hard-Negative Mining) selects high-information training samples by response variance, filtering out low-variance, zero-gradient samples to save training cost; GRPO training fuses both into a unified RL objective for end-to-end enhancement of visual faithfulness.
📊
Experiments: State-of-the-Art Hallucination Mitigation
Quantitative comparison on mainstream hallucination benchmarks
| Method | Type | POPE ↑ | CHAIR ↓ | THRONE ↑ | MMBench ↑ | Note |
| --- | --- | --- | --- | --- | --- | --- |
| Vanilla LLaVA | Baseline | 82.1 | 14.2 | | | Original model |
| CoT Tuned | CoT | 79.5 | 18.7 | | | Hallucinations worsen |
| VCD | Uncertainty | 84.3 | 12.8 | 72.4 | | Limited effect |
| FoIL | Contrastive | 85.1 | 11.5 | 74.8 | | Partial improvement |
| RLHF | Alignment | 86.0 | 10.9 | 76.2 | 71.5 | Alignment tax |
| HERO (Ours) | GRPO+DEP+VG | 88.7 | 8.3 | 80.6 | 73.8 | No alignment tax |
🧪
Ablation Study: Every Module Contributes
Removing each module verifies its necessity
| Config | THRONE ↑ | CHAIR ↓ | Note |
| --- | --- | --- | --- |
| Vanilla GRPO | 78.4% | 10.5 | Baseline |
| + DEP (Entropy Weighting) | 80.6% | 9.1 | +2.2%; calibration works |
| + VG-HNM (Random 60%) | 79.2% | 9.8 | Random sampling inferior to variance gating |
| + VG-HNM (Variance Gate) | 80.6% | 8.7 | Better under the same budget |
| HERO (Full) | 80.6% | 8.3 | All three modules together are optimal |
🏆
Key Advantages: Why Does HERO Win?
Comprehensive advantages, from theoretical insight to experimental results
01
🪤
First Revelation of the "Confidence Trap"
Through rigorous controlled experiments (image degradation), we uncovered a long-overlooked phenomenon: CoT-tuned LVLMs do not become "more uncertain"; they fall into a low-entropy, high-confidence hallucination trap. This overturns the fundamental assumption behind traditional uncertainty-decoding methods.
New paradigm: from "reducing uncertainty" to "calibrating confidence"
02
📊
Dynamic Entropy Awareness: Per-Token Adaptivity
Different tokens differ in reliability: descriptive tokens are usually reliable, while inferential tokens are error-prone. HERO's DEP module computes an independent entropy penalty weight for each position: low entropy (overconfident, possibly wrong) gets a heavy penalty; high entropy (reasonably uncertain) gets a light penalty. This is not one-size-fits-all global regularization but a differentiated, token-level constraint.
THRONE +2.2% from DEP alone
03
🎯
Variance Gating: Learning Better with Fewer Samples
Not every negative sample carries training value. If the model makes the same mistake on every sample, those samples provide near-zero gradient. VG-HNM filters by response variance: it keeps only high-variance samples on which the model's responses disagree and discards uninformative low-variance noise. Result: using only 60% of the data outperforms full random sampling.
79.2% (random) → 80.6% (variance gate) ⭐
04
🛡️
No Alignment Tax: General Capabilities Fully Preserved
RL fine-tuning often harms a model's general understanding by over-optimizing the target metric (the "alignment tax"). HERO's MMBench results show that hallucination metrics drop sharply while general multimodal capability remains nearly intact. The entropy-aware mechanism naturally protects the model's ability to express uncertainty, avoiding the representation degradation caused by crudely compressing the probability space.
MMBench stays at 73.8 (vs. ~74 baseline) ✦
05
Faster Convergence, Better Performance
Variance gating not only improves final performance; it also accelerates training. By discarding uninformative samples, the gradient signal in each iteration is of higher quality, so HERO converges noticeably faster than vanilla GRPO for the same number of steps. For GPU-intensive LVLM fine-tuning, this means significant cost savings.
Training efficiency +30% 🚀
06
🔗
Three-Module Synergy > Simple Stacking
DEP, VG-HNM, and GRPO are not three independent patches but a tightly integrated system: DEP provides fine-grained token-level supervision signals; VG-HNM ensures those signals come from a high-quality sample subset; GRPO unifies both into a differentiable, end-to-end optimization objective. All three are indispensable.
System-level co-design ✦