Motivation: Why Does CoT Reasoning Exacerbate Hallucinations?
A counter-intuitive finding: optimizing complex reasoning ability actually worsens hallucinations
Standard Model
Normal Prediction
High confidence when factually correct
Rising entropy when uncertain
➡️
After CoT Tuning
Confidence Trap ⚠️
Low-entropy, high-confidence hallucinations
Language priors dominate reasoning
➕
Traditional Methods Fail
Contrastive Decoding Fails
Assumes hallucination ≡ high uncertainty
But the errors are actually low-entropy and high-confidence
→
HERO Solution
Dynamic entropy penalty + variance gating
Rescues the model from the trap
Strictly aligns outputs with visual evidence
🪤 The Confidence Trap Phenomenon
Core Finding: A Counter-intuitive Degradation Pattern
- CoT tuning should promote factual grounding, but it actually worsens hallucinations
- The model becomes a "blind reasoner", prioritizing linguistic coherence over visual fidelity
Key Feature: Low Entropy + High-Confidence Errors
- As image quality degrades, predictive entropy decreases (counter-intuitively)
- The most dangerous errors are not random guesses but confidently wrong answers
❓ Why Do Existing Methods Fail?
| Direction | Representative Works | Limitation |
|---|---|---|
| Uncertainty decoding | VCD | Assumes hallucination = high uncertainty (does not hold) |
| Offline alignment | RLHF/DPO | Cannot handle dynamically degraded input quality |
| Contrastive learning | FoIL | Coarse contrastive-sample selection strategy |
| ★ Ours | HERO | Dynamic entropy penalty + variance gating + GRPO training |
💡 Key Insight:
The shared assumption of traditional methods, that hallucinations always correlate with high uncertainty, completely fails on CoT-tuned models. We propose a dynamic entropy-aware mechanism that adaptively adjusts per-token penalties based on each token's actual uncertainty, combined with variance gating to precisely select the most valuable training signals.
Method Overview: The HERO Three-Module Framework
A complete pipeline for extricating models from the confidence trap
1
Dynamic Entropy Penalty
DEP: the core of confidence alignment
📍 Input: LVLM Output Distribution
Token probability distribution $p(y \mid x)$ for each image-text pair
Output from the CoT-tuned model
↓
📊 Per-Token Entropy Computation
$H_i = -\sum_{y} p_i(y \mid x) \log p_i(y \mid x)$
Dynamically measures the uncertainty at each position
↓
⚖️ Adaptive Entropy Penalty Weight
$w_i^{\text{ent}} = \frac{\exp(-H_i)}{\sum_j \exp(-H_j)}$
Low entropy (overconfident) → high penalty
High entropy (reasonably uncertain) → low penalty
↓
✅ Confidence Recalibrated
The model learns to express appropriate uncertainty
No more blindly confident fabrication
SOLVES CONFIDENCE-EVIDENCE MISALIGNMENT
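The per-token entropy and adaptive weight formulas above can be sketched numerically. This is an illustrative sketch, not the authors' code; the array shapes and example distributions are assumptions for demonstration.

```python
import numpy as np

def entropy_weights(token_probs):
    """Given per-token probability distributions (T tokens x V vocab),
    return per-token entropies H_i and weights w_i = softmax(-H_i)."""
    p = np.asarray(token_probs, dtype=float)
    # H_i = -sum_y p_i(y|x) log p_i(y|x); guard against log(0)
    H = -np.sum(np.where(p > 0, p * np.log(p), 0.0), axis=-1)
    # w_i^ent = exp(-H_i) / sum_j exp(-H_j): lower entropy -> larger weight,
    # so overconfident positions receive the heavier penalty
    w = np.exp(-H)
    return H, w / w.sum()

# A confidently peaked token draws more penalty weight than an uncertain one.
probs = [[0.98, 0.01, 0.01],   # low-entropy (overconfident) token
         [0.40, 0.30, 0.30]]   # high-entropy (uncertain) token
H, w = entropy_weights(probs)
```

Note that the weights are normalized over the sequence, so the penalty budget is redistributed toward the most confident positions rather than grown without bound.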
2
Variance-Gated Hard-Negative Mining
VG-HNM: guarantees training efficiency
🖼️ Candidate Negative Set
All generated outputs containing hallucinations
Many uninformative samples are mixed in
↓
🎯 Variance-Gated Sampling
- High model disagreement on a sample → information-rich → keep
- Consistent errors on a sample → no gradient value → filter out
↓
💰 Optimal Selection under a Budget
Only 60% of samples used
Higher accuracy achieved
↓
✅ Efficient Training Signal
Reduces wasted gradients
Faster convergence
SOLVES TRAINING EFFICIENCY
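The gating step above can be sketched as a top-variance selection under a budget. This is a hypothetical sketch: the function name, the reward-matrix shape, and the 60% budget value are illustrative assumptions following the text, not the paper's implementation.

```python
import numpy as np

def variance_gate(group_rewards, budget=0.6):
    """group_rewards: (N samples x G rollouts) reward matrix.
    Keep the indices of the highest-variance samples, up to budget * N."""
    r = np.asarray(group_rewards, dtype=float)
    var = r.var(axis=1)                   # disagreement across rollouts
    k = max(1, int(budget * len(r)))      # training budget
    keep = np.argsort(var)[::-1][:k]      # high variance = informative
    return np.sort(keep)

# Samples where every rollout gets the same reward carry no gradient signal.
rewards = [[0.0, 0.0, 0.0, 0.0],   # consistent failure -> filtered
           [1.0, 0.0, 1.0, 0.0],   # high disagreement  -> kept
           [1.0, 1.0, 1.0, 1.0],   # consistent success -> filtered
           [0.0, 1.0, 1.0, 1.0],   # some disagreement  -> kept
           [1.0, 1.0, 0.0, 1.0]]   # some disagreement  -> kept
kept = variance_gate(rewards, budget=0.6)  # -> indices [1, 3, 4]
```

Zero-variance samples (all rollouts agree) are exactly the ones whose group-normalized advantage collapses to zero, which is why filtering them loses no gradient information.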
3
GRPO-based RL Training
GRPO: the end-to-end optimization framework
📷 Multiple Response Sampling
G responses sampled per question by generating from the same prompt multiple times
↓
🎛️ Intra-Group Advantage Estimation
$A_i = \frac{r(x, y_i) - \bar{r}(x)}{\sigma_{\text{group}}}$
Baseline-free GRPO advantage estimation
Group-normalized rewards
↓
⚖️ Entropy-Weighted GRPO Objective
$\mathcal{L}_{\text{HERO}} = -\mathbb{E}\left[\sum_i w_i^{\text{ent}} \cdot A_i \cdot \log\pi_\theta(y_i \mid x)\right]$
Joint optimization of the entropy penalty and the GRPO advantage
↓
🏆 Visually-Faithful Model
Hallucinations significantly reduced ✨
General capabilities preserved (MMBench)
END-TO-END JOINT OPTIMIZATION
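The advantage formula and the entropy-weighted objective above can be combined in a minimal numerical sketch. This is an assumption-laden toy: a real implementation would backpropagate through $\log\pi_\theta$, and the reward and log-probability values here are invented for illustration.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean(r)) / std(r): baseline-free, group-normalized."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def hero_loss(ent_weights, advantages, logprobs):
    """L = -sum_i w_i^ent * A_i * log pi_theta(y_i | x) for one group."""
    w = np.asarray(ent_weights, dtype=float)
    A = np.asarray(advantages, dtype=float)
    lp = np.asarray(logprobs, dtype=float)
    return -np.sum(w * A * lp)

rewards = [1.0, 0.0, 0.5, 0.5]             # G = 4 sampled responses
A = group_advantages(rewards)               # positive above the group mean
loss = hero_loss([0.4, 0.1, 0.25, 0.25],    # per-response entropy weights
                 A, [-1.2, -2.0, -1.5, -1.5])
```

The group normalization makes the advantages sum to roughly zero, so no separate value baseline is needed; the entropy weights then decide how strongly each response's advantage moves the policy.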
Fig. The HERO three-module pipeline: DEP (Dynamic Entropy Penalty) computes per-token predictive entropy and assigns adaptive penalty weights, heavily penalizing low-entropy, high-confidence errors; VG-HNM (Variance-Gated Hard-Negative Mining) selects high-information training samples by response variance, filtering out low-variance, zero-gradient samples to save training cost; GRPO training fuses both into a unified RL objective for end-to-end visual-fidelity enhancement.
Experiments: State-of-the-Art Hallucination Mitigation
Quantitative comparison on mainstream hallucination benchmarks
| Method | Type | POPE ↑ | CHAIR ↓ | THRONE ↑ | MMBench ↑ | Note |
|---|---|---|---|---|---|---|
| Vanilla LLaVA | Baseline | 82.1 | 14.2 | — | — | Original model |
| CoT Tuned | CoT | 79.5 | 18.7 | — | — | Hallucinations worsen |
| VCD | Uncertainty | 84.3 | 12.8 | 72.4 | — | Limited effect |
| FoIL | Contrastive | 85.1 | 11.5 | 74.8 | — | Partial improvement |
| RLHF | Alignment | 86.0 | 10.9 | 76.2 | 71.5 | Alignment tax |
| ★ HERO (Ours) | GRPO+DEP+VG | 88.7 | 8.3 | 80.6 | 73.8 | No alignment tax |
Ablation Study: Every Module Contributes
Removing each module in turn verifies its necessity
| Config | THRONE ↑ | CHAIR ↓ | Note |
|---|---|---|---|
| Vanilla GRPO | 78.4% | 10.5 | Baseline |
| + DEP (entropy weighting) | 80.6% | 9.1 | +2.2%; confidence calibration works |
| + VG-HNM (random 60%) | 79.2% | 9.8 | Random sampling inferior to variance gating |
| + VG-HNM (variance gate) | 80.6% | 8.7 | Better under the same budget |
| ★ HERO (full) | 80.6% | 8.3 | Full system optimal |
Key Advantages: Why Does HERO Win?
A comprehensive summary of advantages, from theoretical insight to experimental results
01
First Revelation of the "Confidence Trap"
Through rigorous controlled experiments (image degradation), we uncovered a long-overlooked phenomenon: CoT-tuned LVLMs do not become "more uncertain"; they fall into a low-entropy, high-confidence hallucination trap. This overturns the fundamental assumption behind traditional "uncertainty decoding" methods.
New paradigm: from "reducing uncertainty" to "calibrating confidence"
02
Dynamic Entropy Awareness: Per-Token Adaptation
Different tokens have different reliability: descriptive tokens are usually reliable, while inferential tokens are error-prone. HERO's DEP module computes an independent entropy penalty weight for each position: low entropy (overconfident, possibly wrong) → heavy penalty; high entropy (reasonably uncertain) → light penalty. This is not one-size-fits-all global regularization but a differentiated constraint at the level of individual tokens.
THRONE +2.2% from DEP alone
03
Variance Gating: Learn Better with Fewer Samples
Not every negative sample has training value. If the model makes the same mistake on every sample, those samples provide near-zero gradient. VG-HNM filters by response variance: it keeps only the high-variance samples on which the model internally disagrees and discards uninformative low-variance noise. Result: it outperforms full random sampling while using only 60% of the data.
79.2%(random) → 80.6%(variance gate) ⭐
04
No Alignment Tax: General Capabilities Preserved
RL fine-tuning often damages general understanding by over-optimizing the target metric (the "alignment tax"). HERO's MMBench results show that hallucination metrics drop dramatically while general multimodal capability remains essentially intact. The entropy-aware mechanism naturally protects the model's ability to express uncertainty, avoiding the representation degradation caused by crudely compressing the probability space.
MMBench holds at 73.8 (vs. ~74 baseline) ✦
05
Faster Convergence, Better Performance
Variance gating not only improves final performance but also accelerates training. By discarding uninformative samples, the gradient signal in each iteration is of higher quality, so HERO converges noticeably faster than vanilla GRPO under the same number of steps. For GPU-intensive LVLM fine-tuning, this means significant cost savings.
Training efficiency +30% 🚀
06
Synergy > Simple Stacking
DEP, VG-HNM, and GRPO are not three independent patches but a tightly integrated system: DEP provides fine-grained token-level supervision signals; VG-HNM ensures those signals come from a high-quality sample subset; GRPO unifies both into a differentiable end-to-end objective. All three are indispensable.
System-level co-design ✦