Self-Alignment of Large Video Language Models with Refined Regularized Preference Optimization

Preprint. Under review.

Pritam Sarkar   Ali Etemad
Paper · Website · Code · Hugging Face (Models, Data)
Our contributions
    👉 We design a self-alignment framework that enables LVLMs to improve from their own errors (see the sketch after this list). We introduce RRPO, a preference optimization method that addresses the limitations of DPO by using refined sub-sequence-level rewards and a strong token-wise KL regularizer, yielding more precise alignment and more stable training.

    👉 Our rigorous evaluation demonstrates the effectiveness of our proposed method across diverse video tasks, including video hallucination, short- and long-video understanding, and fine-grained temporal reasoning, among others. Moreover, our experimental and theoretical analyses highlight the superiority of RRPO over DPO in aligning LVLMs.
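
As a minimal illustration (not our released pipeline), the sketch below shows how such preference pairs can be assembled: the model's own, possibly erroneous response serves as the dispreferred answer \(y^-\), and a refined correction of it serves as the preferred answer \(y^+\). The `generate` and `refine` callables are placeholders for the model's inference and the refinement step.

```python
from typing import Callable, List, Tuple

def build_preference_pairs(
    videos: List[str],
    prompts: List[str],
    generate: Callable[[str, str], str],     # placeholder: base LVLM inference
    refine: Callable[[str, str, str], str],  # placeholder: produces a corrected response
) -> List[Tuple[str, str, str, str]]:
    """Pair each model response (y-) with its refined correction (y+) for RRPO training."""
    pairs = []
    for video, prompt in zip(videos, prompts):
        y_neg = generate(video, prompt)       # the model's own answer, which may contain errors
        y_pos = refine(video, prompt, y_neg)  # refined version that fixes those errors
        pairs.append((video, prompt, y_pos, y_neg))
    return pairs
```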


An overview of our self-alignment framework.

An example of a perturbed video. A few training samples.


An Overview of Refined Regularized Preference Optimization

    Given an input \(x\) with a pair of responses \(\{y^+, y^-\}\), where \(y^+ \succ y^- \mid x\), we align \(\pi_\theta\) to favor \(y^+\) over \(y^-\). The RRPO training objective is defined as:

    \[ \mathcal{L}_\text{RRPO}(\pi_{\theta};\pi_\text{ref}) = -\mathbb{E}_{(x,y^+,y^-) \sim \mathcal{D}} \left[ \log \sigma (u) - \alpha \cdot \mathbb{D}_{\text{TKL}} \big(x,y^+\big) \right] \]

    The total reward margin \(u\), summed over the \(N\) phrase pairs, is defined as: \[ u = \sum\limits_{i=1}^{N} u_i = \sum\limits_{i=1}^{N} \bigl( r_\theta(x, y^+_i) - r_\theta(x, y^-_i) \bigr) \]

    The reward for the \(i^{\text{th}}\) phrase, \(r_\theta(x, y_i)\), is defined as: \[ r_\theta(x, y_i) = \beta \log \left( \frac{ \prod\limits_{j=s_i}^{e_i} \pi_\theta(t_j \mid x, t_{\lt j}) } { \prod\limits_{j=s_i}^{e_i} \pi_{\text{ref}}(t_j \mid x, t_{\lt j}) } \right) \] where \(s_i\) and \(e_i\) are the start and end token indices of the \(i^{\text{th}}\) phrase.
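
To make the phrase-level reward concrete, here is a minimal PyTorch-style sketch (not our released implementation) that turns per-token log-probabilities from the policy and the frozen reference model into \(r_\theta(x, y_i)\) for each annotated phrase; the tensor layout and the `spans` format are assumptions.

```python
import torch

def phrase_rewards(policy_logps, ref_logps, spans, beta=0.1):
    """Compute r_theta(x, y_i) = beta * sum_{j=s_i..e_i} [log pi_theta(t_j|x, t_<j) - log pi_ref(t_j|x, t_<j)].

    policy_logps, ref_logps: 1-D tensors of per-token log-probabilities of the
        response tokens under the policy and the frozen reference model.
    spans: list of (start, end) token indices (inclusive) marking the refined phrases.
    """
    rewards = []
    for s, e in spans:
        # log of a product of probabilities = sum of log-probabilities over the span
        log_ratio = policy_logps[s:e + 1].sum() - ref_logps[s:e + 1].sum()
        rewards.append(beta * log_ratio)
    return torch.stack(rewards)  # shape: (num_phrases,)
```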

    The token-wise KL regularizer \(\mathbb{D}_{\text{TKL}}\) is defined as: \[ \mathbb{D}_{\text{TKL}} \big(x,y^+;\pi_{\text{ref}} \,\|\, \pi_{\theta} \big) = \sum\limits_{t=1}^{|y^+|} \mathbb{D}_{\text{KL}} \left( \pi_{\text{ref}} (\cdot \mid [x, y^+_{\lt t}]) \,\|\, \pi_{\theta} (\cdot \mid [x, y^+_{\lt t}]) \right) \]
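
Combining the pieces above, the sketch below is a hedged reconstruction of the objective from these formulas (again, not our released code): it accumulates the total reward margin \(u\) over the \(N\) phrase pairs, computes the token-wise KL regularizer on the preferred response, and returns the per-example RRPO loss. The logit shapes and batching are assumptions.

```python
import torch
import torch.nn.functional as F

def token_wise_kl(ref_logits, policy_logits):
    """D_TKL: sum over positions of y+ of KL(pi_ref(.|x, y+_<t) || pi_theta(.|x, y+_<t)).

    ref_logits, policy_logits: (T, V) vocabulary logits at every position of y+;
    ref_logits come from the frozen reference model (no gradient).
    """
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    pol_logp = F.log_softmax(policy_logits, dim=-1)
    kl_per_pos = (ref_logp.exp() * (ref_logp - pol_logp)).sum(dim=-1)  # KL at each position
    return kl_per_pos.sum()

def rrpo_loss(pos_rewards, neg_rewards, d_tkl, alpha=0.1):
    """Per-example RRPO loss.

    pos_rewards / neg_rewards: (N,) phrase rewards for y+ and y- (see phrase_rewards above).
    d_tkl: token-wise KL regularizer computed on y+.
    """
    u = (pos_rewards - neg_rewards).sum()      # total reward margin over N phrase pairs
    return -(F.logsigmoid(u) - alpha * d_tkl)  # maximize the margin while keeping pi_theta close to pi_ref
```

In practice the reference model is kept frozen and these quantities are computed in batches; the sketch only illustrates the structure of the objective.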

Comparison with existing preference optimization methods. RRPO consistently outperforms existing alignment methods.
| Method | TVBench | VideoHallucer | VideoMME | MLVU | Δ / %Δ |
|---|---|---|---|---|---|
| LongVU-7B (base) | 53.7 | 39.2 | 56.2 | 63.6 | – |
| + DPO | 54.3 | 40.9 | 56.6 | 63.6 | 0.7 / 1.5 |
| + DPA | 54.6 | 40.3 | 56.9 | 63.9 | 0.7 / 1.5 |
| + TDPO | 53.9 | 41.4 | 57.0 | 63.8 | 0.8 / 1.9 |
| + DDPO | 54.2 | 41.7 | 56.7 | 63.6 | 0.9 / 2.0 |
| + RRPO (ours) | 56.5 | 44.0 | 57.7 | 64.5 | 2.5 / 5.4 |

Δ and %Δ denote the average absolute and relative improvements over the base model across the four benchmarks.


Performance relative to model divergence. RRPO exhibits the best performance-divergence trade-off.


RRPO shows consistent improvements over the base model and outperforms DPO across all benchmarks on diverse video tasks.
| Model | #Frames | TVBench | TempCompass | VideoHallucer | VidHalluc | MVBench | VideoMME | MLVU | LongVideoBench |
|---|---|---|---|---|---|---|---|---|---|
| VideoChat2 | 16 | 44.0 | 59.3 | 23.1 | 73.3 | 60.2 | 41.0 | 46.4 | 40.4 |
| VideoChat2 + DPO | 16 | 45.7 | 60.0 | 22.1 | 72.4 | 59.6 | 43.0 | 47.4 | 41.0 |
| VideoChat2 + RRPO | 16 | 45.8 | 60.2 | 32.9 | 76.4 | 59.0 | 44.3 | 47.9 | 42.8 |
| LLaVA-Video | 64 | 51.0 | 66.0 | 50.0 | 76.6 | 61.1 | 64.0 | 68.6 | 60.1 |
| LLaVA-Video + DPO | 64 | 51.9 | 66.4 | 53.3 | 76.5 | 60.6 | 63.1 | 67.4 | 59.4 |
| LLaVA-Video + RRPO | 64 | 51.9 | 66.8 | 55.7 | 76.5 | 62.2 | 64.5 | 69.1 | 60.4 |
| LLaVA-Video + RRPO (32f) | 64 | 52.2 | 67.4 | 55.8 | 76.6 | 62.1 | 64.5 | 69.4 | 60.1 |
| LongVU | 1 fps | 53.7 | 63.9 | 39.2 | 67.3 | 65.5 | 56.2 | 63.6 | 48.6 |
| LongVU + DPO | 1 fps | 54.3 | 64.3 | 40.9 | 68.5 | 65.9 | 56.6 | 63.6 | 49.4 |
| LongVU + RRPO | 1 fps | 56.5 | 64.5 | 44.0 | 71.7 | 66.8 | 57.7 | 64.5 | 49.7 |


Highlights
    👉 RRPO is more stable and effective than prior and concurrent preference optimization methods.
    👉 The fine-grained reward modeling in RRPO improves capabilities without diverging far from the initial reference state, thus preserving valuable prior knowledge.
    👉 RRPO scales effectively with more data and high-resolution inputs, and generalizes well across diverse LVLMs.
    👉 RRPO exhibits consistent improvements over the base models across all setups and diverse video tasks.

Read our paper for more insights!

Citation

Please cite our paper using the given BibTeX entry.




Contact me:

You may directly contact me at pritam.sarkar@queensu.ca or connect with me on LinkedIn.
I am on the job market for a full-time role as a researcher. If you find my experience a good fit, please reach out.