Uncovering the Hidden Dynamics of Video Self-supervised Learning under Distribution Shifts

NeurIPS 2023 (Spotlight)

Pritam Sarkar Ahmad Beirami Ali Etemad

[Paper]

[Code]

[Website]

Sample video frames of distribution shifts. In each category, the left frames show an in-distribution sample and the right frames show an out-of-distribution sample.

Abstract

Video self-supervised learning (VSSL) has made significant progress in recent years. However, the exact behavior and dynamics of these models under different forms of distribution shift are not yet known. In this paper, we comprehensively study the behavior of six popular self-supervised methods (v-SimCLR, v-MOCO, v-BYOL, v-SimSiam, v-DINO, v-MAE) in response to various forms of natural distribution shift, i.e., (i) context shift, (ii) viewpoint shift, (iii) actor shift, (iv) source shift, (v) generalizability to unknown classes (zero-shot), and (vi) open-set recognition. To perform this extensive study, we carefully craft a test bed consisting of 17 in-distribution and out-of-distribution benchmark pairs using available public datasets and a series of evaluation protocols to stress-test the different methods under the intended shifts. Our study uncovers a series of intriguing findings and interesting behaviors of VSSL methods. For instance, we observe that while video models generally struggle with context shifts, v-MAE and supervised learning exhibit more robustness. Moreover, our study shows that v-MAE is a strong temporal learner, whereas contrastive methods, v-SimCLR and v-MOCO, exhibit strong performances against viewpoint shifts. When studying the notion of open-set recognition, we notice a trade-off between closed-set and open-set recognition performance if the pretrained VSSL encoders are used without finetuning. We hope that our work will contribute to the development of robust video representation learning frameworks for various real-world scenarios.

Contributions

Key Insights

Q1. How do the learned spatial and temporal representations vary based on different VSSL pretraining methodologies? How robust are these representations to different distribution shifts?


(a) Experiment on disentangled temporal representation.	(b) Experiment on viewpoint invariance.	(c) Experiment on robustness against low-resolution inputs.

Highlights

v-Supervised and v-MAE are strong temporal learners and are therefore more robust under context shift than the contrastive and non-contrastive methods.
Contrastive methods are more robust to viewpoint shifts than non-contrastive methods, while v-MAE and v-Supervised perform worse under viewpoint shifts.
v-MAE is robust against extremely low-resolution inputs.
v-Supervised is extremely vulnerable in complex scenarios when multiple distribution shifts are applied concurrently.

Q2. How does finetuning influence the out-of-distribution generalization and zero-shot performance of VSSL?


(a) Comparing performance under real-world distribution shifts.	(b) Comparing performance under synthetic temporal perturbations.

Highlights

Finetuning generally helps VSSL in both in-distribution and out-of-distribution settings.
The benefits of finetuning largely vary between different VSSL methods and the type of distribution shifts.
Finetuning provides more benefits against actor shifts in comparison to viewpoint shifts.
The benefits of finetuning drop in more complex context-shift setups.
Finetuning can also degrade performance under source shift depending on the quality of the training benchmark.
Finetuning degrades robustness to temporal perturbations as it impairs the time-invariant representations of contrastive and non-contrastive methods.
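
The highlights above refer to synthetic temporal perturbations without listing them here. As a rough illustration only, the sketch below shows three common ways to disturb the temporal order of a clip: reversing, shuffling, and freezing on one frame. The function names and the choice of perturbations are assumptions made for this example, not the paper's protocol; a clip is modeled as a plain list of frames.

```python
import random


def reverse_clip(frames):
    """Return the clip played backwards (hypothetical helper)."""
    return list(reversed(frames))


def shuffle_clip(frames, seed=0):
    """Return the clip with frame order randomly permuted.

    The seed is fixed so the perturbation is reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(frames)  # copy, so the input clip is left untouched
    rng.shuffle(shuffled)
    return shuffled


def freeze_clip(frames, index=0):
    """Return a clip that repeats a single frame, removing all motion."""
    return [frames[index]] * len(frames)


# Stand-in for an 8-frame clip; real frames would be image arrays or tensors.
clip = list(range(8))
perturbed = {
    "reverse": reverse_clip(clip),
    "shuffle": shuffle_clip(clip),
    "freeze": freeze_clip(clip, index=3),
}
```

A model whose representation is largely time-invariant should give similar predictions on `clip` and on each entry of `perturbed`, which is one way to read the robustness claim above.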

Q3. How do VSSL methods perform on open-set problems? And what is the relationship between performance in closed-set vs. open-set recognition?


(a) Kinetics400/UCF (FT.) Comparing open macro-F1 scores vs. openness.	(b) Kinetics400/HMDB (FT.) Comparing open macro-F1 scores vs. openness.	(c) UCF101/HMDB (FT.) Comparing open macro-F1 scores vs. openness.	(d) The relationships between closed-set and open-set (Frozen)

Highlights

Contrastive methods demonstrate superior performance in open-set recognition when finetuned.
There is a trade-off between closed-set and open-set recognition when frozen pretrained encoders are used.
Strong frozen encoders have poor open-set performance, whereas slightly weaker frozen encoders perform better. Frozen v-MAE performs poorly in both open-set and closed-set recognition.
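
To make the terms in the figure captions concrete, here is a minimal sketch of how openness and an open macro-F1 score could be computed. It assumes the openness definition commonly attributed to Scheirer et al. and a simple confidence-threshold rejection rule; the paper may use a different definition or rejection method. `UNKNOWN`, `open_set_predict`, and the threshold value are hypothetical names and choices for this example.

```python
import math

UNKNOWN = -1  # hypothetical label for samples outside the known classes


def openness(n_train_classes, n_test_classes, n_target_classes=None):
    """Openness = 1 - sqrt(2 * |train| / (|test| + |target|)).

    Assumed definition; 0.0 means a fully closed-set problem.
    """
    if n_target_classes is None:
        n_target_classes = n_train_classes
    ratio = 2.0 * n_train_classes / (n_test_classes + n_target_classes)
    return 1.0 - math.sqrt(ratio)


def open_set_predict(probs, threshold=0.5):
    """Predict the argmax class, or UNKNOWN if its confidence is too low."""
    preds = []
    for p in probs:
        best = max(range(len(p)), key=lambda i: p[i])
        preds.append(best if p[best] >= threshold else UNKNOWN)
    return preds


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, with UNKNOWN treated as a class."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Under this definition, testing on exactly the training classes gives an openness of zero, and openness grows as more unseen classes appear at test time.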

Q4. Do different VSSL methods exhibit comparable decision-making patterns ('decision similarity') given the same training conditions? And how is this impacted by different distribution shifts?

The decision similarity between the video models on in-distribution (top) vs. out-of-distribution (bottom) data.

Highlights

The decision similarity decreases under distribution shifts, which further varies based on the type of shift.
Context and source shifts cause the most dissimilarity between decisions.
Decision similarity is lowest between v-Supervised and the VSSL methods, and between v-MAE and the other VSSL methods.
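
One simple way to think about decision similarity is the fraction of test samples on which two models predict the same class, whether or not that class is correct. The sketch below uses this agreement rate as an assumption for illustration; the paper's exact measure may differ, and the function names and example predictions are hypothetical.

```python
def decision_similarity(preds_a, preds_b):
    """Fraction of samples on which two models make the same prediction."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if not preds_a:
        raise ValueError("prediction lists must not be empty")
    agree = sum(a == b for a, b in zip(preds_a, preds_b))
    return agree / len(preds_a)


def similarity_matrix(preds_by_model):
    """Pairwise agreement for every ordered pair of models."""
    names = list(preds_by_model)
    return {
        (m, n): decision_similarity(preds_by_model[m], preds_by_model[n])
        for m in names
        for n in names
    }


# Made-up predictions on four test clips, for illustration only.
example_preds = {
    "v-SimCLR": [0, 1, 2, 2],
    "v-MAE": [0, 1, 1, 2],
}
matrix = similarity_matrix(example_preds)
# matrix[("v-SimCLR", "v-MAE")] is 0.75: the models agree on 3 of 4 clips.
```

Computing such a matrix once on in-distribution data and once on out-of-distribution data would allow the kind of top-versus-bottom comparison described in the figure caption.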

Read our paper for more insights!

Citation

Please cite our paper using the given BibTeX entry.

    @misc{sarkar2023ood,
      title={Uncovering the Hidden Dynamics of Video Self-supervised Learning under Distribution Shifts}, 
      author={Pritam Sarkar and Ahmad Beirami and Ali Etemad},
      year={2023},
      eprint={2306.02014},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
    }

Contact me:

You may directly contact me at pritam.sarkar@queensu.ca or connect with me on LinkedIn.
⭐ I am looking for internship opportunities in related areas; if you have an opening, please feel free to reach out. ⭐