XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning

AAAI 2024.

Pritam Sarkar   Ali Etemad
[Supplementary material]

Play the video to see how XKD works! The video has no sound.


We present XKD, a novel self-supervised framework to learn meaningful representations from unlabelled video clips. XKD is trained with two pseudo tasks. First, masked data reconstruction is performed to learn modality-specific representations. Next, self-supervised cross-modal knowledge distillation is performed between the two modalities through teacher-student setups to learn complementary information. To identify the most effective information to transfer, and to tackle the domain gap between the audio and visual modalities, which could hinder knowledge transfer, we introduce a domain alignment strategy for effective cross-modal distillation. Lastly, to develop a general-purpose solution capable of handling both audio and visual streams, we introduce a modality-agnostic variant of our framework, which uses the same backbone for both modalities. Our proposed cross-modal knowledge distillation improves linear evaluation top-1 accuracy of video action classification by 8.4% on UCF101, 8.1% on HMDB51, 13.8% on Kinetics-Sound, and 14.2% on Kinetics400. Additionally, our modality-agnostic variant shows promising results in developing a general-purpose network capable of handling different data streams.
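
To make the two pseudo tasks more concrete, here is a minimal toy sketch in plain Python. It is not the XKD implementation. The loss forms (mean squared error on masked positions, cross-entropy between temperature-softened distributions), the temperature values, and all function and key names are assumptions chosen for illustration. The domain alignment step is left out entirely.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def masked_reconstruction_loss(target, prediction, mask):
    """Mean squared error over masked positions only (mask[i] truthy)."""
    errors = [(t - p) ** 2 for t, p, m in zip(target, prediction, mask) if m]
    return sum(errors) / len(errors) if errors else 0.0

def distillation_loss(teacher_logits, student_logits,
                      teacher_temp=0.5, student_temp=1.0):
    """Cross-entropy between softened teacher and student distributions.

    The temperatures are hypothetical defaults, not values from the paper.
    """
    p = softmax(teacher_logits, teacher_temp)
    q = softmax(student_logits, student_temp)
    return -sum(pi * math.log(max(qi, 1e-12)) for pi, qi in zip(p, q))

def toy_total_loss(video, audio):
    """Sum of per-modality reconstruction and cross-modal distillation terms.

    `video` and `audio` are dicts with hypothetical keys: 'target',
    'prediction', 'mask', 'teacher_logits', and 'student_logits'.
    """
    recon = sum(
        masked_reconstruction_loss(m["target"], m["prediction"], m["mask"])
        for m in (video, audio)
    )
    # Each student learns from the teacher of the *other* modality.
    distill = (
        distillation_loss(audio["teacher_logits"], video["student_logits"])
        + distillation_loss(video["teacher_logits"], audio["student_logits"])
    )
    return recon + distill
```

In this sketch the distillation term shrinks as a student's output distribution moves toward the other modality's teacher, which is the sense in which complementary information is transferred.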


Ablation study


Effect of refinement


SOTA Comparison



Please cite our paper using the given BibTeX entry.

title={XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning},
author={Pritam Sarkar and Ali Etemad},


We are grateful to the Bank of Montreal and Mitacs for funding this research. We also thank the SciNet HPC Consortium for helping with the computational resources.


You may directly contact me at pritam.sarkar@queensu.ca or connect with me on LinkedIn.