Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity

AAAI 2023

Pritam Sarkar, Ali Etemad

[Paper]

[Appendix]

[ArXiv]

[Code]

[Website]

[Poster]

Abstract

We present CrissCross, a self-supervised framework for learning audio-visual representations. Our framework introduces a novel notion: in addition to learning intra-modal and standard synchronous cross-modal relations, CrissCross also learns asynchronous cross-modal relationships. We perform in-depth studies showing that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong generalized representations useful for a variety of downstream tasks. To pretrain our proposed solution, we use three datasets of varying sizes: Kinetics-Sound, Kinetics400, and AudioSet. The learned representations are evaluated on a number of downstream tasks, namely action recognition, sound classification, and action retrieval. Our experiments show that CrissCross either outperforms or performs on par with the current state-of-the-art self-supervised methods on action recognition and action retrieval with UCF101 and HMDB51, as well as sound classification with ESC50 and DCASE. Moreover, CrissCross outperforms fully-supervised pretraining when pretrained on Kinetics-Sound.

Results

We report top-1 accuracy averaged over all splits of each dataset. Note that the results below are obtained by full fine-tuning on UCF101 and HMDB51, and with a linear classifier on ESC50 and DCASE.

Pretraining Dataset	Pretraining Size	UCF101	HMDB51	ESC50	DCASE	Model
Kinetics-Sound	22K	88.3%	60.5%	82.8%	93.0%	visual; audio
Kinetics400	240K	91.5%	64.7%	86.8%	96.0%	visual; audio
AudioSet	1.8M	92.4%	67.4%	90.5%	97.0%	visual; audio
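
To make the reporting convention concrete, here is a minimal sketch of how top-1 accuracy can be computed per split and then averaged. The function names and the toy predictions are hypothetical and for illustration only; they are not taken from the CrissCross code, and the number of splits shown is arbitrary.

```python
def top1_accuracy(predictions, labels):
    """Fraction of samples whose predicted class equals the true class."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty split")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def mean_over_splits(per_split):
    """Average top-1 accuracy over a list of (predictions, labels) pairs."""
    scores = [top1_accuracy(preds, labels) for preds, labels in per_split]
    return sum(scores) / len(scores)


# Toy data (hypothetical): three splits with four samples each.
splits = [
    ([0, 1, 2, 2], [0, 1, 2, 1]),  # 3 of 4 correct
    ([1, 1, 0, 2], [1, 0, 0, 2]),  # 3 of 4 correct
    ([2, 0, 1, 1], [2, 0, 1, 1]),  # 4 of 4 correct
]
mean_top1 = mean_over_splits(splits)
print(f"mean top-1 accuracy: {mean_top1:.4f}")
```

Each split contributes equally to the mean here, regardless of its size; weighting by split size would be a different convention.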

Qualitative Analysis

We visualize the nearest neighbors for video-to-video and audio-to-audio retrieval. We use Kinetics400 to pretrain CrissCross. The pretrained backbones are then used to extract feature vectors from Kinetics-Sound. We use Kinetics-Sound for this experiment because it consists of action classes that are prominently manifested both audibly and visually. Next, we use the features extracted from the validation split to query the training features.
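
The retrieval step described above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes cosine similarity as the ranking metric (the page does not state which metric is used), the function names are hypothetical, and toy 2-D vectors stand in for the backbone features.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k_neighbors(query, gallery, k=5):
    """Return indices of the k gallery vectors most similar to the query.

    Similarity is cosine similarity (an assumption made for this sketch).
    Ties are broken by the lower gallery index.
    """
    q = l2_normalize(query)
    scored = []
    for idx, item in enumerate(gallery):
        g = l2_normalize(item)
        similarity = sum(a * b for a, b in zip(q, g))
        scored.append((similarity, idx))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy stand-ins (hypothetical): features of six training clips and one validation query.
train_features = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
    [0.7, 0.7],
    [0.1, 0.9],
]
val_query = [1.0, 0.05]

neighbors = top_k_neighbors(val_query, train_features, k=5)
print(neighbors)
```

With real features, the same ranking would typically be done with a batched matrix product over normalized feature arrays rather than Python loops.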
Note: The videos below are loaded directly from YouTube, which may take some time depending on your Internet connection.

Video-to-Video Retrievals

The leftmost video in each row is the query, and the next 5 videos are its top-5 nearest neighbors. Please view in full-screen for better visibility.

Audio-to-Audio Retrievals

The leftmost video in each row is the query, and the next 5 videos are its top-5 nearest neighbors. Please view in full-screen for better visibility.

Citation

Please cite our paper using the given BibTeX entry.

@misc{sarkar2021crisscross,
  title={Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity},
  author={Pritam Sarkar and Ali Etemad},
  year={2021},
  eprint={2111.05329},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Acknowledgements

We are grateful to the Bank of Montreal and Mitacs for funding this research. We are also thankful to the SciNet HPC Consortium for help with computational resources.

Question

You may contact me directly at pritam.sarkar@queensu.ca or connect with me on LinkedIn.