Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity

Preprint. Under review.

Pritam Sarkar   Ali Etemad
[Paper]
[Code]
[Project Page]


CrissCross

Abstract

We present CrissCross, a self-supervised framework for learning audio-visual representations. Our framework introduces a novel notion: in addition to learning the intra-modal and standard synchronous cross-modal relations, CrissCross also learns asynchronous cross-modal relationships. We show that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong generalized representations. Our experiments show that strong augmentations for both the audio and visual modalities, combined with the relaxation of cross-modal temporal synchronicity, optimize performance. To pretrain our proposed framework, we use three datasets of varying sizes: Kinetics-Sound, Kinetics400, and AudioSet. The learned representations are evaluated on a number of downstream tasks, namely action recognition, sound classification, and retrieval. CrissCross achieves state-of-the-art performance on action recognition (UCF101 and HMDB51) and sound classification (ESC50 and DCASE). The code and pretrained models are publicly available.
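
To make the idea concrete, the snippet below is a minimal, hypothetical sketch of how intra-modal, synchronous cross-modal, and relaxed (asynchronous) cross-modal relations could be combined into one training objective. The cosine-based loss, the function names, and the weighting term w_async are illustrative assumptions and do not reproduce the exact CrissCross objective; please refer to the paper and the released code for the actual formulation.

import torch
import torch.nn.functional as F

def alignment_loss(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # A simple alignment loss: 1 - cosine similarity, averaged over the batch.
    return 1.0 - F.cosine_similarity(a, b, dim=-1).mean()

def crisscross_style_loss(v_t, a_t, v_shift, a_shift, w_async: float = 1.0) -> torch.Tensor:
    # v_t, a_t: visual/audio embeddings of temporally aligned (synchronous) clips.
    # v_shift, a_shift: embeddings of temporally shifted clips from the same videos.
    # All tensors have shape (batch, dim).
    intra = alignment_loss(v_t, v_shift) + alignment_loss(a_t, a_shift)  # intra-modal relations
    sync = alignment_loss(v_t, a_t)                                      # synchronous cross-modal relation
    asyn = alignment_loss(v_t, a_shift) + alignment_loss(v_shift, a_t)   # asynchronous (relaxed) cross-modal relations
    return intra + sync + w_async * asyn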


Results

We report top-1 accuracy averaged over all splits of each dataset. Please note that the results below are obtained by full finetuning on UCF101 and HMDB51, and by training a linear classifier on ESC50 and DCASE (a minimal sketch of the linear-probe setup follows the table).


Pretraining Dataset | Pretraining Size | UCF101 | HMDB51 | ESC50 | DCASE | Model
Kinetics-Sound      | 22K              | 88.3%  | 60.5%  | 82.8% | 93.0% | visual; audio
Kinetics400         | 240K             | 91.5%  | 64.7%  | 86.8% | 96.0% | visual; audio
AudioSet            | 1.8M             | 92.4%  | 66.8%  | 90.5% | 97.0% | visual; audio
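
For reference, below is a minimal sketch of the linear-classifier evaluation, assuming the standard linear-evaluation protocol in which the pretrained backbone is frozen and only a linear head is trained on its features. The names backbone, feat_dim, and num_classes are placeholders for illustration, not identifiers from our released code.

import torch.nn as nn

def build_linear_probe(backbone: nn.Module, feat_dim: int, num_classes: int) -> nn.Module:
    # Freeze the pretrained encoder so only the linear head receives gradients.
    for param in backbone.parameters():
        param.requires_grad = False
    # Train a single linear layer on top of the frozen features.
    return nn.Sequential(backbone, nn.Linear(feat_dim, num_classes))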

Qualitative Analysis

We visualize the nearest neighbors for video-to-video and audio-to-audio retrieval. We pretrain CrissCross on Kinetics400 and then use the pretrained backbones to extract feature vectors from Kinetics-Sound. We choose Kinetics-Sound for this experiment since it consists of action classes that are prominently manifested both audibly and visually. Next, we use the features extracted from the validation split to query the training features, as sketched below.
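
Below is a minimal sketch of this retrieval step, assuming the features of the validation (query) and training splits have already been extracted into two tensors. The variable names and the use of cosine similarity are illustrative assumptions.

import torch
import torch.nn.functional as F

def topk_neighbors(query_feats: torch.Tensor, train_feats: torch.Tensor, k: int = 5) -> torch.Tensor:
    # query_feats: (num_queries, dim), train_feats: (num_train, dim).
    q = F.normalize(query_feats, dim=-1)  # unit-normalize so the dot product equals cosine similarity
    t = F.normalize(train_feats, dim=-1)
    sim = q @ t.T                         # (num_queries, num_train) similarity matrix
    return sim.topk(k, dim=-1).indices    # indices of the top-k nearest training samples per query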


Video-to-Video Retrievals

The leftmost video in each row represents the query, and the next 5 videos represent its top-5 nearest neighbors. Please view in full screen for better visibility.







Audio-to-Audio Retrievals

The leftmost video in each row represents the query, and the next 5 videos represent its top-5 nearest neighbors. Please view in full screen for better visibility.







Citation

Please cite our paper using the BibTeX entry below.


@misc{sarkar2021crisscross,
      title={Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity},
      author={Pritam Sarkar and Ali Etemad},
      year={2021},
      eprint={2111.05329},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}



Acknowledgements

We are grateful to Bank of Montreal and Mitacs for funding this research. We are also thankful to the Vector Institute and the SciNet HPC Consortium for providing the computational resources.

Questions

You may contact me directly at pritam.sarkar@queensu.ca or connect with me on LinkedIn.