Contrastive Multi-Modal Video Transformer

Advisor(s)

Dr. Ismini Lourentzou; Virginia Tech

Dr. Chengxiang Zhai; University of Illinois

Discipline

Computer Science

Start Date

21-4-2021 10:25 AM

End Date

21-4-2021 10:40 AM

Abstract

Transformer networks have shown great promise in video classification and understanding tasks by reducing the dependency on recurrent networks and instead using self-attention. Recurrent techniques are often unsuitable for video data with long-term temporal dependencies because of the vanishing gradient problem, and because computational constraints prevent backpropagating fully through long sequences. With self-attention, a network can learn long-term dependencies at lower computational cost and with higher accuracy. We aim to improve the performance of self-supervised video transformer networks by contrasting frame-level local representations with video-level global representations, and by studying how contrasting representations from different data modalities can further improve accuracy.
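
A minimal sketch of the local-global contrastive idea described above, assuming an InfoNCE-style loss, per-frame embeddings from a transformer encoder, and mean pooling for the video-level embedding. These choices (function names, dimensions, temperature, pooling) are illustrative assumptions, not details taken from the project.

```python
# Hypothetical sketch: contrast frame-level (local) embeddings against
# video-level (global) embeddings with an InfoNCE-style loss.
import torch
import torch.nn.functional as F

def local_global_contrastive_loss(frame_emb, video_emb, temperature=0.07):
    """frame_emb: (B, T, D) per-frame embeddings from an encoder.
       video_emb: (B, D) pooled video-level embeddings.
       Each frame is pulled toward its own video's global embedding and
       pushed away from the other videos' embeddings in the batch."""
    B, T, D = frame_emb.shape
    frames = F.normalize(frame_emb.reshape(B * T, D), dim=-1)  # (B*T, D)
    videos = F.normalize(video_emb, dim=-1)                    # (B, D)
    logits = frames @ videos.t() / temperature                 # (B*T, B)
    # Positive pair: frame i*T + t belongs to video i.
    targets = torch.arange(B, device=frame_emb.device).repeat_interleave(T)
    return F.cross_entropy(logits, targets)

# Example usage with random tensors standing in for encoder outputs.
frame_emb = torch.randn(4, 16, 256)  # 4 videos, 16 frames each, 256-dim
video_emb = frame_emb.mean(dim=1)    # e.g., mean-pooled global representation
loss = local_global_contrastive_loss(frame_emb, video_emb)
```

The same structure could, in principle, be reused across modalities (e.g., contrasting audio-frame embeddings against a video-level embedding) by swapping the inputs; that extension is likewise only a sketch of the stated goal, not a confirmed implementation detail.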
