Semantic Contrastive Multi-Modal Video Transformer

Session Number

Project ID: CMPS 07

Advisor(s)

ChengXiang Zhai, University of Illinois

Dr. Ismini Lourentzou, Virginia Tech

Discipline

Computer Science

Start Date

20-4-2022 9:10 AM

End Date

20-4-2022 9:25 AM

Abstract

We present a transformer-based architecture for learning semantic multi-modal representations of videos from unlabeled data. While multi-modal transformer architectures have been shown to improve accuracy on video classification and feature learning tasks, these techniques do not incorporate semantic information. Our Semantic Contrastive Multi-Modal Video Transformer (SCMMVT) takes raw video, audio, and text data and generates semantic multi-modal representations that capture connections and relations between portions of a video. We integrate multiple pre-trained architectures and evaluate feature extraction performance on video action recognition downstream tasks.
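The abstract does not spell out the contrastive objective, so the sketch below is only one plausible reading: a symmetric InfoNCE-style loss applied pairwise across video, audio, and text embeddings produced by pre-trained encoders. The function name, the 512-dimensional embeddings, the 0.07 temperature, and the pairwise averaging are illustrative assumptions, not details taken from SCMMVT.

```python
# Minimal sketch of a cross-modal contrastive (InfoNCE-style) objective,
# assuming per-clip embeddings from pre-trained video/audio/text encoders.
# All names and hyperparameters here are assumptions for illustration.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of modality embeddings of shape (B, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # matching clips are positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example: embeddings for a batch of 8 clips from three modality encoders.
video = torch.randn(8, 512)
audio = torch.randn(8, 512)
text = torch.randn(8, 512)

# Pairwise multi-modal contrastive loss averaged over all modality pairs.
loss = (info_nce(video, text) + info_nce(video, audio) + info_nce(audio, text)) / 3
```

In such a setup, clips whose video, audio, and transcript embeddings come from the same moment act as positives, pulling the modalities into a shared semantic space before the downstream action recognition evaluation.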
