Semantic Contrastive Multi-Modal Video Transformer
Session Number
Project ID: CMPS 07
Advisor(s)
ChengXiang Zhai, University of Illinois
Dr. Ismini Lourentzou, Virginia Tech
Discipline
Computer Science
Start Date
20-4-2022 9:10 AM
End Date
20-4-2022 9:25 AM
Abstract
We present an architecture that learns semantic multi-modal representations of videos from unlabeled data using transformers. While multi-modal transformer architectures have been shown to improve accuracy on video classification and feature-learning tasks, these techniques do not incorporate semantic information. Our Semantic Contrastive Multi-Modal Video Transformer (SCMMVT) takes raw video, audio, and text data and generates semantic multi-modal representations that capture connections and relations between portions of the video. We integrate multiple pre-trained architectures and evaluate feature-extraction performance on video action recognition downstream tasks.
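The abstract does not specify the training objective, but contrastive multi-modal representation learning of this kind is commonly trained with a symmetric InfoNCE loss that pulls together embeddings of the same clip from different modalities (e.g. video and text) while pushing apart mismatched pairs. The sketch below is an illustrative NumPy implementation of that generic loss, not the authors' actual method; the function name and temperature value are assumptions.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    a, b: (batch, dim) arrays of L2-normalized embeddings from two
    modalities (e.g. video and text). Row i of `a` and row i of `b`
    form a positive pair; all other rows serve as negatives.
    Hypothetical sketch -- not the SCMMVT paper's exact objective.
    """
    logits = a @ b.T / temperature  # pairwise cosine similarities
    labels = np.arange(len(a))      # positives lie on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the video->text and text->video directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

As a sanity check, aligned embedding pairs should yield a much lower loss than randomly paired ones, since the diagonal similarities dominate the softmax.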