Cross-Modal Emotion Alignment Between Audio and Text Using Embedding Models*

Session Number

3

Advisor(s)

Dr. Phadmakar Patankar, IMSA

Location

A129

Discipline

Computer Science

Start Date

15-4-2026 2:15 PM

End Date

15-4-2026 3:00 PM

Abstract

Recent developments in artificial intelligence enable machine learning models to represent complex information, including language and sound, in numerical form. This project examines whether emotional information expressed in text can be related to emotional information expressed in audio signals. Emotional states described in text were converted into semantic embeddings using a transformer-based language model, while audio signals were converted into feature representations using spectral analysis. The analysis examines whether emotionally related text and audio representations cluster together in a shared embedding space, and whether cross-modal similarity between text and audio can be quantified using cosine similarity measures. Preliminary results suggest that some emotional states align across modalities, indicating that machine learning representations of language and sound can capture commonalities between them. This study contributes to ongoing research on multimodal artificial intelligence by providing a proof-of-concept pipeline for analyzing emotional correspondence between two types of data. Possible applications of the proposed pipeline include emotion recognition systems, multimedia information retrieval systems, and human-centric artificial intelligence systems.
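The core comparison step described above — quantifying cross-modal similarity with cosine similarity — can be sketched as follows. This is a minimal illustration, not the project's actual code: the embedding vectors here are random placeholders standing in for outputs of a transformer language model (text) and spectral audio features projected into a shared dimension, and the dimension of 384 is an assumption for illustration.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 384  # assumed shared embedding dimension (hypothetical)

# Placeholders: in the real pipeline, `text_emb` would come from a
# transformer-based language model and `audio_emb` from spectral
# features mapped into the same space.
text_emb = rng.normal(size=dim)
audio_emb = text_emb + 0.1 * rng.normal(size=dim)  # an emotionally "related" vector
unrelated = rng.normal(size=dim)                   # an unrelated vector

# Related pairs score near 1.0; unrelated pairs score near 0.0.
print(f"related:   {cosine_similarity(text_emb, audio_emb):.3f}")
print(f"unrelated: {cosine_similarity(text_emb, unrelated):.3f}")
```

In practice, the alignment question is whether genuine text–audio pairs sharing an emotional state score consistently higher than mismatched pairs under this measure.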
