Cross-Modal Emotion Alignment Between Audio and Text Using Embedding Models*
Session Number
3
Advisor(s)
Dr. Phadmakar Patankar, IMSA
Location
A129
Discipline
Computer Science
Start Date
15-4-2026 2:15 PM
End Date
15-4-2026 3:00 PM
Abstract
Recent developments in artificial intelligence enable machine learning algorithms to learn numerical representations of complex information, including language and sound. This project examines whether emotional information expressed in text can be related to emotional information expressed in audio signals. Emotional states described in text were converted into semantic embeddings using a transformer-based language model, while audio signals were converted into feature representations using spectral analysis. The analysis examines whether emotionally related text and audio representations cluster together in a shared embedding space, and whether cross-modal similarity between text and audio can be quantified using cosine similarity. Preliminary results suggest that some emotional states correspond across the two modalities, indicating that machine learning representations of language and sound can capture commonalities between them. This study contributes to ongoing research on multimodal artificial intelligence by providing a proof-of-concept pipeline for analyzing emotional correspondence between two types of data. Possible applications of the proposed pipeline include emotion recognition, multimedia information retrieval, and human-centric artificial intelligence systems.
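The pipeline described above (transformer text embeddings, spectral audio features, and a cosine similarity comparison) could look roughly like the following minimal sketch. The sentence-transformers model name, the MFCC features, the file name, and the random projection into a shared space are illustrative assumptions, not the project's actual implementation.

# Minimal sketch of the cross-modal comparison step described in the abstract.
# Assumptions: "all-MiniLM-L6-v2" from sentence-transformers for text, librosa
# MFCC features for audio, and a placeholder random projection to align sizes.
import numpy as np
import librosa
from sentence_transformers import SentenceTransformer

def text_embedding(text: str, model: SentenceTransformer) -> np.ndarray:
    # Encode an emotion-bearing sentence into a semantic embedding vector.
    return model.encode(text, convert_to_numpy=True)

def audio_embedding(path: str, n_mfcc: int = 40) -> np.ndarray:
    # Summarize an audio clip as the mean of its MFCC frames (spectral features).
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
    return mfcc.mean(axis=1)                                 # shape: (n_mfcc,)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cross-modal similarity measure between the two representations.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

if __name__ == "__main__":
    model = SentenceTransformer("all-MiniLM-L6-v2")
    t_vec = text_embedding("I feel overjoyed and full of energy.", model)
    a_vec = audio_embedding("happy_clip.wav")  # hypothetical audio file
    # The raw vectors have different dimensionalities, so some projection into a
    # shared space is required before comparison; a fixed random projection is
    # used here purely as a placeholder for a learned alignment.
    rng = np.random.default_rng(0)
    projection = rng.standard_normal((t_vec.shape[0], a_vec.shape[0]))
    a_proj = projection @ a_vec
    print(f"cross-modal cosine similarity: {cosine_similarity(t_vec, a_proj):.3f}")

In practice, the projection step would be replaced by whatever shared-space mapping the project uses; the sketch only illustrates how cosine similarity quantifies agreement between a text embedding and an audio feature vector.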