Supervisors: Lukas Fischer & Sarah Ebling
This project aims to develop a method for aligning video frames with their corresponding Audio Descriptions (ADs). We plan to use CLIP, a model that maps both text and images into a shared representation space, to identify the most relevant frames for each description via a nearest-neighbor search. We will evaluate the approach by measuring its impact on multimodal translation, where the aligned frames serve as additional input alongside the AD text.
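The core alignment step can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes that CLIP text embeddings for the ADs and CLIP image embeddings for the video frames have already been computed (e.g. with an off-the-shelf CLIP checkpoint), and it performs the nearest-neighbor search over cosine similarity in the shared embedding space:

```python
import numpy as np

def align_ads_to_frames(text_embs: np.ndarray, frame_embs: np.ndarray) -> np.ndarray:
    """For each AD embedding, return the index of the nearest frame embedding.

    text_embs:  (num_ads, dim)    CLIP text embeddings of the ADs
    frame_embs: (num_frames, dim) CLIP image embeddings of the frames
    """
    # Normalize to unit length so the dot product equals cosine similarity,
    # which is the similarity CLIP is trained to maximize for matching pairs.
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    sim = t @ f.T                 # (num_ads, num_frames) similarity matrix
    return sim.argmax(axis=1)     # nearest frame index per AD

# Toy example with 2-dimensional stand-in embeddings:
ads = np.array([[1.0, 0.0], [0.0, 1.0]])
frames = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
print(align_ads_to_frames(ads, frames))  # -> [0 1]
```

In practice one would likely restrict the search to frames within each AD's timestamped interval rather than searching the whole video, and could return the top-k frames instead of a single nearest neighbor.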