Supervisors: Lukas Fischer & Sarah Ebling
This project aims to develop a method for aligning video frames with their corresponding Audio Descriptions (ADs). We plan to use CLIP, a model that maps both text and images into a shared representation space, to identify the most relevant frames for each description via a nearest-neighbor search. We will evaluate the approach by measuring its impact on multimodal translation, where the aligned frames serve as additional input alongside the AD text.
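The core alignment step can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes that CLIP text embeddings for the ADs and CLIP image embeddings for the video frames have already been computed (e.g. with an off-the-shelf CLIP checkpoint), and it performs the nearest-neighbor search over cosine similarity in the shared embedding space:

```python
import numpy as np

def align_ads_to_frames(text_embs: np.ndarray, frame_embs: np.ndarray) -> np.ndarray:
    """For each AD embedding, return the index of the nearest frame embedding.

    text_embs:  (num_ads, dim)    CLIP text embeddings of the ADs
    frame_embs: (num_frames, dim) CLIP image embeddings of the frames
    """
    # Normalize to unit length so the dot product equals cosine similarity,
    # which is the similarity CLIP is trained to maximize for matching pairs.
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    sim = t @ f.T                 # (num_ads, num_frames) similarity matrix
    return sim.argmax(axis=1)     # nearest frame index per AD

# Toy example with 2-dimensional stand-in embeddings:
ads = np.array([[1.0, 0.0], [0.0, 1.0]])
frames = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
print(align_ads_to_frames(ads, frames))  # -> [0 1]
```

In practice one would likely restrict the search to frames within each AD's timestamped interval rather than searching the whole video, and could return the top-k frames instead of a single nearest neighbor.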