Abstract: The main purpose of multimodal machine translation (MMT) is to improve the quality of translation results by taking the corresponding visual context as an additional input. Recently many ...
VideoPrism is a general-purpose video encoder designed to handle a wide spectrum of video understanding tasks, including classification, retrieval, localization, captioning, and question answering. It ...
Abstract: Transformers are widely used in natural language processing and computer vision, and Bidirectional Encoder Representations from Transformers (BERT) is one of the most popular pre-trained ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results