Grounding language in video
Linking natural language to visual data is an important topic at the intersection of natural language processing and computer vision, with applications such as image and video captioning and language-based video retrieval. This dissertation considers two closely related research tasks within this topic. i) The first is to develop methods that learn to describe video data with natural language. I present two methods that learn representations of word meanings in a weakly supervised fashion from short video clips paired with sentences. I first train with only positive sentences, ones that are true of the video, and then improve performance by additionally training with negative sentences, ones that are false of the video. The learned word meanings are then used to automatically generate descriptions of new videos. ii) The second is to use natural language to guide the solution of computer vision problems, such as video object codetection. I propose a language-directed framework for video object codetection in which sentences paired with videos constrain the search space of object proposals in those videos. This allows objects of different classes to be detected simultaneously across multiple videos, guided by the semantics of the paired sentences. The two tasks are related in that each can be viewed as the reverse process of the other. In both, success requires bridging the gap between the high-level semantics of sentences and the low-level visual features extracted from image pixels.
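The training scheme of positive and negative sentences can be illustrated with a minimal sketch. This is a hypothetical toy model, not the dissertation's actual method: it assumes a video clip is summarized as a feature vector, a sentence as a bag-of-words vector, and a bilinear scorer measures their compatibility, trained with a max-margin update so that a sentence true of a clip outscores one that is false of it.

```python
import numpy as np

# Toy, illustrative setup (not the dissertation's model): videos and
# sentences are fixed-length vectors, and a bilinear form scores them.
rng = np.random.default_rng(0)
D_VID, D_SENT = 8, 5
W = rng.normal(scale=0.1, size=(D_VID, D_SENT))  # compatibility weights

def score(video, sentence, W):
    """Bilinear video-sentence compatibility score."""
    return video @ W @ sentence

def train_step(W, video, pos_sent, neg_sent, lr=0.1, margin=1.0):
    """One max-margin update: a true (positive) sentence should
    outscore a false (negative) one by at least `margin`."""
    loss = margin - score(video, pos_sent, W) + score(video, neg_sent, W)
    if loss > 0:  # hinge violated: push the positive up, the negative down
        W = W + lr * (np.outer(video, pos_sent) - np.outer(video, neg_sent))
    return W

# Toy data: one clip, one sentence true of it, one sentence false of it.
video = rng.normal(size=D_VID)
pos = np.array([1.0, 1, 0, 0, 0])   # positive sentence
neg = np.array([0.0, 0, 1, 1, 0])   # negative sentence

for _ in range(50):
    W = train_step(W, video, pos, neg)

# After training, the true sentence is the better description of the clip.
assert score(video, pos, W) > score(video, neg, W)
```

The role of the negative sentences here mirrors the abstract: training on positives alone only rewards compatibility, while the hinge term actively separates true descriptions from false ones.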
Siskind, Purdue University.
Computer Engineering | Artificial Intelligence | Computer Science