Reasoning across language and vision in machines and humans

Andrei Barbu, Purdue University

Abstract

Humans not only outperform AI and computer-vision systems, but also rely on an unknown computational mechanism to perform tasks for which no suitable approaches exist. I present work investigating both novel tasks and how humans approach them in the context of computer vision and linguistics. I demonstrate a system which, like children, acquires high-level linguistic knowledge about the world. Robots learn to play physically instantiated board games and use that knowledge to engage in physical play. To further integrate language and vision, I develop an approach which produces rich sentential descriptions of events depicted in videos. I then show how to simultaneously detect and track objects, recognize events, and produce sentences. This tighter integration of language and vision enables a novel task: sentential video retrieval, in which a video corpus is searched for clips that depict a target sentence rather than a mere collection of individual query words. This work assumes a compositional representation of events, composing sentence models from word models. Perhaps humans perform such tasks with ease because of a tight integration of language and vision that exploits the compositionality inherent in both modalities. I present work indicating that this may be the case: subjects are shown videos while fMRI data are acquired, and sentences which describe those videos are recovered compositionally from the fMRI data.
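The central technical idea above, composing a sentence model out of independent word models that each score visual evidence, can be conveyed with a small sketch. The Python below is a toy illustration under assumed names (Track, approached, leftward, LEXICON, and sentence_score are all hypothetical), not the dissertation's actual system, which uses HMM-based word models and joint optimization: here each word model is a simple scoring function over object tracks, and a sentence's score composes the scores of its words applied to the tracks filling their argument roles.

# A minimal sketch, assuming precomputed per-frame object tracks.
# Word models are toy scoring functions; real ones would be HMMs
# over detections, trained or hand-crafted per word.

from typing import Callable, Dict, List, Tuple

Track = List[Tuple[float, float]]     # per-frame (x, y) positions
WordModel = Callable[..., float]      # scores tracks; higher is better

def approached(agent: Track, patient: Track) -> float:
    """Score how much the distance between two tracks shrinks over time."""
    dist = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    return dist(agent[0], patient[0]) - dist(agent[-1], patient[-1])

def leftward(track: Track) -> float:
    """Score net leftward motion of a single track."""
    return track[0][0] - track[-1][0]

LEXICON: Dict[str, WordModel] = {"approached": approached,
                                 "leftward": leftward}

def sentence_score(words: List[Tuple[str, Tuple[str, ...]]],
                   tracks: Dict[str, Track]) -> float:
    """Compose a sentence score by summing the scores of its word
    models, each applied to the tracks filling its argument roles."""
    return sum(LEXICON[w](*(tracks[r] for r in roles))
               for w, roles in words)

# Toy query: "the person approached the chair, moving leftward"
tracks = {"person": [(10.0, 0.0), (6.0, 0.0), (2.0, 0.0)],
          "chair":  [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]}
sentence = [("approached", ("person", "chair")),
            ("leftward", ("person",))]
print(sentence_score(sentence, tracks))   # positive: the clip depicts it

In the work itself the coupling is tighter than this sketch suggests: detection, tracking, and the word models share one joint objective, so the sentence constrains what gets tracked rather than merely scoring fixed tracks, which is what enables retrieval of clips depicting a whole sentence.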

Degree

Ph.D.

Advisors

Jeffrey Mark Siskind, Purdue University.

Subject Area

Neurosciences | Robotics | Computer science
