Video Event Understanding with Pattern Theory

Keywords

pattern theory, video event understanding, human activity recognition, compositional approach

Abstract

We propose a combinatorial approach, built on Grenander's pattern theory, for generating semantic interpretations of video events of human activities. The basic units of representation, termed generators, are linked to one another through pairwise connections, termed bonds, that satisfy predefined relations. Different generators are specified for different levels, from (image) features at the bottom level to (human) actions at the highest, providing a rich representation of the items in a scene. The resulting configurations of connected generators provide scene interpretations; the inference goal is to parse given video data and generate high-probability configurations. The probabilistic structure is imposed using energies with contributions from both the data (classification scores) and prior information (ontological constraints and concept co-occurrence frequencies). The search for optimal configurations is based on an MCMC simulated-annealing algorithm that uses simple moves to propose configuration changes and accepts or rejects them according to the posterior energy. In contrast to current graphical methods, this framework does not preselect a neighborhood structure but instead infers it from the data. The framework can potentially handle clutter, i.e., objects and actions unrelated to the main activity, and can infer actions even when some components are unobserved. We evaluated the performance of our pattern-theoretic framework on video segments from the YouCook dataset and observed an overall improvement of more than 50% in recall and 100% in precision over a purely machine-learning-based approach.
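The inference step described above follows a standard Gibbs/Metropolis scheme: a configuration c of generators and bonds has posterior probability proportional to exp(-E(c)), where E(c) sums a data term (from classification scores) over generators and a prior term (from ontological constraints and co-occurrence frequencies) over bonds. The sketch below illustrates that search loop under stated assumptions; the Configuration objects, the energy attribute names, and the caller-supplied propose_move implementing the paper's "simple moves" are hypothetical stand-ins, not the authors' implementation.

import math
import random

def energy(config):
    # Posterior energy: data contributions from individual generators
    # (classification scores) plus prior contributions from bonds
    # (ontological/co-occurrence compatibilities). Attribute names are
    # illustrative assumptions.
    data_term = sum(g.classification_score_energy for g in config.generators)
    prior_term = sum(b.compatibility_energy for b in config.bonds)
    return data_term + prior_term

def anneal(initial_config, propose_move, n_iters=10000, t0=1.0, cooling=0.999):
    # Search for a low-energy (high-probability) configuration.
    # propose_move applies one simple move (e.g., add/remove a generator,
    # connect/break a bond) and returns a candidate configuration.
    current, current_e = initial_config, energy(initial_config)
    best, best_e = current, current_e
    temperature = t0
    for _ in range(n_iters):
        candidate = propose_move(current)
        candidate_e = energy(candidate)
        # Metropolis rule: always accept downhill moves; accept uphill
        # moves with probability exp(-dE / T), so early (hot) iterations
        # explore broadly and late (cold) iterations refine locally.
        delta = candidate_e - current_e
        if delta <= 0 or random.random() < math.exp(-delta / temperature):
            current, current_e = candidate, candidate_e
            if current_e < best_e:
                best, best_e = current, current_e
        temperature *= cooling
    return best, best_e

Passing the move-proposal function in as an argument mirrors the framework's key property that the neighborhood structure is not fixed in advance: the moves themselves decide which bonds may form or break as the search proceeds.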

Start Date

14-5-2015 6:05 PM

End Date

14-5-2015 6:30 PM

Session Number

04

Session Title

Theory
