Predicting Fixations From Deep and Low-Level Features

Keywords

saliency, deep learning, high-level, low-level

Abstract

Learning which properties of an image are associated with human gaze placement is important both for understanding how biological systems explore the environment and for computer vision applications. Recent advances in deep learning enable us, for the first time, to explain a significant portion of the information expressed in the spatial structure of fixations. Our saliency model DeepGaze II uses the VGG network (trained on object recognition in the ImageNet challenge) to convert an image into a high-dimensional feature space, which is then read out by a second, very simple network to yield a density prediction. DeepGaze II is currently the best-performing model for predicting fixations during free viewing of still images (MIT Saliency Benchmark, AUC and sAUC). By retraining on other datasets, we can explore how the features driving fixations change across tasks or over presentation time. Additionally, the modular architecture of DeepGaze II allows us to quantify how predictive particular features are of fixations. We demonstrate this by replacing the VGG network with very simple isotropic mean-luminance-contrast features, and end up with a network that outperforms all saliency models that predate the use of pretrained deep networks (including models with high-level features such as Judd or eDN). Using DeepGaze II and the mean-luminance-contrast model (MLC), we can separate how much low-level and high-level features contribute to fixation selection in different situations.
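
To make the modular architecture concrete, the sketch below shows one way a DeepGaze II-style model could be assembled in Python/PyTorch: a frozen VGG backbone provides the feature space, a few pointwise (1x1) convolutions act as the simple readout, and a Gaussian blur plus spatial softmax turn the result into a fixation density. This is a minimal illustration under assumed layer widths, VGG variant, and blur width; it is not the published implementation.

# Minimal sketch of a DeepGaze II-style architecture (illustrative only;
# layer widths, the VGG variant, and the blur width are assumptions,
# not the published configuration).
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19
from torchvision.transforms import GaussianBlur

class ReadoutSaliency(nn.Module):
    def __init__(self):
        super().__init__()
        # Frozen backbone: converts the image into a high-dimensional feature space.
        self.backbone = vgg19(weights="IMAGENET1K_V1").features.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Very simple readout: pointwise convolutions, so no new spatial
        # features are learned on top of the backbone.
        self.readout = nn.Sequential(
            nn.Conv2d(512, 16, kernel_size=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=1),
        )
        # Smoothing before normalisation.
        self.blur = GaussianBlur(kernel_size=25, sigma=5.0)

    def forward(self, image):
        # image: (B, 3, H, W), ImageNet-normalised.
        feats = self.backbone(image)              # (B, 512, H/32, W/32)
        raw = self.blur(self.readout(feats))      # (B, 1, h, w) unnormalised map
        b, _, h, w = raw.shape
        # Spatial softmax turns the map into a fixation density (log-probabilities).
        log_density = F.log_softmax(raw.flatten(1), dim=1)
        return log_density.view(b, 1, h, w)

Replacing the backbone with a small set of isotropic local mean-luminance and contrast maps, while keeping the readout unchanged, would give an MLC-style purely low-level model that can be compared against the deep-feature version within the same framework.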

Start Date

19-5-2017 10:32 AM

End Date

19-5-2017 10:54 AM

