Keywords

deep learning, segmentation

Abstract

Many animals and humans can recognize objects and segment them from their backgrounds. Whether object segmentation is necessary for object recognition has long been a topic of debate. Deep neural networks (DNNs) excel at object recognition but not at segmentation tasks, which has led to the belief that object recognition and segmentation are separate mechanisms in visual processing. Here, however, we show evidence that in variational autoencoders (VAEs), segmentation and faithful representation of the data can be interlinked. VAEs are encoder-decoder models that learn to represent independent generative factors of the data as a distribution in a very small bottleneck layer. Specifically, we show that VAEs can be made to segment objects without any additional fine-tuning or downstream training. This segmentation is achieved with a procedure that we call the latent space noise trick: by perturbing the activity of the bottleneck units with activity-independent noise, and recurrently recording and clustering the decoder outputs in response to these small changes, the model is able to segment and bind separate features together. We demonstrate that VAEs can group elements in a human-like fashion, are robust to occlusions, and produce illusory contours in simple stimuli. Furthermore, the model generalizes to the naturalistic setting of faces, producing meaningful subpart and figure-ground segmentation without ever having been trained on segmentation. For the first time, we show that learning to faithfully represent stimuli can be extended to segmentation using the same model backbone architecture without any additional training.
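The latent space noise trick described above can be illustrated with a minimal sketch. The code below is not the paper's implementation: it substitutes a tiny hand-built linear decoder (a hypothetical stand-in for a trained VAE decoder) whose two pixel groups depend on disjoint latent factors, and it replaces whatever clustering the authors use with a naive correlation-threshold grouping. It only demonstrates the core idea: perturb the bottleneck with activity-independent noise, record decoder outputs, and group pixels whose responses co-vary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained VAE decoder (assumption: the real decoder is a
# deep network; here it is linear so the example stays self-contained).
D_LATENT, D_PIXELS = 2, 16
W_dec = rng.normal(size=(D_PIXELS, D_LATENT))
W_dec[:8, 1] = 0.0   # "object A" pixels driven only by latent 0
W_dec[8:, 0] = 0.0   # "object B" pixels driven only by latent 1

def decode(z):
    return W_dec @ z

def latent_noise_trick(z0, n_samples=200, sigma=0.1, thresh=0.5):
    """Perturb the bottleneck activity with activity-independent Gaussian
    noise, record the decoder outputs, and group pixels whose responses
    co-fluctuate (a crude proxy for the clustering step)."""
    outputs = np.stack([
        decode(z0 + sigma * rng.normal(size=z0.shape))
        for _ in range(n_samples)
    ])                                        # shape (n_samples, D_PIXELS)
    deviations = outputs - outputs.mean(axis=0)
    corr = np.corrcoef(deviations.T)          # pixel-by-pixel correlation
    # Naive grouping: each unlabeled pixel founds a segment containing all
    # pixels strongly correlated with it.
    labels = -np.ones(D_PIXELS, dtype=int)
    next_label = 0
    for i in range(D_PIXELS):
        if labels[i] == -1:
            labels[np.abs(corr[i]) > thresh] = next_label
            next_label += 1
    return labels

z = rng.normal(size=D_LATENT)
segments = latent_noise_trick(z)
```

Because the two pixel groups are driven by independent latent dimensions, their responses to latent noise are uncorrelated, so the trick recovers the two segments without any segmentation training, mirroring the claim in the abstract at toy scale.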


How object segmentation and perceptual grouping emerge in noisy variational autoencoders
