Keywords

Anatomically inspired models, inverted face effect, log-polar transform

Abstract

Convolutional Neural Networks (CNNs) are currently the best models we have of the ventral temporal lobe – the part of cortex engaged in recognizing objects. They have been effective at predicting the firing rates of neurons in monkey cortex, as well as fMRI and MEG responses in human subjects. They are based on several observations concerning the visual world: 1) pixels are most correlated with nearby pixels, leading to local receptive fields; 2) image statistics are relatively stationary across the visual field, leading to replicated features; 3) objects do not change identity depending on their location in the image, leading to pooling of responses, making CNNs relatively translation invariant; and 4) objects are made of parts, leading to increasing receptive field sizes in deeper layers, so smaller parts are recognized in shallower layers and larger composites in deeper layers. However, compared to the primate visual system, there are two striking differences. CNNs have high resolution everywhere, whereas primates have a foveated retina: in humans, high resolution is confined to a region about the size of a thumbnail at arm's length, with a steep dropoff in resolution towards the periphery. Furthermore, the mapping from the visual field to V1 is a log-polar transform. This has two main advantages: a change in scale becomes a left-right translation, and a rotation in the image plane becomes a vertical translation. When log-polar-transformed images are given as input to a standard CNN, scale and rotation invariance are obtained. However, translation invariance is lost, which we make up for by moving our eyes about three times per second. We present results from a model with these constraints, and show that, despite its rotation invariance, the model captures the inverted face effect, while standard CNNs do not.
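The log-polar mapping described above is easy to sketch. The following is a minimal illustration, not the model presented in the talk, of how resampling an image onto a log-polar grid turns scaling about the fixation point into a left-right shift and in-plane rotation into a vertical shift; the grid sizes, the nearest-neighbor sampling, and the function name log_polar_transform are assumptions chosen for brevity.

# A minimal sketch (assumed, not from the talk) of log-polar resampling.
import numpy as np

def log_polar_transform(img, n_theta=64, n_rho=64):
    """Resample a 2D grayscale image onto a (theta, log-rho) grid centered
    on the image midpoint, using nearest-neighbor lookup.
    Rows index polar angle; columns index log-radius (eccentricity)."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    max_r = min(cy, cx)
    log_r = np.linspace(0.0, np.log(max_r), n_rho)               # log-spaced radii
    theta = np.linspace(0.0, 2 * np.pi, n_theta, endpoint=False)
    rr = np.exp(log_r)[None, :]                                  # shape (1, n_rho)
    tt = theta[:, None]                                          # shape (n_theta, 1)
    ys = np.clip(np.round(cy + rr * np.sin(tt)).astype(int), 0, h - 1)
    xs = np.clip(np.round(cx + rr * np.cos(tt)).astype(int), 0, w - 1)
    return img[ys, xs]                                           # shape (n_theta, n_rho)

# Scaling the input about fixation shifts this representation left or right
# (along the log-radius axis); rotating the input about fixation shifts it
# up or down cyclically (along the angle axis). A standard CNN trained on
# this representation therefore inherits approximate scale and rotation
# invariance, while losing translation invariance, as the abstract notes.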

Start Date

17-5-2024 10:30 AM

End Date

17-5-2024 11:30 AM

Location

UCSD

Euclidean Coordinates are the Wrong Prior for Models of Primate Vision

