Extreme Image Transformations Improve Latent Representations in Machines

Keywords

machine vision, human vision, image transforms, robust learning

Abstract

Shuffling pixels in an image helps machines learn a more robust object representation. To probe the strategies used by humans and machines for object recognition, we introduce Extreme Image Transformations (EITs). Machines rely heavily on exploiting low-level features like color and texture, so their performance degrades on out-of-distribution and adversarial inputs. Humans depend on high-level features like shapes and contours, making them relatively robust to image distortions. EITs systematically shuffle the pixels in an image, parameterized by grid size, shuffle probability, and binary block movement, distorting the structure of objects at both local and global levels. As shown in our prior work, humans maintain high accuracy on globally transformed images, while machines perform well on locally transformed images. Adversarial attacks further demonstrate the inability of machines to learn robust representations with popular vision datasets and training paradigms. To close this performance gap, we tested EIT-fine-tuned networks on images distorted with common corruptions (weather, Gaussian noise, etc.). We found that EIT-fine-tuned networks learn robust object representations, demonstrating improved performance under adversarial attacks, with strong activations in object regions. Fine-tuning with EITs forces networks to focus on essential object features rather than superficial cues, improving generalization. Our work quantifies differences in the strategies used by human and artificial vision, highlighting neuro-inspired techniques to enhance neural networks. Understanding the differences between biological and artificial vision can enable more human-like computer vision techniques.
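The grid-based shuffle described above can be illustrated compactly. The following NumPy sketch is a minimal interpretation of one EIT variant, assuming an H x W x C image whose sides divide evenly by the grid; the function name eit_shuffle and its parameters (grid_size, p_shuffle, move_blocks) are illustrative stand-ins for the abstract's grid size, shuffle probability, and binary block movement, not the exact interface from the paper.

import numpy as np

def eit_shuffle(image, grid_size=4, p_shuffle=0.5, move_blocks=False, rng=None):
    # Hypothetical EIT-style transform: divide the image into a grid of
    # blocks, then with probability p_shuffle either permute pixels inside
    # each block (local distortion) or swap whole blocks (global distortion).
    rng = np.random.default_rng() if rng is None else rng
    out = image.copy()
    h, w, c = image.shape
    bh, bw = h // grid_size, w // grid_size
    # Top-left corner of every grid cell.
    cells = [(r * bh, col * bw) for r in range(grid_size) for col in range(grid_size)]

    if move_blocks:
        # Binary block movement on: permute the positions of the selected blocks.
        chosen = [cell for cell in cells if rng.random() < p_shuffle]
        targets = list(chosen)
        rng.shuffle(targets)
        for (sy, sx), (ty, tx) in zip(chosen, targets):
            out[ty:ty + bh, tx:tx + bw] = image[sy:sy + bh, sx:sx + bw]
    else:
        # Block movement off: shuffle pixels independently within each selected block.
        for sy, sx in cells:
            if rng.random() < p_shuffle:
                block = out[sy:sy + bh, sx:sx + bw].reshape(-1, c)
                out[sy:sy + bh, sx:sx + bw] = rng.permutation(block).reshape(bh, bw, c)
    return out

Coarse grids with move_blocks=True rearrange large regions and so distort global structure, while fine grids with within-block pixel shuffling destroy local texture but leave the overall layout intact; sweeping these parameters traces the local-versus-global axis along which human and machine accuracies diverge.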

Start Date

17-5-2024 9:00 AM

End Date

17-5-2024 10:00 AM
