Trojan Attacks and Defenses on Deep Neural Networks

Yingqi Liu, Purdue University

Abstract

With the rapid spread of machine learning techniques, sharing and adopting publicly available deep neural networks has become very popular. Because deep neural networks are not intuitive for humans to understand, malicious behaviors can be injected into them without being detected. We call this a trojan attack, or backdoor attack, on neural networks. Trojaned models operate normally when regular inputs are provided, but misclassify to a specific output label when the input is stamped with a special pattern called the trojan trigger. Deploying trojaned models can cause severe consequences, including endangering human lives in applications such as autonomous driving.

Trojan attacks on deep neural networks introduce two challenges. From the attacker's perspective, since the training data and training process are usually not accessible, the attacker needs a way to carry out the trojan attack without access to training data. From the user's perspective, the user needs to quickly scan publicly shared deep neural networks and detect trojaned models. This dissertation addresses both challenges.

For trojan attacks without access to training data, we propose to invert the neural network to generate a general trojan trigger, and then retrain the model with reverse-engineered training data to inject malicious behaviors into the model. The malicious behaviors are activated only by inputs stamped with the trojan trigger. To scan for and detect trojaned models, we develop a novel technique that analyzes inner neuron behaviors by determining how output activations change when we introduce different levels of stimulation to a neuron. A trojan trigger is then reverse-engineered through an optimization procedure using the stimulation analysis results, to confirm that a neuron is truly compromised. Furthermore, for complex trojan attacks, we propose a novel complex-trigger detection method that leverages a symmetric feature differencing technique to distinguish the features of injected complex triggers from natural features. For trojan attacks on NLP models, we propose a novel backdoor scanning technique that transforms a subject model into an equivalent but differentiable form, inverts a distribution of words denoting their likelihood of appearing in the trigger, and applies a novel word discriminativity analysis to determine whether the subject model is particularly discriminative for the presence of the likely trigger words.
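
The trigger-generation step can be illustrated with a short sketch. The code below is a minimal, hedged example of inverting a network to craft a trigger patch that strongly excites a chosen internal neuron; the stand-in model, layer, channel index, trigger-region coordinates, and optimization settings are illustrative assumptions rather than the dissertation's exact configuration.

    # Minimal sketch: optimize a trigger patch that maximizes one internal neuron.
    # The ResNet-18 here is a stand-in; in practice the downloaded public model
    # under attack would be used instead.
    import torch
    import torchvision.models as models

    model = models.resnet18().eval()

    # Capture the activation of an internal layer via a forward hook.
    acts = {}
    def hook(module, inp, out):
        acts["feat"] = out
    model.layer4.register_forward_hook(hook)

    # Mask defining the trigger region (a small bottom-right patch here).
    mask = torch.zeros(1, 3, 224, 224)
    mask[:, :, 180:220, 180:220] = 1.0

    trigger = torch.rand(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([trigger], lr=0.1)
    target_channel = 42  # hypothetical neuron (channel) selected for the attack

    for _ in range(200):
        opt.zero_grad()
        model(trigger * mask)  # only pixels inside the mask are effective
        # Maximize the chosen neuron's activation so the trigger strongly excites it.
        loss = -acts["feat"][0, target_channel].mean()
        loss.backward()
        opt.step()
        with torch.no_grad():
            trigger.clamp_(0.0, 1.0)  # keep pixel values in a valid range
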
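The stimulation analysis used for scanning can be sketched in a similar way: a candidate neuron's activation is overwritten with increasingly large values and the effect on the model's predictions is recorded. The layer, channel, stimulation levels, and random stand-in inputs below are assumptions for illustration only.

    # Minimal sketch: sweep stimulation values for one neuron and watch the outputs.
    import torch
    import torchvision.models as models

    model = models.resnet18().eval()
    samples = torch.rand(8, 3, 224, 224)   # stand-in benign inputs

    layer = model.layer4
    channel = 42                            # hypothetical candidate neuron (channel)
    levels = [0.0, 5.0, 10.0, 20.0, 50.0]   # stimulation values to inject

    def stimulate(value):
        # Overwrite the candidate neuron's activation with a fixed value
        # and observe how the model's predictions respond.
        def hook(module, inp, out):
            out = out.clone()
            out[:, channel] = value
            return out
        handle = layer.register_forward_hook(hook)
        with torch.no_grad():
            logits = model(samples)
        handle.remove()
        return logits

    for v in levels:
        preds = stimulate(v).argmax(dim=1)
        # A neuron whose elevated activation pushes all samples toward one
        # label is a candidate compromised neuron.
        print(v, preds.tolist())
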
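For the NLP scanning technique, the core idea of inverting a word distribution can be sketched with a toy differentiable surrogate: a soft distribution over the vocabulary is optimized through the embedding matrix toward a target label. The vocabulary size, toy classifier, trigger length, and optimization settings below are hypothetical; a real subject model and the word discriminativity analysis are substantially more involved.

    # Minimal sketch: invert a distribution of likely trigger words for a toy model.
    import torch
    import torch.nn.functional as F

    vocab_size, emb_dim, num_classes, trigger_len = 1000, 64, 2, 3

    # Toy stand-ins for the subject NLP model's embedding and classifier.
    embedding = torch.nn.Embedding(vocab_size, emb_dim)
    classifier = torch.nn.Sequential(
        torch.nn.Flatten(),
        torch.nn.Linear(emb_dim * trigger_len, num_classes),
    )

    # Differentiable surrogate: a soft distribution over the vocabulary for
    # each trigger position, fed through the embedding matrix directly.
    logits_over_vocab = torch.zeros(trigger_len, vocab_size, requires_grad=True)
    opt = torch.optim.Adam([logits_over_vocab], lr=0.5)
    target_label = torch.tensor([1])    # hypothetical attack target label

    for _ in range(300):
        opt.zero_grad()
        word_dist = F.softmax(logits_over_vocab, dim=-1)   # likelihood of each word
        soft_embs = word_dist @ embedding.weight           # (trigger_len, emb_dim)
        out = classifier(soft_embs.unsqueeze(0))
        loss = F.cross_entropy(out, target_label)
        loss.backward()
        opt.step()

    # Words with the highest inverted likelihood are the trigger-word candidates.
    likely_trigger_words = F.softmax(logits_over_vocab, dim=-1).argmax(dim=-1)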

Degree

Ph.D.

Advisors

Zhang, Purdue University.

Subject Area

Artificial intelligence
