Extending Synthetic Data and Data Masking Procedures Using Information Theory
Abstract
The introduction of Industry 4.0 is accompanied by further integration of powerful AI/ML tools that allow researchers to perform large-scale analyses such as condition monitoring, data mining, anomaly detection, and automated control. However, the size and complexity of these tools often render them a ‘double-edged sword’: they require high volumes of training data, robust algorithms, and optimized parameters that are often expensive or infeasible to procure on a large scale. These limitations typically lead to a lack of model validation in the underlying methodologies (e.g., detrending and parameter selection), thereby limiting the utility of AI/ML. At the same time, the growing power of these tools presents a threat to industrial systems given a knowledgeable adversary; the risk of network intrusion is only magnified by the power and scope of these tools, and intrusion detection and data security tools must adapt in response to this unique and growing threat.
This thesis focuses on two vulnerabilities of the modern paradigm: the lack of algorithmic validation in synthetic data algorithms and the growing vulnerability of industrial data. The well-known mathematical tools introduced by Shannon’s Information Theory, and extensions thereof, provide powerful model-agnostic hedges that can be applied to address either problem. For instance, a typical challenge for synthetic data methodologies is overfitting, which is difficult to avoid consistently and leads to subpar analyses. The NEST algorithm proposed herein circumvents this issue by using an extension of entropy, via the proposed SER metric, to avoid overfitting and to generate synthetic data that outperform state-of-the-art methods such as GAN-based algorithms.
Information Theory can also be utilized to safeguard data from powerful AI/ML tools by virtue of inferential evaluation, which allows us to identify a safe and efficient obfuscation of the given data. State-of-the-art data masking techniques entail compromises in efficiency, data utility, and privacy guarantees that are impractical for industrial-scale proprietary systems; the proposed DIOD paradigm overcomes each of these issues by masking the data in an information-preserving manner well suited to various industrial tasks, data types, and downstream analyses, and it has shown a significant advantage over similar techniques.
Degree
M.Sc.
Advisors
Abdel-Khalik, Purdue University.
Subject Area
Artificial intelligence