Privacy-preserving social network analysis
Abstract
Data privacy in social networks is a growing concern that threatens to limit access to important information contained in these data structures. Analysis of the graph structure of social networks can provide valuable information for revenue generation and social science research, but unfortunately, ensuring this analysis does not violate individual privacy is difficult. Simply removing obvious identifiers from graphs or even releasing only aggregate results of analysis may not provide sufficient protection. Differential privacy is an alternative privacy model, popular in data-mining over tabular data, that uses noise to obscure individuals' contributions to aggregate results and offers a strong mathematical guarantee that individuals' presence in the data-set is hidden. Analyses that were previously vulnerable to identification of individuals and extraction of private data may be safely released under differential-privacy guarantees. However, existing adaptations of differential privacy to social network analysis are often complex and have considerable impact on the utility of the results, making it less likely that they will see widespread adoption in the social network analysis world. In fact, social scientists still often use the weakest form of privacy protection, simple anonymization, in their social network analysis publications. We review the existing work in graph-privatization, including the two existing standards for adapting differential privacy to network data. We then propose contributor-privacy and partition-privacy , novel standards for differential privacy over network data, and introduce simple, powerful private algorithms using these standards for common network analysis techniques that were infeasible to privatize under previous differential privacy standards. We also ensure that privatized social network analysis does not violate the level of rigor required in social science research, by proposing a method of determining statistical significance for paired samples under differential privacy using the Wilcoxon Signed-Rank Test, which is appropriate for non-normally distributed data. Finally, we return to formally consider the case where differential privacy is not applied to data. Naive, deterministic approaches to privacy protection, including anonymization and aggregation of data, are often used in real world practice. De-anonymization research demonstrates that some naive approaches to privacy are highly vulnerable to reidentification attacks, and none of these approaches offer the robust guarantee of differential privacy. However, we propose that these methods fall across a range of protection: Some are better than others. In cases where adding noise to data is especially problematic, or acceptance and adoption of differential privacy is especially slow, it is critical to have a formal understanding of the alternatives. We define De Facto Privacy, a metric for comparing the relative privacy protection provided by deterministic approaches.
Degree
Ph.D.
Advisors
Clifton, Purdue University.
Subject Area
Computer science
Off-Campus Purdue Users:
To access this dissertation, please log in to our
proxy server.