Date of Award
12-2017
Degree Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Computer Science
Committee Chair
Mohammad Al Hasan
Committee Co-Chair
Christopher W. Clifton
Committee Member 1
Dan Goldwasser
Committee Member 2
Xia Ning
Committee Member 3
Ninghui Li
Abstract
In the real world, our DNA is unique, but many people share names. This phenomenon often causes documents belonging to multiple namesakes to be erroneously aggregated under a single identity. Such mistakes degrade the performance of document retrieval and web search and, more seriously, cause improper attribution of credit or blame in digital forensics. To resolve this issue, the name disambiguation task is designed to partition the documents associated with a name reference such that each partition contains documents pertaining to a unique real-life person. Existing algorithms for this task mainly suffer from the following drawbacks. First, most existing solutions rely heavily on feature engineering, such as biographical feature extraction or the construction of auxiliary features from Wikipedia. However, in many scenarios such features may be costly to obtain, or unavailable in privacy-sensitive domains. Instead, we solve the name disambiguation task in a restricted setting by leveraging only relational data in the form of anonymized graphs. Second, most existing works for this task operate in batch mode, where all records to be disambiguated are initially available to the algorithm. However, more realistic settings require that name disambiguation be performed in an online streaming fashion in order to identify records of new ambiguous entities that have no preexisting records. Finally, we investigate the potential disclosure risk of textual features used in name disambiguation and propose several algorithms to tackle the task in a privacy-aware scenario. In summary, in this dissertation we present a number of novel approaches that address the name disambiguation task from each of these three aspects independently, namely relational, streaming, and privacy-preserving textual data.
Recommended Citation
Zhang, Baichuan, "Towards Name Disambiguation: Relational, Streaming, and Privacy-Preserving Text Data" (2017). Open Access Dissertations. 1671.
https://docs.lib.purdue.edu/open_access_dissertations/1671