Probabilistic approaches to entity retrieval

Yi Fang, Purdue University

Abstract

Entity retrieval has emerged as a nascent Information Retrieval (IR) research area that aims to satisfy increasingly sophisticated user information needs beyond document retrieval. One specialized type of entity retrieval is expert search, which has attracted much attention in the IR community since the launch of the TREC Enterprise track in 2005. This dissertation proposes formal probabilistic approaches to entity retrieval. In particular, we present a discriminative learning framework for expert search. Specific discriminative models can be derived from the framework by instantiating the parametric conditional probability of relevance given the expert and query pair. Based on the framework, we design discriminative models for two scenarios: one where the document-candidate associations are ambiguous (i.e., the TREC setting) and one where they are unambiguous. Within the setting of unambiguous associations, we study scenarios where expertise information comes from heterogeneous sources. Moreover, we explore pairwise discriminative models that learn expert ranking functions from implicit user feedback. We demonstrate the advantages of discriminative models over generative models. In addition to formal theoretical models, we present a real-world expert search system for academic institutions: the INdiana Database of University Research Expertise (INDURE). The major components of INDURE are analyzed and discussed along with the underlying rationale and design decisions. One important component of such systems, homepage discovery, is investigated in detail. Specifically, we propose a discriminative probabilistic model that captures the dependence among all the candidate homepages. We also present an analysis of the INDURE query log to understand the special characteristics of expert search usage.
Beyond expert search, we investigate the general entity retrieval tasks defined by the TREC Entity track. We propose unified generative probabilistic models to formalize the process of entity retrieval. The proposed models incorporate entity relevance, type estimation, type matching, entity prior, and entity co-occurrence into a holistic probabilistic framework. The proposed unified probabilistic approach is also applied to semi-structured semantic data, which is increasingly available online. We extensively evaluate the proposed models across various settings and conduct a systematic analysis of the experimental results. We demonstrate that our proposed probabilistic approaches are robust and deliver very competitive performance on entity retrieval.

Degree

Ph.D.

Advisors

Si, Purdue University.

Subject Area

Computer science
