A learning approach for relevance and diversity in federated search

Dung T Hong, Purdue University


Federated search allows multiple information sources to be searched simultaneously. It is highly useful in the hidden web environment, which consists of documents that traditional search engines such as Google or Bing have difficulty reaching. Because the hidden web is huge and contains valuable information, federated search is challenged to look beyond the traditional search model in order to return a comprehensive list of documents from a limited number of information sources while balancing document relevance and novelty. Existing research on federated search mostly focuses on document relevance using unsupervised and semi-supervised algorithms, with limited work on machine learning approaches to major problems such as resource selection and result merging. Furthermore, as modern information retrieval has gradually evolved to provide more utility in the final ranked list presented to users, the balance between relevance and coverage of different query aspects has become more important. Yet few learning models have been studied for relevance and diversity in federated search. This thesis focuses on relevance and diversity in two major stages of federated search (resource selection and result merging) with a novel machine learning approach. New algorithms include a probabilistic joint model and a query expansion method that are more effective in selecting relevant information sources. We also propose a learning approach for better merging of documents from individual sources, and tackle the problem of balancing relevance and novelty in resource selection by combining multiple diversification methods using a set of training queries. The effectiveness of the new research is demonstrated by an extensive set of experiments over multiple datasets.




Advisor: Si, Purdue University.

Subject Area

Computer Engineering|Computer science
