Dealing with ambiguous and partial supervision in complex information retrieval applications

Dan Zhang, Purdue University

Abstract

The huge amount of information available on the Internet and in other digital repositories has made it difficult for people to obtain and organize information. This demands effective information retrieval solutions of several kinds. However, many information retrieval applications that rely on learning techniques involve ambiguous information. In this thesis, I focus on three important problems in this area: the label ambiguity problem, the label incompleteness problem, and several ambiguity problems in social media. The label ambiguity problem arises when users are interested in only parts of each object. For example, in text mining, it is highly likely that only some parts of each webpage are related to a given topic or label. In this case, if each webpage is represented by a single feature vector, the features of the relevant parts may be buried by those of the irrelevant parts. The label incompleteness problem arises when only an incomplete list of labels is available in the training set. For example, in webpage classification for sports, it may be possible to obtain a partial list of common classes, such as football, cycling, and swimming, but webpages on new topics can appear as data arrives. Ambiguity then arises because the learned classifier is normally not applicable to the new categories. In social media, information such as sentiment toward a product spreads through complex social networks, and ambiguity exists in this scenario as well. For example, microblog sites like Twitter limit the length of each post, and this limit introduces a degree of ambiguity into the learning problem. Although only ambiguous supervision is available in these three problems, the ambiguity can be alleviated by leveraging valuable partial knowledge. For example, in the label ambiguity problem of text mining, structural information such as link relationships may exist. These link relationships (partial knowledge) can be used to help design more accurate classifiers. In this thesis, I investigate how to utilize different kinds of partial knowledge to alleviate the problem of ambiguity in different scenarios.

Degree

Ph.D.

Advisors

Si, Purdue University.

Subject Area

Computer science
