Overcoming uncertainty for within-network relational machine learning

Joseph J. Pfeiffer, Purdue University

Abstract

People increasingly communicate through email and social networks to maintain friendships and conduct business, as well as to share online content such as pictures, videos, and products. Relational machine learning (RML) uses observed attributes and network structure to predict corresponding labels for items; for example, to identify individuals engaged in securities fraud, we can use phone calls and workplace information to make joint predictions over the individuals. However, in large-scale and partially observed network domains, missing labels and edges can significantly degrade standard relational machine learning methods by introducing bias into the learning and inference processes. In this thesis, we identify the effects on parameter estimation, correct the biases, and model the uncertainty of the missing data to improve predictive performance. We investigate these issues across a variety of modeling scenarios and prediction problems.

First, we introduce the Transitive Chung Lu (TCL) random graph model for modeling the conditional distribution of edges given a partially observed network. TCL fits within a class of generative graph models with scalable sampling processes, which we generalize to model distributions of networks with correlated attribute variables via Attributed Graph Models.

Second, we use TCL to incorporate edge probabilities into relational learning and inference models for partially observed network domains. As part of this work, we give a linear-time algorithm to perform variational inference over the squared network (i.e., over all possible edges). We apply the resulting semi-supervised model, Probabilistic Relational EM (PR-EM), to the Active Exploration domain to iteratively locate positive examples in partially observed networks. Due to the sampling process, this domain exhibits extreme bias during learning and inference; we show that PR-EM achieves high accuracy despite this difficulty.

Third, we investigate the performance of Relational EM methods for semi-supervised relational learning in partially labeled networks and find that their fixed-point estimates incur considerable approximation error during learning and inference. To address this, we propose two stochastic alternatives, Relational Stochastic EM and Relational Data Augmentation, and demonstrate that both improve over the Relational EM method.

Fourth, we improve on existing semi-supervised learning methods by imposing hard constraints on the inference steps, allowing these methods to learn from better approximations over partially labeled networks. In particular, we find that imposing a Maximum Entropy constraint during collective inference corrects for errors in the approximate parameter learning, which in turn lets us use a better approximation over the unlabeled data. We also prove that, for a given allowable error, this constraint adds only constant overhead to the original collective inference method.

Overall, the methods presented in this thesis have provable subquadratic runtimes, and we demonstrate each on large-scale networks, in some cases with millions of vertices and/or edges. Across all of these approaches, we show that incorporating the uncertainty of the missing data into the modeling process improves predictive performance.
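To make the TCL sampling idea above concrete, the following is a minimal Python sketch of Chung Lu style edge generation with a transitive, triangle-closing step. The function name sample_tcl_edges, the parameter rho, and the interface are illustrative assumptions, not the dissertation's implementation.

    import random

    def sample_tcl_edges(expected_degrees, num_edges, rho):
        # Sketch of Transitive Chung Lu (TCL) style sampling. 'rho' is an
        # assumed name for the probability of closing a triangle instead of
        # drawing an independent Chung Lu endpoint. Assumes num_edges is
        # feasible for the given vertex set.
        vertices = list(expected_degrees)
        weights = [expected_degrees[v] for v in vertices]
        adj = {v: set() for v in vertices}
        edges = set()
        while len(edges) < num_edges:
            # Chung Lu step: endpoint probability proportional to degree.
            u = random.choices(vertices, weights=weights)[0]
            if random.random() < rho and adj[u]:
                # Transitive step: walk two hops from u to close a triangle.
                w = random.choice(list(adj[u]))
                candidates = list(adj[w] - {u})
                if not candidates:
                    continue
                v = random.choice(candidates)
            else:
                v = random.choices(vertices, weights=weights)[0]
            if u != v:
                edges.add(frozenset((u, v)))
                adj[u].add(v)
                adj[v].add(u)
        return edges

For example, sample_tcl_edges({'a': 3, 'b': 3, 'c': 2, 'd': 2}, num_edges=4, rho=0.3) draws endpoints in proportion to expected degree, with rho controlling how often edges close triangles and hence the clustering of the sampled graph.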
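Similarly, one way to read the Maximum Entropy constraint in the fourth contribution is as forcing the proportion of positive predictions produced by collective inference to match the label proportion observed in the labeled data. The sketch below implements that reading; the function name and the exact form of the constraint are assumptions, not the dissertation's formulation.

    import numpy as np

    def constrain_to_prior(scores, positive_prior):
        # Keep the ranking produced by collective inference, but shift the
        # decision threshold so the fraction of positives matches the prior
        # estimated from the labeled data.
        scores = np.asarray(scores, dtype=float)
        k = int(round(positive_prior * scores.size))
        if k <= 0:
            return np.zeros_like(scores, dtype=bool)
        cutoff = np.partition(scores, -k)[-k]  # k-th largest score
        return scores >= cutoff

Because the threshold can be found with a single selection pass (or, within an allowable error, a fixed number of bisection steps), a correction of this form adds only bounded work per inference pass, consistent with the constant-overhead claim in the abstract.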

Degree

Ph.D.

Advisors

Jennifer Neville, Purdue University.

Subject Area

Statistics | Computer science
