Scalable Bayesian Nonparametrics and Sparse Learning for Hidden Relationship Discovery

Shandian Zhe, Purdue University

Abstract

Real-world data often encompass hidden relationships, such as interactions between the modes of multidimensional arrays (or tensors), subsets of features correlated with specific responses, and associations between heterogeneous data sources. Uncovering these relationships is a key problem in machine learning and data mining, with applications ranging from information security to imaging genetics and computational advertising. Mining these relationships, however, poses significant challenges. First, how can we design models powerful enough to capture the complicated, potentially highly nonlinear patterns in data? Second, how can we develop efficient model estimation algorithms that handle real-world data volumes, say, millions of features and billions of tensor elements? In this dissertation, we address these challenges using Bayesian learning techniques. Compared with other methodologies, Bayesian learning has a unique advantage: it provides a principled, interpretable mathematical framework for data modeling and reasoning under uncertainty. We use two families of Bayesian approaches, namely Bayesian nonparametrics and sparse learning, to uncover the fundamental relationships hidden in data: the interactive relationships between multiple entities within tensors, where each mode represents a particular type of entity, e.g., a three-mode (user, movie, music) tensor, and the correlated relationships between features and responses in high-dimensional and multiview data. Bayesian nonparametrics allow the number of model parameters to grow with the data and can therefore automatically adapt to the complexity of the data patterns, making them well suited to capturing complicated, possibly highly nonlinear interactions.
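To make the tensor setting concrete, the following is a minimal sketch (with hypothetical mode sizes and rank, not taken from the dissertation) of a nonlinear tensor model in the spirit described above: each entity in each mode gets a latent factor vector, and an entry's value is a nonlinear function of the concatenated factors, here modeled with a Gaussian-process prior under an RBF kernel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 3-mode tensor and latent rank r.
n1, n2, n3, r = 5, 4, 3, 2
U = [rng.normal(size=(n, r)) for n in (n1, n2, n3)]  # latent factors per mode

# GP input for entry (i, j, k): concatenation of its three factor vectors.
idx = [(i, j, k) for i in range(n1) for j in range(n2) for k in range(n3)]
X = np.array([np.concatenate([U[0][i], U[1][j], U[2][k]]) for i, j, k in idx])

# RBF kernel over the concatenated factors; a single draw from the GP
# prior yields entry values that interact nonlinearly across modes.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists)
y = rng.multivariate_normal(np.zeros(len(X)), K + 1e-6 * np.eye(len(X)))
T = y.reshape(n1, n2, n3)  # the sampled tensor
```

A multilinear model (e.g., CP decomposition) would instead take an inner product of the factors; replacing that fixed form with a GP-distributed function is what lets the model capture nonlinear mode interactions.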
Bayesian sparse learning filters out noise and identifies useful, succinct patterns in data, and is therefore particularly suitable for discovering correlated relationships, which are often sparse. To address the computational challenges of large-scale applications, we explore various means, such as divide-and-conquer modeling, local computation, variational transformations, and factorized approximations, to obtain decomposable mathematical structures in the learning objective functions. Building on these, we develop efficient parallel and online model estimation algorithms for real-world large-scale data. Specifically, we first design Bayesian nonparametric factorization models, based on Gaussian processes and Dirichlet processes, to capture the nonlinear interactive relationships underlying tensor data and to further discover hidden clusters within tensor modes. We develop a scalable online inference algorithm for a single machine, as well as highly efficient parallel inference algorithms for Hadoop and Spark clusters, and demonstrate substantial accuracy gains over traditional methods on tensor completion tasks with billions of entries. Second, based on the spike-and-slab prior, we develop an online Bayesian sparse learning algorithm that identifies subsets of features correlated with responses of interest in large-scale, high-dimensional data with millions of samples and features. We demonstrate its significant advantages over competing state-of-the-art approaches in large-scale applications, including text classification and click-through-rate prediction for online advertising. Finally, to capture the cross correlations between features from heterogeneous data views, we combine spike-and-slab priors and Gaussian processes in a sparse multiview learning model, and show its successful application to association discovery and diagnosis on data from an Alzheimer’s disease study.
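As a minimal illustration of the spike-and-slab construction mentioned above (dimensions and hyperparameters are hypothetical, not from the dissertation): each feature weight is the product of a Bernoulli inclusion indicator (the "spike" at zero) and a Gaussian draw (the "slab"), so a generative sample is sparse by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hyperparameters for illustration only.
n_features = 1000
pi = 0.01        # prior inclusion probability: most features are irrelevant
slab_var = 1.0   # variance of the "slab" Gaussian for selected features

# Spike-and-slab draw: binary selectors, then Gaussian weights
# only for the selected ("slab") features; the rest stay exactly zero.
s = rng.random(n_features) < pi
w = np.where(s, rng.normal(0.0, np.sqrt(slab_var), n_features), 0.0)
```

Posterior inference over the indicators `s` then yields, per feature, a probability of being relevant to the response, which is what makes the selected subset interpretable.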

Degree

Ph.D.

Advisors

Qi, Purdue University.

Subject Area

Computer science
