Efficient and Scalable Subgraph Statistics Using Regenerative Markov Chain Monte Carlo

Mayank Kakodkar, Purdue University

Abstract

In recent years, there has been growing interest in data mining and graph machine learning in techniques that can estimate the frequencies of k-node Connected Induced Subgraphs (k-CIS) contained in large real-world graphs. While recent work has shown that 5-CISs can be counted exactly, no exact polynomial-time algorithms are known for k > 5. Sampling-based algorithms that work well in moderately sized graphs for k ≤ 8 have been proposed in the past. In this thesis I push this boundary to k ≤ 16 for graphs containing up to 120M edges, and to k ≤ 25 for smaller graphs containing between 1M and 20M edges. I do so by re-imagining two older but elegant and memory-efficient algorithms, FANMOD and PSRW, which have large estimation errors by modern standards: FANMOD produces highly correlated k-CIS samples, and the cost of sampling transitions of the PSRW Markov chain becomes prohibitively expensive for k > 8. In this thesis, I introduce:

(a) RTS: a novel regenerative Markov chain Monte Carlo (MCMC) sampling procedure on the tree generated on the fly by the FANMOD algorithm. RTS runs on multiple cores and multiple machines (it is embarrassingly parallel) and computes confidence intervals for its estimates, all while preserving the memory-efficient nature of FANMOD. RTS is thus able to estimate subgraph statistics for k ≤ 16 in larger graphs containing up to 120M edges, and for k ≤ 25 in smaller graphs containing between 1M and 20M edges.

(b) R-PSRW: which scales the PSRW algorithm to larger CIS sizes by using a rejection sampling procedure to efficiently sample transitions of the PSRW Markov chain. R-PSRW matches RTS in scaling to larger CIS sizes.

(c) Ripple: which achieves unprecedented scalability by stratifying the R-PSRW Markov chain state space into ordered strata via a new technique that I call sequential stratified regeneration.
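To make the state space concrete: the chains above walk over sets of k nodes whose induced subgraph is connected. The sketch below is a deliberately naive illustrative chain (not the thesis's RTS, R-PSRW, or Ripple, and not uniform over k-CISs): it proposes swapping one node of the current set for a node adjacent to the set, and rejects moves that disconnect the induced subgraph. All names here are hypothetical.

```python
import random

def is_connected(nodes, adj):
    """BFS check that the subgraph induced by `nodes` is connected."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes

def random_walk_cis(adj, k, steps, seed=0):
    """Naive MCMC over k-node connected induced subgraphs (k-CIS).
    Assumes `adj` is a connected graph with more than k nodes."""
    rng = random.Random(seed)
    # Grow an initial connected k-node set greedily.
    state = {next(iter(adj))}
    while len(state) < k:
        frontier = {v for u in state for v in adj[u]} - state
        state.add(rng.choice(sorted(frontier)))
    samples = []
    for _ in range(steps):
        out = rng.choice(sorted(state))                     # node to drop
        frontier = {v for u in state for v in adj[u]} - state
        if frontier:
            cand = (state - {out}) | {rng.choice(sorted(frontier))}
            if is_connected(cand, adj):                     # reject if disconnected
                state = cand
        samples.append(frozenset(state))
    return samples
```

A real estimator must correct for this chain's non-uniform stationary distribution; the thesis's regenerative methods additionally split the walk into independent tours to obtain confidence intervals.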
I show that the Ripple estimator is consistent, highly parallelizable, and scales well. Ripple is able to count CISs of size up to k ≤ 12 in real-world graphs containing up to 120M edges. My empirical results show that the proposed methods offer a considerable improvement over the state of the art. Moreover, my methods are able to run at a scale that had been considered unreachable until now, not only by prior MCMC-based methods but also by other sampling approaches.

Optimization of Restricted Boltzmann Machines. In addition, I propose a regenerative transformation of MCMC samplers for Restricted Boltzmann Machines (RBMs). My approach, Markov Chain Las Vegas (MCLV), gives statistical guarantees in exchange for random running times. MCLV uses a stopping set built from the training data and a maximum number of Markov chain steps K (referred to as MCLV-K). I present an MCLV-K gradient estimator (LVS-K) for RBMs and explore the correspondences and differences between LVS-K and Contrastive Divergence (CD-K). LVS-K significantly outperforms CD-K in the task of training RBMs on the MNIST dataset, indicating that MCLV is a promising direction for learning generative models.
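For context on the baseline being compared against: CD-K estimates the RBM gradient by running K steps of block Gibbs sampling from a training example, whereas LVS-K instead stops the chain at a random time when it re-enters a stopping set (capped at K). The sketch below is the standard textbook CD-K update for a Bernoulli RBM, shown only for contrast; the shapes and function names are illustrative assumptions, not the thesis's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k(W, b, c, v0, k, rng):
    """One CD-K gradient estimate for a Bernoulli RBM with weight matrix
    W (num_visible x num_hidden), visible bias b, and hidden bias c,
    starting the Gibbs chain at the training example v0."""
    ph0 = sigmoid(v0 @ W + c)               # P(h = 1 | v0), positive phase
    v = v0.copy()
    for _ in range(k):                      # k steps of block Gibbs sampling
        h = (rng.random(c.shape[0]) < sigmoid(v @ W + c)).astype(float)
        v = (rng.random(b.shape[0]) < sigmoid(h @ W.T + b)).astype(float)
    phk = sigmoid(v @ W + c)                # P(h = 1 | v_k), negative phase
    # Gradient estimate: positive phase minus negative phase.
    dW = np.outer(v0, ph0) - np.outer(v, phk)
    db = v0 - v
    dc = ph0 - phk
    return dW, db, dc
```

CD-K's bias comes from truncating the chain at a fixed K; the Las Vegas approach trades that fixed cost for a random stopping time with statistical guarantees.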

Degree

Ph.D.

Advisors

Ribeiro, Purdue University.

Subject Area

Artificial intelligence|Operations research

