Enumerating K-Cliques in a Large Network Using Apache Spark
Network analysis is an important research task that explains the relationships among the entities in a given domain. Most existing approaches to network analysis compute global properties of a network, such as transitivity, diameter, and all-pair shortest paths. They also study non-random properties of a network, such as graph densification with shrinking diameter, small diameter, and scale-freeness. Such approaches help us understand real-life networks through their global properties. However, discovering the local topological building blocks within a network is also an important task; examples include clique enumeration, graphlet counting, and motif counting. In this paper, we focus on finding an efficient solution to the k-clique enumeration problem. A clique is a small, connected, and complete induced subgraph of a large network. Enumerating cliques with sequential techniques is very time-consuming. A promising alternative that is being adopted is to run on distributed clusters of machines using the Hadoop MapReduce framework. However, that solution suffers from a general limitation of the framework: Hadoop MapReduce performs substantial amounts of reading from and writing to disk, so the running times of Hadoop-based approaches suffer enormously. To avoid these problems, we propose an efficient, scalable, and distributed solution, KC-SPARK, for enumerating cliques in real-life networks using the Apache Spark in-memory cluster computing framework. Experimental results show that KC-SPARK can enumerate k-cliques from very large real-life networks, whereas a single commodity machine cannot produce the same result in a feasible amount of time. We also compared KC-SPARK with Hadoop MapReduce solutions and found the algorithm to be 80–100 percent faster in terms of running time.
In a separate comparison on triangle enumeration, the results show that KC-SPARK is 8–10 times faster than a Hadoop MapReduce implementation on the same cluster setup. Furthermore, the overall performance of KC-SPARK is improved by using Spark's built-in caching and broadcast mechanisms.
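To make the k-clique enumeration problem concrete, the following is a minimal single-machine sketch in Python. It is not the KC-SPARK algorithm, which the abstract does not spell out; the function name `enumerate_k_cliques`, the edge-list input format, and the vertex-ordering trick used to avoid duplicates are all assumptions chosen for illustration.

```python
def enumerate_k_cliques(edges, k):
    """Return all k-cliques of an undirected graph given as (u, v) pairs.

    Illustrative sequential sketch only; names and input format are
    hypothetical and do not come from the dissertation.
    """
    # Build an undirected adjacency map, ignoring self-loops.
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Keep only neighbors with a larger label, so that every clique is
    # generated exactly once, in increasing vertex order.
    higher = {u: {v for v in nbrs if v > u} for u, nbrs in adj.items()}

    cliques = []

    def extend(clique, candidates):
        # `candidates` holds vertices adjacent to every vertex in `clique`.
        if len(clique) == k:
            cliques.append(tuple(clique))
            return
        for v in sorted(candidates):
            extend(clique + [v], candidates & higher[v])

    for u in sorted(adj):
        extend([u], higher[u])
    return cliques


# A complete graph on vertices 1..4 plus one pendant edge (4, 5).
sample_edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5)]
triangles = enumerate_k_cliques(sample_edges, 3)
```

On this sample graph, the triangles (k = 3) are exactly the four 3-vertex subsets of {1, 2, 3, 4}; the pendant vertex 5 belongs to none of them. The candidate set shrinks with every extension step, which is the property a distributed variant would presumably want to exploit when partitioning work.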
Hasan, Purdue University.