Computing environment for the statistical analysis of large and complex data

Saptarshi Guha, Purdue University

Abstract

Analyzing large data sets has become feasible with recent advances in technology. Data acquisition has become fine-grained and spans many scenarios, from home power consumption to network traffic. Such data sets, which can record measurements at the level of seconds, quickly become massive, and storing them is now possible because of the rapid fall in hardware prices. It has become the statistician's challenge to analyze such massive data sets with the same level of comprehensive detail as is possible for much smaller analyses. Any detailed analysis of such data sets necessarily creates many subsets and many more data structures, and we need approaches to store and compute with them that take advantage of modern technology such as distributed compute clusters. This forms the backdrop to the three chapters of the thesis: a visualization database for large data sets, a keystroke detection algorithm derived from analyzing hundreds of gigabytes of network data, and a merger of the R and Hadoop programming environments that enables the work of the first two chapters.

Comprehensive visualization that preserves the information in the data requires a visualization database (VDB): many displays, some with many pages, and with one or more panels per page. A single display results from partitioning the data into subsets and using the same method to display each subset in a sample of subsets, typically one per panel. The analyst's time to produce a display is not increased by choosing a large subset over a small one, and not every page needs to be studied; some displays might be studied in their entirety, while for others studying only a small fraction of the pages might suffice. On-the-fly computation without storage does not generally succeed because computation time is too large. The sizes and numbers of displays in VDBs require a rethinking of all areas involved in data visualization, including the following: methods of display design that enhance pattern perception to enable rapid page scanning; automation algorithms for basic display elements such as the aspect ratio, scales across panels, line types and widths, and symbol types and sizes; methods for subset sampling; and viewers designed for multi-panel, multi-page displays that scale across different amounts of physical screen area.

One example of such a detailed analysis of hundreds of gigabytes of data is the keystroke detection algorithm, a streaming algorithm that detects SSH client keystroke packets in any TCP connection. The input data are timestamps and TCP-IP header fields of packets in both directions, measured at a monitor on the path between the hosts. The algorithm uses the packet dynamics just preceding and following a client packet with data to classify the packet as a keystroke or non-keystroke. The dynamics are described by classification variables derived from the arrival timestamps and the packet data sizes, sequence numbers, acknowledgement numbers, and flags. The algorithm succeeds because a keystroke creates an identifiable dynamical pattern. One application is identification of any TCP connection as an SSH interactive session, allowing detection of backdoor SSH servers. More generally, the algorithm demonstrates the potential of detailed packet dynamics for classifying connections. The above analysis of network data would be extremely unwieldy, if not impossible, without distributed file systems and computing frameworks.
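
Purely as an illustration of the kind of per-packet dynamical features described above, the following R sketch derives a few simple quantities around each candidate client data packet: the inter-arrival gaps on either side, the data size, and whether the following packet is a bare server acknowledgement. The column names and the features themselves are hypothetical; the thesis's actual classification variables and decision rule are more elaborate.

    # Hypothetical sketch: derive packet-dynamics features around each client
    # data packet from a data frame of header fields and timestamps.
    # Assumed columns: time (seconds), dir ("client" or "server"), datasize (bytes).
    keystroke_features <- function(pkts) {
      pkts <- pkts[order(pkts$time), ]
      n    <- nrow(pkts)
      cand <- which(pkts$dir == "client" & pkts$datasize > 0)  # candidate keystroke packets
      prev <- pmax(cand - 1, 1)                                # index of preceding packet
      nxt  <- pmin(cand + 1, n)                                # index of following packet
      data.frame(
        packet      = cand,
        datasize    = pkts$datasize[cand],
        gap_before  = pkts$time[cand] - pkts$time[prev],       # time since preceding packet
        gap_after   = pkts$time[nxt]  - pkts$time[cand],       # time until following packet
        next_is_ack = pkts$dir[nxt] == "server" & pkts$datasize[nxt] == 0  # followed by a bare ACK?
      )
    }

A classifier would then be applied to features of this kind for each candidate packet in a connection.
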
RHIPE is a software system that integrates the R open source project for statistical computing and visualization with the Apache Hadoop Distributed File System (HDFS) and Hadoop MapReduce, the software framework for distributed processing of massive data sets across a cluster. Distributed programming with massive data sets is by nature complex: issues such as data storage, scheduling, and fault tolerance must all be handled. RHIPE uses its tight integration with the HDFS to store data across the cluster. Similarly, it takes advantage of MapReduce to efficiently utilize all the processing cores of the cluster. Vital but difficult-to-implement details, such as task scheduling, bandwidth optimization, and recovery from failing computers, are handled by Hadoop MapReduce. Most importantly, RHIPE hides these details from the R user by providing an idiomatic R interface to the Hadoop MapReduce and HDFS cluster. The design of RHIPE strives for a balance among conceptual simplicity, ease of use, and flexibility. Algorithms designed for the MapReduce programming model can be implemented in the R language and executed from R's REPL (read-eval-print loop), with the results returned directly to the user.
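
As a minimal sketch of what such a job might look like from the R prompt, the following computes per-group means with RHIPE-style map and reduce expressions. The function names (rhinit, rhcollect, rhmr, rhex, rhread) follow RHIPE's interface of this period, but the exact signatures, the inout argument, and the HDFS paths used here are illustrative assumptions, not a definitive example from the thesis.

    library(Rhipe)  # assumed package name
    rhinit()        # initialize RHIPE's connection to the Hadoop cluster

    # Map expression: RHIPE supplies the current block of input pairs as
    # map.keys and map.values; rhcollect() emits intermediate key/value pairs.
    # Here each value is a numeric vector and each key a group label; we emit
    # (group, c(sum, count)) so that means can be combined in the reduce step.
    map <- expression({
      lapply(seq_along(map.values), function(i) {
        v <- map.values[[i]]
        rhcollect(map.keys[[i]], c(sum(v), length(v)))
      })
    })

    # Reduce expression: accumulate sums and counts for each group (reduce.key),
    # then emit the group mean in the post block.
    reduce <- expression(
      pre    = { total <- 0; count <- 0 },
      reduce = {
        total <- total + sum(sapply(reduce.values, function(x) x[1]))
        count <- count + sum(sapply(reduce.values, function(x) x[2]))
      },
      post   = { rhcollect(reduce.key, total / count) }
    )

    # Create the MapReduce job, run it on the cluster, and read the results
    # back from the HDFS into the R session (paths are illustrative).
    job <- rhmr(map = map, reduce = reduce,
                ifolder = "/tmp/groups/input", ofolder = "/tmp/groups/output",
                inout = c("sequence", "sequence"))
    rhex(job)
    means <- rhread("/tmp/groups/output")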

Degree

Ph.D.

Advisors

William S. Cleveland, Purdue University.

Subject Area

Statistics|Computer Engineering
