The Tessera D&R computational environment: Designed experiments for R-Hadoop performance and Bitcoin analysis

Jianfu Li, Purdue University

Abstract

D&R (divide and recombine) is a statistical framework that makes the analysis of large complex data feasible and practical. The analyst selects a division method to divide the data into subsets, applies an analytic method to each subset independently with no communication among subsets, and selects a recombination method that is applied to the subset outputs to form a result of the analytic method for the entire data. The computational tasking of D&R is nearly embarrassingly parallel, so D&R can readily exploit distributed, parallel computational environments such as our D&R computational environment, Tessera. In the first part of this dissertation, I present a study of the performance of the Tessera D&R computational environment through designed experiments. The base of the environment is RHIPE, the R and Hadoop Integrated Programming Environment. R is a widely used interactive language for data analysis. Hadoop is a distributed, parallel computational environment consisting of a distributed file system (HDFS) and a distributed compute engine (MapReduce). RHIPE is a merger of R and Hadoop. The D&R framework enables fast, embarrassingly parallel computation on a cluster, which can lead to small elapsed times when analytic methods are applied to all of the data. However, the elapsed time depends on many factors. The system we study is very complex, and so are the effects of the factors: there are interactions, but they are not well understood. We therefore run a full factorial experiment with replicates to understand them. In the second part of this dissertation, I present an analysis of Bitcoin transaction data using the Tessera D&R computational environment. Bitcoin is a decentralized digital currency system.
There is no central authority in the Bitcoin system to issue new money or validate the transfer of money; both tasks are accomplished through the joint work of participants in the Bitcoin network. In the past two years, the Bitcoin system has become very popular, mostly due to its ease of use and the anonymity embedded in the system. The ease of use of Bitcoin is straightforward. The anonymity of the Bitcoin system, on the other hand, is debatable and has drawn much attention from its user community as well as the research community. A certain level of anonymity exists in the Bitcoin system, but it might not be as invulnerable as one would hope. For one thing, the entire history of Bitcoin transactions is publicly available, which provides an opportunity for passive analysis of Bitcoin usage such as ours. I present here a study of the general statistical properties of Bitcoin transactions and of the usage of Bitcoin addresses. We have also built profiles for a few groups of popular addresses whose members share similar behavior. Furthermore, we provide a passive analysis of the anonymity of the Bitcoin system by proposing a classification model to identify payment and change in the majority of Bitcoin transactions.

Degree

Ph.D.

Advisors

Cleveland, Purdue University.

Subject Area

Statistics
