Date of Award


Degree Type


Degree Name

Doctor of Philosophy (PhD)


Interdisciplinary Studies

First Advisor

William S. Cleveland

Committee Chair

William S. Cleveland

Committee Member 1

James Eric Dietz

Committee Member 2

Eugene H. Spafford

Committee Member 3

Bowei Xi


This research has two main goals. The first is to design and construct a computational environment for studying large and complex datasets in the cybersecurity domain. The second is to analyse the Spamhaus blacklist query dataset, which includes uncovering the properties of blacklisted hosts and understanding how those hosts behave over time.

The analytical environment enables deep analysis of very large and complex datasets by exploiting the divide and recombine framework. The capability to analyse data in depth lets one go beyond summary statistics in research. This deep analysis is carried out at the finest level of granularity, without any compromise on the size of the data.

The environment is also fully capable of processing the raw data into a data structure suited for analysis.
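
The divide and recombine idea mentioned above can be illustrated with a minimal sketch. It is not the environment built in this research, only a toy under stated assumptions: the records, the choice of host as the dividing variable, and the helper names `divide`, `apply_method`, and `recombine` are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def divide(records, key):
    """Divide: split the records into subsets by a chosen variable."""
    subsets = defaultdict(list)
    for rec in records:
        subsets[key(rec)].append(rec)
    return dict(subsets)


def apply_method(subsets, method):
    """Apply the same analytic method to each subset independently."""
    return {k: method(rows) for k, rows in subsets.items()}


def recombine(outputs, combiner):
    """Recombine: merge the per-subset outputs into one result."""
    return combiner(list(outputs.values()))


# Hypothetical records: (queried host, queries seen in one interval).
records = [
    ("198.51.100.7", 12.0),
    ("198.51.100.7", 18.0),
    ("203.0.113.5", 30.0),
]

subsets = divide(records, key=lambda r: r[0])
per_host_mean = apply_method(subsets, lambda rows: mean(n for _, n in rows))
overall = recombine(per_host_mean, mean)
```

Because each subset is analysed on its own, the apply step can in principle run in parallel across subsets, which is what makes the approach attractive for very large datasets.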

Spamhaus is an organisation that identifies malicious hosts on the Internet. Information about malicious hosts is stored in a distributed database by Spamhaus and served through DNS query-response. Spamhaus and other malicious-host-blacklisting organisations have replaced the smaller malicious host databases that multiple organisations curated independently for their internal needs. Spamhaus services are popular due to their free access, exhaustive information, historical information, simple DNS-based implementation, and reliability. The malicious host information obtained from these databases is used as the first step in weeding out potentially harmful hosts on the Internet.
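
As a rough illustration of the DNS-based lookup described above, the sketch below builds the query name for an IPv4 address using the common DNS blacklist convention: the octets are reversed and the blacklist zone is appended. The zone `zen.spamhaus.org` and the function name `build_dnsbl_query` are assumptions made for this example, not details taken from the dataset. The sketch only constructs the name and sends no query.

```python
import ipaddress


def build_dnsbl_query(ip, zone="zen.spamhaus.org"):
    """Return the DNS name to look up for an IPv4 address.

    The zone default is an assumption for illustration only.
    """
    # Validate the input; raises ValueError if it is not an IPv4 address.
    addr = ipaddress.IPv4Address(ip)
    # Reverse the octets, e.g. 192.0.2.99 becomes 99.2.0.192.
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


query_name = build_dnsbl_query("192.0.2.99")
```

Under this convention, a client resolves the resulting name as an ordinary A record lookup. An answer typically indicates that the host is listed, and a nonexistent-domain response indicates that it is not.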

During the course of this research, a detailed packet-level analysis was carried out on the Spamhaus blacklist data. The query-responses were observed to display some peculiar behaviours. These anomalies were studied and modelled, and were found to show definite patterns. These patterns are empirical evidence of a systemic or statistical phenomenon.