Model Based Clustering Algorithms with Applications

Wutao Wei, Purdue University


In machine learning, unsupervised learning is applied when labels for the data are unavailable, laborious to obtain, or available for only a small proportion of the data. By understanding the special properties of the data and making reasonable assumptions, we can build models suited to them. In this thesis, we introduce three practical problems and discuss them in detail. The thesis produced three papers:

Wei, Wutao, et al. "A Non-parametric Hidden Markov Clustering Model with Applications to Time Varying User Activity Analysis." ICMLA 2015.
Wei, Wutao, et al. "Dynamic Bayesian predictive model for box office forecasting." IEEE Big Data 2017.
Wei, Wutao, Bowei Xi, and Murat Kantarcioglu. "Adversarial Clustering: A Grid Based Clustering Algorithm Against Active Adversaries." Submitted.

User Profiling Clustering: Activity data of individual users on social media are easily accessible in this big data era. However, proper modeling strategies for user profiles have not been well developed in the literature. Existing methods and models usually have two limitations. First, most methods target the population rather than individual users. Second, they cannot model non-stationary, time-varying patterns. Different users generally demonstrate different activity modes on social media, so a single population model may fail to characterize the activities of individual users. Furthermore, online social media are dynamic and ever evolving, and so are users’ activities, so dynamic models are needed to model them properly. In this paper, we introduce a non-parametric hidden Markov model to characterize the time-varying activities of social media users. In addition, based on the proposed model, we develop a clustering method to group users with similar activity patterns.

Adversarial Clustering: More and more data are being gathered for detecting and preventing cyber-attacks. Unique to cyber security applications, data analytics techniques have to deal with active adversaries that try to deceive the data analytics models and avoid being detected. The existence of such adversarial behavior motivates the development of robust and resilient adversarial learning techniques for various tasks. Most past work focused on adversarial classification techniques, which assume the existence of a reasonably large amount of carefully labeled data instances. In practice, however, labeling data instances often requires costly and time-consuming human expertise and becomes a significant bottleneck. Meanwhile, a large number of unlabeled instances can also be used to understand the adversaries' behavior. To address these challenges, we develop a novel grid based adversarial clustering algorithm. Our algorithm is able to identify the core normal regions and to draw defensive walls around the core positions of the normal objects using game theoretic ideas. It also identifies sub-clusters of attack objects, the overlapping areas within clusters, and outliers which may be potential anomalies.

Dynamic Bayesian Update for Profiling Clustering: The movie industry has become one of the most important consumer businesses, and it is increasingly competitive. Movie producers face large production and marketing costs, and theater owners face the problem of how to allocate a limited number of screens among the movies currently showing. However, current models in the movie industry can only give an estimate for the opening week. We improve the dynamic linear model with a Bayesian framework. Using this updating method, we are also able to update on streaming adversarial data and make defensive recommendations for defensive systems.
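As a rough illustration of what Bayesian updating in a dynamic linear model looks like, the sketch below implements the standard one-step update for the simplest case, a local-level model with known variances. It is not the model developed in the thesis: the names `LocalLevelState`, `dlm_update`, and `filter_series`, the diffuse prior, and the default variance values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LocalLevelState:
    """Gaussian belief about the latent level (hypothetical helper type)."""
    mean: float
    var: float


def dlm_update(state, y, obs_var, evol_var):
    """One Bayesian update step for a local-level dynamic linear model.

    Assumed model (illustrative only):
        y_t     = theta_t + v_t,      v_t ~ N(0, obs_var)
        theta_t = theta_{t-1} + w_t,  w_t ~ N(0, evol_var)
    """
    # Prior for theta_t given data up to t-1: the level evolves as a random walk.
    prior_mean = state.mean
    prior_var = state.var + evol_var

    # One-step-ahead forecast distribution for y_t.
    forecast_mean = prior_mean
    forecast_var = prior_var + obs_var

    # Posterior for theta_t after observing y_t.
    gain = prior_var / forecast_var
    post_mean = prior_mean + gain * (y - forecast_mean)
    post_var = prior_var * (1.0 - gain)

    return LocalLevelState(post_mean, post_var), forecast_mean, forecast_var


def filter_series(observations, m0=0.0, c0=1e6, obs_var=1.0, evol_var=0.1):
    """Run the update over a sequence, collecting one-step-ahead forecasts.

    m0 and c0 define an assumed diffuse prior; the variances are placeholders.
    """
    state = LocalLevelState(m0, c0)
    forecasts = []
    for y in observations:
        state, forecast_mean, _ = dlm_update(state, y, obs_var, evol_var)
        forecasts.append(forecast_mean)
    return state, forecasts
```

Each new observation refines the belief about the latent level, so forecasts can be revised as data stream in rather than being fixed after the first period. A realistic model would add regression and trend components and would estimate the variances instead of fixing them.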




Xi, Purdue University.

Subject Area

Statistics|Computer science
