Pattern exploration and event detection from geo-tagged tweets

Yuqian Huang, Purdue University

Abstract

Twitter is one of the most famous social networking services in the world. With 313 million monthly active users, Twitter produces around 6,000 tweets per second, which corresponds to around 500 million tweets per day and around 200 billion tweets per year. Besides being a successful company, Twitter provides a great opportunity to scientists from various disciplines. Twitter allows users to tweet with a location tag, which connects virtual networks to events happening in real life. Because of the massive amount of valuable geographic information, location-based services, targeted advertising, and social network studies could benefit considerably from the Twitter dataset. There are two primary objectives in this research. One is to identify the tweeting patterns of individual users; the other is to retrieve public events as well as to detect potential events. To identify the patterns of an individual user, this research selects the tweets from this user within a particular time period. The tweets are grouped by the hour of the day, and then the density-based spatial clustering of applications with noise (DBSCAN) method is applied to cluster the tweets from every hour. With this method, the tweets are classified into different clusters without predefining the number of clusters. By calculating the spatial and temporal probability of every cluster, the probability that the user appears in a particular area at a given time can be predicted. For event detection, the whole dataset is grouped by the day of the year, and the daily dataset is classified into clusters through ST-DBSCAN (Spatial-Temporal DBSCAN) to discover events. The word frequency of every cluster is analyzed. The Latent Dirichlet Allocation (LDA) algorithm is applied to every cluster to understand the potential topics.
The proposed workflows for these objectives are tested in four college cities: (1) West Lafayette, Indiana; (2) Bloomington, Indiana; (3) Ann Arbor, Michigan; and (4) Columbus, Ohio. The results and analyses are presented in this thesis. On this basis, several recommendations on producing better results and dealing with special cases are presented.
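
The individual-user workflow described in the abstract (group tweets by hour of day, then cluster each hour with DBSCAN) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the function names, the 200 m radius, the minimum of 3 points, and the sample coordinates are all assumptions chosen for illustration, and the per-cluster share of an hour's tweets is only a simple stand-in for the spatial and temporal probabilities, whose exact definition the abstract does not give.

```python
import math
from collections import Counter, defaultdict


def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(a))


def dbscan(points, eps_m, min_samples):
    """Plain DBSCAN. Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # A point counts as its own neighbor (distance 0).
        return [j for j in range(len(points))
                if haversine_m(points[i], points[j]) <= eps_m]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


def hourly_cluster_shares(tweets, eps_m=200.0, min_samples=3):
    """Group (hour, lat, lon) tweets by hour, cluster each hour, and return
    {hour: {cluster_id: share of that hour's tweets in the cluster}}.
    The share is an illustrative stand-in, not the thesis's probability."""
    by_hour = defaultdict(list)
    for hour, lat, lon in tweets:
        by_hour[hour].append((lat, lon))
    shares = {}
    for hour, pts in by_hour.items():
        labels = dbscan(pts, eps_m, min_samples)
        counts = Counter(label for label in labels if label != -1)
        shares[hour] = {c: n / len(pts) for c, n in counts.items()}
    return shares


# Hypothetical tweets from one user: (hour of day, latitude, longitude).
sample = [
    (9, 40.4237, -86.9212), (9, 40.4238, -86.9212), (9, 40.4237, -86.9213),
    (9, 40.4238, -86.9213), (9, 40.5000, -86.9000),
    (22, 40.4500, -86.9100), (22, 40.4501, -86.9100), (22, 40.4500, -86.9101),
]
shares = hourly_cluster_shares(sample)
```

The number of clusters is never passed in; it follows from the radius and minimum-point settings, which is the property the abstract highlights. The all-pairs neighbor search is quadratic and suits only small per-hour samples; a spatial index would be the usual choice for a full dataset.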

Degree

M.S.E.

Advisors

Shan, Purdue University.

Subject Area

Geographic information science

