Methods for analyzing of rankings and network intrusion detection

Paul Kidwell, Purdue University

Abstract

Part I. Ranking data, which result from m raters ranking n items, are difficult to visualize because of their discrete algebraic structure and the computational difficulties that arise when n is large. The problem becomes worse when raters provide tied rankings or do not rank all items. We develop an approach for visualizing ranking data for large n that is intuitive, easy to use, and computationally efficient. The approach overcomes the structural and computational difficulties by using a natural measure of dissimilarity between raters and projecting the raters into a low-dimensional vector space, where they are viewed. The visualization techniques are demonstrated using voting data, jokes, and movie preferences.

Ranking data are frequently encountered and are not easily modeled. Real-world ranking applications often introduce the additional data quality issues of ties or missing data. Previous modeling efforts have established non-parametric kernel estimation as an effective tool for modeling rankings. We propose a discrete analogue of the triangular kernel whose combinatorial and statistical properties allow the non-parametric approach to be applied efficiently in the case of ties and extended to missing data. The bias and variance can be computed exactly for a range of censoring schemes.

Part II. A rules-based statistical model (RBSM) identifies the packets in any TCP connection that are client keystrokes of an ssh login. The input data of the algorithm are the packet arrival times and the TCP/IP headers of the connection packets at a point along the path of the connection. The algorithm is applied to all connections seen by a network monitor. This forms a network login database that can be further analyzed for network security monitoring and forensics.

The model, which uses the packet sizes, direction, flags, and interarrival times, first passes through the packets to identify epochs of different activities, and then goes back and uses more detailed information for the classification. Performance on three types of packet traces is excellent. Previous work has proceeded by forming connection summary statistics from the headers and timestamps, and then using those statistics to classify the connection as one with keystrokes or not. The RBSM takes on the much more ambitious task of classifying each packet as a client keystroke packet or not, and in the end the classification of the connection has extremely low false positive and false negative rates.
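The Part I idea of measuring dissimilarity between raters and then projecting them into a low-dimensional space can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. The abstract does not name the dissimilarity measure or the projection, so the Kendall tau distance and classical multidimensional scaling used here are assumptions, and the function names (kendall_distance, top_eigenpair, embed_raters) are hypothetical. The sketch also assumes complete rankings without ties, which is simpler than the tied and partial rankings the abstract addresses.

```python
from itertools import combinations
import math


def kendall_distance(r1, r2):
    """Count item pairs that two full, tie-free rankings order oppositely.

    r[i] is the rank given to item i (1 = most preferred).
    """
    return sum(
        1
        for i, j in combinations(range(len(r1)), 2)
        if (r1[i] - r1[j]) * (r2[i] - r2[j]) < 0
    )


def top_eigenpair(B, iters=500):
    """Largest eigenvalue and eigenvector of a symmetric PSD matrix (power iteration)."""
    m = len(B)
    v = [float(i + 1) for i in range(m)]  # fixed start vector, for reproducibility
    for _ in range(iters):
        w = [sum(B[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # matrix is numerically zero: nothing left to extract
            return 0.0, [0.0] * m
        v = [x / norm for x in w]
    Bv = [sum(B[i][j] * v[j] for j in range(m)) for i in range(m)]
    return sum(v[i] * Bv[i] for i in range(m)), v


def embed_raters(rankings, dims=2):
    """Return the rater-by-rater distance matrix and low-dimensional coordinates."""
    m = len(rankings)
    D = [[kendall_distance(a, b) for b in rankings] for a in rankings]
    # Classical MDS: double-centre the squared distances. The Kendall count is
    # used directly as the squared distance (an assumption of this sketch).
    row = [sum(D[i]) / m for i in range(m)]
    total = sum(row) / m
    B = [[-0.5 * (D[i][j] - row[i] - row[j] + total) for j in range(m)]
         for i in range(m)]
    coords = [[] for _ in range(m)]
    for _ in range(dims):
        lam, v = top_eigenpair(B)
        scale = math.sqrt(max(lam, 0.0))
        for i in range(m):
            coords[i].append(scale * v[i])
        # Deflate so the next pass finds the next-largest eigenpair.
        B = [[B[i][j] - lam * v[i] * v[j] for j in range(m)] for i in range(m)]
    return D, coords


# Three hypothetical raters ranking three items.
rankings = [[1, 2, 3], [2, 1, 3], [3, 2, 1]]
D, coords = embed_raters(rankings)
for rater, point in zip(rankings, coords):
    print(rater, [round(c, 3) for c in point])
```

Each rater becomes one point in the plane, and raters whose rankings disagree on many item pairs land far apart. For large m, a library eigensolver would be a more practical choice than the power iteration used here to keep the sketch dependency-free.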

Degree

Ph.D.

Advisors

Lebanon, Purdue University.

Subject Area

Statistics
