STATISTICAL MODELS OF TEXT: A SYSTEM THEORY APPROACH (ZIPF'S, BRADFORD'S, LOTKA'S LAW, GENERATION)
Abstract
This study examines statistical models of text, including the laws proposed by Lotka [LOTK26], Bradford [BRAD34], and Zipf [ZIPF49], and the models proposed by Markov [MARK13], Mandelbrot [MAND53], and Simon-Yule [SIMO55]. A system theory approach is developed and applied, first, to show the equivalence of the three laws; second, to propose a multivariate representation of text that exhibits three important empirical properties (marginal skewness, the type-token relationship, and exponential gaps); and third, to compare four leading models with respect to the multivariate representation and select the Simon-Yule model as an appropriate statistical model of text. A modification of the selected model that performs better than the original is discussed. A further modification, relating the Simon-Yule model to computational models of text generation in artificial intelligence, is suggested for future research.
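For readers unfamiliar with the Simon-Yule model, the following is a minimal illustrative sketch of the basic process as it is commonly described, not the dissertation's own formulation or its modified model. It assumes a constant probability `alpha` of introducing a new word at each step; otherwise an earlier token is repeated, chosen uniformly from the text so far, so that a word is reused with probability proportional to its current frequency. The function name `simon_text`, the integer word labels, and the parameter values are hypothetical choices made for illustration.

```python
import random
from collections import Counter


def simon_text(n_tokens, alpha, seed=0):
    """Generate a token sequence with a basic Simon-type process.

    Assumption (illustrative only): at each step a brand-new word type is
    emitted with constant probability `alpha`; otherwise a previously
    emitted token is copied, chosen uniformly from the text so far.
    Word types are represented by integers 0, 1, 2, ...
    """
    if n_tokens <= 0:
        return []
    rng = random.Random(seed)
    tokens = [0]      # the text starts with a single word type
    next_type = 1     # label to use for the next new word type
    while len(tokens) < n_tokens:
        if rng.random() < alpha:
            # introduce a new word type
            tokens.append(next_type)
            next_type += 1
        else:
            # copy a uniformly chosen earlier token, which reuses each
            # word with probability proportional to its frequency so far
            tokens.append(rng.choice(tokens))
    return tokens


tokens = simon_text(20000, alpha=0.1)
# word frequencies in decreasing order (the rank-frequency list)
freqs = sorted(Counter(tokens).values(), reverse=True)
```

Plotting `freqs` against rank on log-log axes is one way to inspect the skewed rank-frequency behavior that the abstract refers to as marginal skewness, and `len(freqs)` against `len(tokens)` gives one point of a type-token relationship.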
Degree
Ph.D.
Subject Area
Industrial engineering