STATISTICAL MODELS OF TEXT: A SYSTEM THEORY APPROACH (ZIPF'S, BRADFORD'S, LOTKA'S LAW, GENERATION)

YE-SHO CHEN, Purdue University

Abstract

A study is made of statistical models of text, including the laws proposed by Lotka (LOTK26), Bradford (BRAD34), and Zipf (ZIPF49), and the models proposed by Markov (MARK13), Mandelbrot (MAND53), and Simon-Yule (SIMO55). A system theory approach is developed and applied: first, to show the equivalence of the three laws; second, to propose a multivariate representation of text which exhibits three important empirical properties: marginal skewness, type-token relationship, and exponential gaps; and third, to compare four leading models with respect to the multivariate representation and select the Simon-Yule model as an appropriate statistical model of text. A modification of the selected model, which gives better performance than the original one, is discussed. A further modification relating the Simon-Yule model to computational models of text generation in artificial intelligence is suggested for future research.
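
As an illustration of the kind of model the abstract refers to, the following is a minimal sketch of the basic Simon process for generating text. It is not the dissertation's modified model. The function name simon_text, the parameter name alpha (the probability that the next token is a new word), and the sample sizes are assumptions chosen for this example only.

```python
import random
from collections import Counter

def simon_text(n_tokens, alpha, seed=0):
    """Generate a token sequence with the basic Simon process (illustrative sketch).

    With probability alpha the next token is a new word type. Otherwise it
    repeats a token drawn uniformly from the text so far, so an existing
    word is reused with probability proportional to its current frequency.
    """
    rng = random.Random(seed)
    tokens = [0]      # the first token is necessarily a new type
    next_type = 1     # integer label for the next unseen word type
    while len(tokens) < n_tokens:
        if rng.random() < alpha:
            tokens.append(next_type)
            next_type += 1
        else:
            tokens.append(rng.choice(tokens))
    return tokens

# Hypothetical parameters, chosen only for illustration.
tokens = simon_text(n_tokens=20000, alpha=0.1)

# Word frequencies (tokens per type) and their rank-ordered list.
freqs = Counter(tokens)
ranked = sorted(freqs.values(), reverse=True)

# Frequency of frequencies: how many types occur exactly k times.
freq_of_freqs = Counter(freqs.values())

print("tokens:", len(tokens), "types:", len(freqs))
print("types occurring 1, 2, 3 times:",
      freq_of_freqs[1], freq_of_freqs[2], freq_of_freqs[3])
print("top five frequencies:", ranked[:5])
```

In a run of this sketch, most word types occur only once or twice while a few occur very often, which is the skewed shape that the laws of Lotka, Bradford, and Zipf describe. The ratio of types to tokens stays close to alpha, a simple instance of a type-token relationship.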

Degree

Ph.D.

Subject Area

Industrial engineering
