Motif discovery via context dependent models

Chuancai Wang, Purdue University

Abstract

Identifying transcription factor binding sites can give insight into the regulatory mechanisms of a genome. Given a group of genes that may be co-regulated, it is challenging to accurately identify the DNA binding sites of unknown transcription factors. Current matrix-based methods mainly describe a binding site motif by a position-specific weight matrix, which assumes that the motif positions are independent, and describe the nonsites (background) by either a common residue frequency model or a Markov model whose order is chosen roughly according to data size. These assumptions might not hold in practice. In this thesis, we characterize the motif by a series of position-dependent first-order Markov models, which capture both the position-specific features and the dependence between motif positions. We also propose a “step-up” testing procedure to determine the best-fitting background Markov order for the data. The proposed models and methods led to a novel motif discovery program, MDCDM, written in C. MDCDM is often better suited to discriminating true binding site motifs from background nonsites than the existing programs BioProspector, MotifSampler, and Gibbs Motif Sampler.
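
To make the idea of a position-dependent first-order Markov motif model concrete, here is a minimal sketch in Python. It is not MDCDM's implementation (which the abstract says is written in C). The function names, the toy sites, and the pseudocount of 1.0 are all assumptions chosen for illustration. Each motif position gets its own transition table conditioned on the preceding base, so the model can capture dependence that a position-specific weight matrix cannot.

```python
import math

BASES = "ACGT"

def fit_position_dependent_markov(sites, pseudocount=1.0):
    """Estimate a position-dependent first-order Markov model from aligned sites.

    Returns (init, trans): init[b] is P(first base = b), and
    trans[j][p][c] is P(base at position j+1 = c | base at position j = p).
    The pseudocount is an illustrative assumption, not taken from the thesis.
    """
    width = len(sites[0])

    # Distribution of the first motif position.
    init_counts = {b: pseudocount for b in BASES}
    for s in sites:
        init_counts[s[0]] += 1
    total = sum(init_counts.values())
    init = {b: n / total for b, n in init_counts.items()}

    # One separate transition table per pair of adjacent positions.
    trans = []
    for j in range(1, width):
        counts = {p: {c: pseudocount for c in BASES} for p in BASES}
        for s in sites:
            counts[s[j - 1]][s[j]] += 1
        table = {}
        for p, row in counts.items():
            row_total = sum(row.values())
            table[p] = {c: n / row_total for c, n in row.items()}
        trans.append(table)
    return init, trans

def log_likelihood(site, init, trans):
    """Log-probability of one candidate site under the fitted model."""
    ll = math.log(init[site[0]])
    for j in range(1, len(site)):
        ll += math.log(trans[j - 1][site[j - 1]][site[j]])
    return ll

# Toy aligned sites (hypothetical): position 2 depends on position 1
# (A is followed by C, T is followed by G).
sites = ["ACGT", "ACGA", "TGGT", "TGGA"]
init, trans = fit_position_dependent_markov(sites)

ll_consistent = log_likelihood("ACGT", init, trans)    # respects A -> C
ll_inconsistent = log_likelihood("AGGT", init, trans)  # breaks the dependence
print(ll_consistent > ll_inconsistent)
```

In this toy data, C and G are equally frequent at the second position, so a weight matrix that treats positions independently would score "ACGT" and "AGGT" identically. The first-order model prefers "ACGT" because it conditions on the preceding base.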

Degree

Ph.D.

Advisors

Xie, Purdue University.

Subject Area

Statistics

