Group transformation and identification with kernel methods and big data mixed logistic regression

Chao Pan, Purdue University


Exploratory Data Analysis (EDA) is a crucial step in the life cycle of data analysis. Exploring data with effective methods reveals the main characteristics of the data and provides guidance for model building. The goal of this thesis is to develop effective and efficient methods for data exploration in the regression setting. First, we propose to use optimal group transformations as a general approach for exploring the relationship between the predictor variables X and the response Y. This approach can be considered an automatic procedure for identifying the best characteristic of P(Y|X) under which the relationship between Y and X can be fully explored. The emphasis on group transformations allows the approach to recover true group structures among the predictors. We also develop kernel methods for estimating the optimal group transformations based on cross-covariance and conditional covariance operators, and we establish the statistical consistency of the estimates. We refer to the proposed framework and approach as the Optimal Kernel Group Transformation (OKGT) method. Second, we define the true additive group structure for OKGT when the response transformation is known, and we develop an effective penalized kernel regression method for identifying it. The procedure uses a novel penalty that we propose to control the complexity of additive group structures. This method is referred to as Additive Group Structure Identification (AGSI), and we establish its selection consistency. Finally, we construct the Hierarchical Mixed Logistic Regression Model (HMLRM) and propose to use it for exploring heterogeneity in big data. By explicitly modeling the hidden layer, we individualize the calculation of the probability that a sample belongs to a subpopulation. The separability of the parameter space is exploited when estimating the model parameters with the EM algorithm. To apply HMLRM to big data, we design a distributed algorithm for model estimation and implement it in Apache Spark.




Zhu, Purdue University.

Subject Area

