DOI

10.5703/1288284318532

Description

Ensuring the robustness of data preprocessing pipelines is essential for maintaining reliable machine learning model performance in the face of real-world data shifts. Traditional methods optimize preprocessing sequences for specific datasets but often overlook their vulnerability to future data variations. This research introduces a vulnerability score that quantifies the susceptibility of preprocessing components to data shift. We propose a linear regression approach to establish a predictive relationship between the vulnerability of pipeline components and changes in the model's performance. The fitted relationships serve as explanations for practitioners of the system and help them quantify the pipeline's robustness to data shift. For a given pipeline, we generate an explanation that highlights a tolerable threshold beyond which a component is considered shift-vulnerable and likely to contribute to performance degradation. For shift-vulnerable scenarios, we further suggest a new pipeline for system maintainers that preserves model performance without retraining. The proposed framework delivers a risk-aware assessment, empowering practitioners to anticipate potential performance changes and adapt their pipeline strategies accordingly. Experiments on several real-world datasets produce valid explanations of pipeline robustness and demonstrate the opportunities in this field of research.
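The core idea above can be illustrated with a minimal sketch: fit a linear relationship between a component's vulnerability score and the observed performance drop, then invert it to obtain a tolerable vulnerability threshold. The data values, tolerance level, and variable names here are hypothetical, purely for illustration; the abstract does not specify the actual scoring or fitting procedure.

```python
import numpy as np

# Hypothetical per-shift measurements for one preprocessing component
# (e.g., a scaler): vulnerability score vs. observed accuracy drop.
vulnerability = np.array([0.05, 0.10, 0.20, 0.35, 0.50, 0.70])
accuracy_drop = np.array([0.00, 0.01, 0.03, 0.06, 0.10, 0.15])

# Fit the predictive linear relationship (the "explanation"):
# predicted_drop = slope * vulnerability + intercept
slope, intercept = np.polyfit(vulnerability, accuracy_drop, 1)

# Tolerable threshold: the vulnerability at which the predicted drop
# reaches a chosen tolerance (here, 5 percentage points of accuracy).
tolerance = 0.05
threshold = (tolerance - intercept) / slope

print(f"slope={slope:.3f}, threshold={threshold:.3f}")
```

A component whose measured vulnerability under a new data distribution exceeds `threshold` would be flagged as shift-vulnerable, signaling that the pipeline may need the suggested replacement rather than model retraining.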


Explanations for Machine Learning Pipelines under Data Drift
