Abstract
Accurate crop yield prediction is increasingly important in the face of global food security challenges, which makes standardized data and scalable models essential. To support this need, researchers have developed CY-Bench (Crop Yield Benchmark), a comprehensive dataset for forecasting maize and wheat yields at a global scale. This research project used the CY-Bench dataset with the aim of improving crop yield prediction through machine learning. First, papers describing the CY-Bench dataset and related work on agricultural modeling were studied in detail. The benchmark results were then reproduced, which demonstrated the accessibility of the dataset. Building on this foundation, new models such as regression trees were developed, and newly derived features such as Leaf Area Index and Evapotranspiration were incorporated into existing models. To streamline development, a subset covering Tippecanoe County was isolated from the broader US dataset. With the newly engineered features, the mean absolute percentage error (MAPE) decreased for two of the three models. These initial results are promising and leave room for further improvement: additional features and more advanced models could further improve prediction accuracy.
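For readers unfamiliar with the error metric reported above, the following is a minimal sketch of how MAPE (mean absolute percentage error) is commonly computed. The function name `mape` and all yield values are illustrative assumptions; they are not taken from CY-Bench or from the results of this project.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned in percent.

    Assumes every actual value is non-zero, which holds for
    positive yields; a zero actual value would divide by zero.
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical yields (t/ha) for three seasons; made-up numbers.
actual = [10.0, 8.0, 12.5]
baseline = [9.0, 9.0, 12.5]        # predictions without the extra features
with_features = [9.5, 8.4, 12.5]   # predictions with the extra features

print(f"baseline MAPE:      {mape(actual, baseline):.2f}%")
print(f"with-features MAPE: {mape(actual, with_features):.2f}%")
```

A lower MAPE means the predictions are, on average, closer to the observed yields in relative terms, which is the sense in which a model is said to improve here.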
Keywords
Crop Yield Prediction, Machine Learning, Feature Engineering, CY-Bench Data
Date of this Version
8-4-2025
Recommended Citation
Charan, Vaibhav and Poudel, Pratishtha, "Crop Yield Prediction at Multiple Spatial Scales with Statistical Machine Learning" (2025). Discovery Undergraduate Interdisciplinary Research Internship. Paper 66.
https://docs.lib.purdue.edu/duri/66
Included in
Agriculture Commons, Computer Sciences Commons, Data Science Commons, Statistical Models Commons