Organizing historical agricultural data and identifying data integrity zones to assess agricultural data quality

Elizabeth Marie Hawkins, Purdue University


As precision agriculture transitions into decision agriculture, data-driven decision-making has become the focus of the industry, and data quality will be increasingly important. Traditionally, yield data cleaning techniques have removed individual data points based on criteria primarily focused on the yield values. However, when these methods are used, the underlying causes of the errors are often overlooked; as a result, these techniques may fail to remove all of the inaccurate data or may remove “good” data. As part of this research, an alternative to data cleaning was developed. Data integrity zones (DIZ) within each field were identified by looking at metadata, which included data collected by the combine that reported the operating conditions of the machinery (i.e., travel speed, crop mass flow), data about the field environment (i.e., soil type, topography, weather), and data on field operations (e.g., field logs, as-applied maps). Ten years of historical data from the Southeast Purdue Agricultural Center (5 years of corn and 4 years of soybeans) and the Northeast Purdue Agricultural Center (1 year of corn) were used for analysis. Data in DIZ were isolated using buffers, and the analysis of the reduced datasets was compared to that of the raw data. The amount of data removed depended on the amount of variation in the field: approximately 70% for the 14.5-acre SEPAC J4 field and 30% for the 25-acre NEPAC S13 field. Statistical comparisons showed that the mean yield estimate increased by an average of 22 bu/ac for corn and 3 bu/ac for soybeans when DIZ data was used instead of raw data. On average, the standard deviation decreased by 24 bu/ac for corn and 5 bu/ac for soybeans, indicating that the data collected in these zones was more consistent and contained less noise and fewer errors. The average change in the standard error of the mean was 0.08 for corn and 0.09 for soybeans when the DIZ data was used.
The temporal yield indices for each soil type zone showed that the yield responses were more stable when DIZ data was used for analysis instead of raw data. The estimates provided by these smaller, more accurate datasets are also more likely to be representative of the treatments being compared. Data collected inside DIZ was also compared with data collected outside of DIZ. The non-DIZ data contained much more variation in two key measurements that have been shown to affect data quality: combine travel speed (the CV was 2.2 times higher for corn non-DIZ data and 1.5 times higher for soybean non-DIZ data) and crop mass flow rate (the CV was 2.6 times higher for corn non-DIZ data and 1.6 times higher for soybean non-DIZ data). This alternative to data cleaning effectively removed errors and artifacts from yield data. When these reduced datasets are used to analyze historical yield data over time, they may provide a clearer picture of true yield effects; this will improve decisions on input and resource allocation, support wiser adoption of precision agricultural technologies, and refine future data collection.
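The summary statistics used in these comparisons (mean, standard deviation, standard error of the mean, and coefficient of variation) can be illustrated with a minimal Python sketch. It is not the dissertation's analysis code: the function names `summarize` and `cv_ratio` are hypothetical, the sample standard deviation is an assumed choice, and the travel speed values are invented purely for illustration.

```python
import math
import statistics


def summarize(values):
    """Return n, mean, sample SD, standard error of the mean, and CV."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumed choice)
    return {
        "n": n,
        "mean": mean,
        "sd": sd,
        "sem": sd / math.sqrt(n),  # standard error of the mean
        "cv": sd / mean,           # coefficient of variation
    }


def cv_ratio(non_diz, diz):
    """How many times larger the CV is outside DIZ than inside DIZ."""
    return summarize(non_diz)["cv"] / summarize(diz)["cv"]


# Invented combine travel speeds, for illustration only (not study data).
diz_speed = [4.0, 4.1, 3.9, 4.0, 4.2, 3.8]
non_diz_speed = [2.0, 4.5, 1.0, 5.0, 3.0, 4.0]

if __name__ == "__main__":
    print(summarize(diz_speed))
    print(summarize(non_diz_speed))
    print(cv_ratio(non_diz_speed, diz_speed))
```

A ratio above 1 from `cv_ratio` corresponds to the pattern reported above, in which the non-DIZ data is relatively more variable than the DIZ data.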




Buckmaster, Purdue University.

Subject Area

Agricultural engineering|Information science
