Explainable machine learning models of major crop traits from satellite-monitored continent-wide field trial data

Newman, Saul Justin; Furbank, Robert T.

doi:10.1038/s41477-021-01001-0

Article
Published: 04 October 2021

Explainable machine learning models of major crop traits from satellite-monitored continent-wide field trial data

Nature Plants volume 7, pages 1354–1363 (2021)Cite this article

4404 Accesses
24 Citations
27 Altmetric
Metrics details

Subjects

An Author Correction to this article was published on 11 January 2022

This article has been updated

Abstract

Four species of grass generate half of all human-consumed calories. However, abundant biological data on species that produce our food remain largely inaccessible, imposing direct barriers to understanding crop yield and fitness traits. Here, we assemble and analyse a continent-wide database of field experiments spanning 10 years and hundreds of thousands of machine-phenotyped populations of ten major crop species. Training an ensemble of machine learning models, using thousands of variables capturing weather, ground sensor, soil, chemical and fertilizer dosage, management and satellite data, produces robust cross-continent yield models exceeding R² = 0.8 prediction accuracy. In contrast to ‘black box’ analytics, detailed interrogation of these models reveals drivers of crop behaviour and complex interactions predicting yield and agronomic traits. These results demonstrate the capacity of machine learning models to interrogate large datasets, generate new and testable outputs and predict crop behaviour, highlighting the powerful role of data in the future of food.

Access through your institution

Buy or subscribe

This is a preview of subscription content, access via your institution

Access options

Access through your institution

Buy this article

Purchase on Springer Link
Instant access to full article PDF

Buy now

Prices may be subject to local taxes which are calculated during checkout

**Fig. 1: NVTs and large-scale environmental patterns captured by satellite.**

**Fig. 2: Schematic of model training and evaluation.**

**Fig. 3: Accuracy of yield models subject to holdout trial prediction of 100 random unobserved field trials.**

**Fig. 4: Cross-model and cross-species shifts in accuracy under rolling annual forecast prediction.**

**Fig. 5: Concordance and pattern of variable importance across ML models.**

A multiple species, continent-wide, million-phenotype agronomic plant dataset

Article Open access 23 April 2021

An interaction regression model for crop yield prediction

Article Open access 07 September 2021

Machine Learning Approach for Prescriptive Plant Breeding

Article Open access 20 November 2019

Data availability

All data are available from the Supplementary Information, the linked database descriptor publication⁵ uploaded to Scientific Data and the figshare⁷ repository, after screening under our own extensive imputations and quality controls and freely available for research or non-commercial purposes under a CC-BY-NC 3.0 license. Some data available in this repository⁵ are, alternately, available as a dataset on the Grains Research and Development Corporation website published under a CC-BY-NC 3.0 AU license. The dataset which forms the basis (in whole or part) for this paper is based predominantly on data sourced from the Grains Research & Development Corporation (GRDC) and GRDC’s extensive investment in the collection, development and presentation of that dataset is acknowledged. GRDC did not authorise the reproduction, publication or communication of the dataset. The dataset has not been subject to GRDC’s quality control processes, and does not include updates and corrections that have been made to the dataset and as such may be unreliable. Results of research based on the dataset should not be relied on for any purpose. Any person wishing to conduct research using National Variety Trials (NVT) data must approach GRDC directly with a research proposal, noting that terms and conditions may apply. Alternately, the extensively quality-controlled and imputed figshare repository remains freely accessible for research under a creative commons license⁷.

Code availability

All codes are available in Supplementary Codes 1 and 2.

Change history

11 January 2022
A Correction to this paper has been published: https://doi.org/10.1038/s41477-022-01096-z

References

United Nations Department of Economic and Social Affairs. World Population Prospects: 2015 Revision (United Nations, 2016).
Burgueño, J., de los Campos, G., Weigel, K. & Crossa, J. Genomic prediction of breeding values when modeling genotype × environment interaction using pedigree and dense molecular markers. Crop Sci. 52, 707–719 (2012).
Article Google Scholar
Cabrera-Bosquet, L., Crossa, J., von Zitzewitz, J., Serret, M. D. & Luis Araus, J. High-throughput phenotyping and genomic selection: the frontiers of crop breeding converge. J. Integr. Plant Biol. 54, 312–320 (2012).
Zamir, D. Where have all the crop phenotypes gone? PLoS Biol. 11, e1001595 (2013).
Article CAS Google Scholar
Newman, S. J. & Furbank, R. T. A multiple species, continent-wide, million-phenotype agronomic plant dataset. Sci. Data 8, 116 (2021).
NVT Protocols v1.1. 75 (Grains Research and Development Corporation) https://nvt.grdc.com.au/trials/nvt-protocols (2020).
Newman, S. J. & Furbank, R. T. Continent-wide Agronomic Experiment Data (figshare, 2021); https://doi.org/10.6084/m9.figshare.c.5296369
Justice, C. O. et al. Land and cryosphere products from Suomi NPP VIIRS: overview and status. J. Geophys. Res. Atmos. 118, 9753–9765 (2013).
Article Google Scholar
Cohen, W. B. & Justice, C. O. Validating MODIS terrestrial ecology products: linking in situ and satellite measurements. Remote Sens. Environ. 70, 1–3 (1999).
Article Google Scholar
Wan, Z., Zhang, Y., Zhang, Q. & Li, Z.-L. Quality assessment and validation of the MODIS global land surface temperature. Int. J. Remote Sens. 25, 261–274 (2004).
Article Google Scholar
Huete, A. et al. Overview of the radiometric and biophysical performance of the MODIS vegetation indices. Remote Sens. Environ. 83, 195–213 (2002).
Article Google Scholar
Schroeder, W., Oliva, P., Giglio, L. & Csiszar, I. A. The New VIIRS 375m active fire detection data product: algorithm description and initial assessment. Remote Sens. Environ. 143, 85–96 (2014).
Article Google Scholar
Deng, H. Interpreting tree ensembles with inTrees. Int. J. Data Sci. Anal. 7, 277–287 (2019).
Article Google Scholar
Breiman, L. & Cutler, A. randomForest: Breiman and Cutler’s random forests for classification and regression. R package version 4.6-14 https://cran.r-project.org/web/packages/randomForest/randomForest.pdf (2012).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Article Google Scholar
Friedman, J. H. Greedy function approximation: a gradient boosting machine. Ann. Stat. 29, 1189–1232 (2001).
Article Google Scholar
Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. Classification and Regression Trees (Chapman and Hall, 1984).
Cortes, C. & Vapnik, V. Support-vector networks. Mach. Learn. 20, 273–297 (1995).
Article Google Scholar
Steinwart, I. & Thomann, P. liquidSVM. R package version 1.2.4 https://cran.r-project.org/package=liquidSVM (2017).
Mevik, B. H. & Wehrens, R. The pls package: principal component and partial least squares regression in R. J. Stat. Softw. https://doi.org/10.18637/jss.v018.i02 (2007).
Schymanski, S. J., Or, D. & Zwieniecki, M. Stomatal control and leaf thermal and hydraulic capacitances under rapid environmental fluctuations. PLoS ONE 8, e54231 (2013).
Article CAS Google Scholar
Vialet-Chabrand, S. & Lawson, T. Dynamic leaf energy balance: deriving stomatal conductance from thermal imaging in a dynamic environment. J. Exp. Bot. 70, 2839–2855 (2019).
Article CAS Google Scholar
Gonzalez-Dugo, M. P. et al. A comparison of operational remote sensing-based models for estimating crop evapotranspiration. Agric. Meteorol. 149, 1843–1853 (2009).
Article Google Scholar
Food Balances (2014-) (FAO, 2016); http://faostat3.fao.org/download/FB/FBS/E
Sánchez-Azofeifa, A. et al. Estimation of the distribution of Tabebuia guayacan (Bignoniaceae) using high-resolution remote sensing imagery. Sensors 11, 3831–3851 (2011).
Article Google Scholar
Furbank, R. T., Sirault, X. R. R. & Stone, E. in Sustaining Global Food Security: The Nexus of Science and Policy (ed. Zeigler R. S.) 203–223 (CSIRO Publishing, 2019).
Alcorn, M. A. et al. in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition 4840–4849 (IEEE, 2019). https://doi.org/10.1109/CVPR.2019.00498
Holloway, E. The unlearnable checkerboard pattern. Commun. Blyth Inst. 1, 78–80 (2019).
Article Google Scholar
Mohri, M. & Medina, A. M. in Algorithmic Learning Theory (eds Bshouty, N.H. et al.)124–138 (Springer Verlag, 2012).
Lehmann, J., Coumou, D. & Frieler, K. Increased record-breaking precipitation events under global warming. Clim. Change 132, 501–515 (2015).
Article Google Scholar
Westra, S. & Sisson, S. A. Detection of non-stationarity in precipitation extremes using a max-stable process model. J. Hydrol. 406, 119–128 (2011).
Article Google Scholar
Vaze, J. et al. Climate non-stationarity—validity of calibrated rainfall-runoff models for use in climate change studies. J. Hydrol. 394, 447–457 (2010).
Verdon-Kidd, D. C. & Kiem, A. S. Quantifying drought risk in a nonstationary climate. J. Hydrometeorol. 11, 1019–1031 (2010).
Article Google Scholar
Milly, P. C. D. et al. Climate change: stationarity is dead: whither water management? Science 319, 573–574 (2008).
Article CAS Google Scholar
Rosenzweig, C. et al. in Climate Change 2007: Impacts, Adaptation and Vulnerability (eds Parry, M. L. et al.) 79–131 (Cambridge Univ. Press, 2007).
Sun, F., Roderick, M. L. & Farquhar, G. D. Rainfall statistics, stationarity, and climate change. Proc. Natl Acad. Sci. USA 115, 2305–2310 (2018).
Article CAS Google Scholar
Lenaerts, B., Collard, B. C. Y. & Demont, M. Improving global food security through accelerated plant breeding. Plant Sci. 287, 110207 (2019).
McCarl, B., Villavicencio, X. & Wu, X. Climate change and future analysis: is stationarity dying? Am. J. Agric. Econ. 90, 1241–1247 (2008).
Article Google Scholar
Towell, G. G. & Shavlik, J. W. Knowledge-based artificial neural networks. Artif. Intell. 1, 119–165 (1994).
Article Google Scholar
Bae, J. K. & Kim, J. Combining models from neural networks and inductive learning algorithms. Expert Syst. Appl. 38, 4839–4850 (2011).
Article Google Scholar
Jamshidian, M. & Jalal, S. Tests of homoscedasticity, normality, and missing completely at random for incomplete multivariate data. Psychometrika 75, 649–674 (2010).
Article Google Scholar
Dong, Y. & Peng, C.-Y. J. Principled missing data methods for researchers. Springerplus 2, 222 (2013).
Article Google Scholar
R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2013).
Wright, M. N. & Ziegler, A. Ranger: a fast implementation of random forests for high dimensional data in C++ and R. J. Stat. Softw. 77, 1–17 (2017).
Article Google Scholar

Download references

Acknowledgements

We acknowledge funding from the Australian Research Council Centre of Excellence for Translational Photosynthesis (CE140100015). We wish to acknowledge the hard work of the many researchers and agronomists who collected the historical agronomic data for the Grains Research and Development Corporation National Variety Trials and formerly made these data freely available under a CC-BY-NC license. Without their previous generosity, data reprocessed and used in this work would remain unavailable and invisible, to both scientists and the farmers who fund the NVTs.

Author information

Authors and Affiliations

ARC Centre of Excellence for Translational Photosynthesis, Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia
Saul Justin Newman & Robert T. Furbank
Biological Data Science Institute, Australian National University, Canberra, Australian Capital Territory, Australia
Saul Justin Newman
Leverhulme Center for Demographic Science, University of Oxford, Oxford, UK
Saul Justin Newman

Authors

Saul Justin Newman
View author publications
You can also search for this author in PubMed Google Scholar
Robert T. Furbank
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

S.J.N. conceived and designed the study, wrote the code, performed the analysis, designed and plotted the figures and cowrote the manuscript. R.F. cowrote the manuscript and contributed to the experiment design.

Corresponding author

Correspondence to Saul Justin Newman.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Plants thanks Manuel Marcaida and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Environmental and vegetation patterns captured by remote sensing.

Remote-sensing arrays capture seasonal patterns across a season, from sowing (red lines) through anthesis or flowering (blue lines), to harvest (orange lines), in diverse variables such as photosynthesis and leaf area (top row), vegetation indices and land surface temperatures (second row), cloud cover and raw reflectance values (third to fifth rows), infrared bands, latent heat flux, and fire events (fifth row). Variation shown for a single site (Turretfield 2010; Main Season planting), abbreviations as in Supplementary Table 2.

Extended Data Fig. 2 Learning curves for models constructed using progressive seasonal data.

Models constructed to predict yield from a given point in the season relative to sowing (x axis), in a wheat and b canola, trained on 2008–2017 data and used to predict 2018 season data. As the season progresses, later-season data is included and models improve in forecast accuracy (y axes) with xvBCRF models (black) outperforming PLSR models (blue) and LSVMs (orange) over almost all forecast horizons.

Extended Data Fig. 3 Prediction accuracy of BCRF models in diverse phenotypes under random holdout trial prediction.

Using complete season data from 90 days before to 270 days after sowing to train BCRF models, tested in 100 random holdout sites, allows the prediction of key agronomic traits such as: a grain protein content, b time to 50% flowering, and c Glucosinolate oil content (Canola only). However, BCRF models trained under the same criteria were of limited or no use for predicting grain size or volume metrics such as: d Hectolitre weight, e the fraction of grain below 2.0 mm and f thousand-grain weight. Numbers N indicate sample sizes across predicted holdout sites.

Extended Data Fig. 4 Prediction accuracy of annual forecast BCRF models trained using species-specific data.

Prediction accuracy of models trained using data to 200 DAS from within species, rather than cross-species models, provide good accuracy for projecting yield in a Canola (N = 608), b Chickpeas (N = 243), c Faba Beans (N = 188), d Lupins (N = 403), e Oats (N = 216), f Field Peas (N = 230), g Lentils (N = 331), h Triticale (N = 266), and i Wheat (N = 2296). Test data are for the latest observed year containing successful trials in each species.

Extended Data Fig. 5 Model complexity and accuracy excluding individual domains of input data.

Predictive accuracy within models, such as the a RPRM and b extreme gradient boosting models shown, was reduced by different rates under leave-one-out testing. Across all levels of model complexity (x axis), removal of satellite data (blue) produced the greatest reduction in accuracy relative to models trained using all available data (black circles). Exclusion of weather-station data (orange), metadata (pink), and management data (green) imparted similar and more limited costs in accuracy. Crossvalidation error rates (y axis, a) is the inverse of R² values and RMSE (y axis, b) is the root mean squared error for random holdout observations.

Extended Data Fig. 6 Model complexity and accuracy using single domains of input data.

Predictive accuracy within models, such as the a RPRM and b extreme gradient boosting models shown, was driven by inclusion of certain domains. Relative to models trained using all available data (black circles), satellite data alone (blue) produced models of greater accuracy than weather-station data alone (orange), metadata alone (pink) and management data alone (green) across all levels of model complexity (x axis). Crossvalidation error rates (y axis, a) is the inverse of R² values, RMSE (y axis, b) is the root mean squared error for random holdout observations.

Extended Data Fig. 7 Pruned decision tree for wheat yield.

Pruned decision tree predicting yield variation (green colour scale, t/Ha shown in boxes) from all input data to 200 DAS, illustrating the readable heuristics generated by tree models. Model accuracy under annual forecast prediction is R² = 0.79; RMSE = 1.13. Days after Sowing (DAS), Time of Sowing (TOS), Days before sowing (DBS), Middle Infrared (MIR), Enhanced Vegetation Index (EVI), Normalized Differenced Vegetation Index (NDVI), Leaf Area Index (LAI), MODIS Reflection Band (RB), all values rounded to 2 significant digits.

Extended Data Fig. 8 Pruned decision tree for wheat yield trained on data to time of sowing.

A pruned decision tree for wheat yield, constructed using data available at the time of sowing, illustrating the readable output of decision tree models (annual forecast accuracy R² = 0.3). Cumulative sum (CS), Monoammonium Phosphate (MAP), MODIS middle infrared reflectance band (MIR).

Extended Data Fig. 9 Flowering-window minimum temperatures and yield.

Average wheat yields (y axis) observed for a given variety-site combination exhibit a counterintuitive relationship with frost severity during flowering (here defined as 110–140 Days after Sowing, x axis): more severe frosts predicting higher yields (blue curve; locally weighted smoothed spline; N = 54,658). This may reflect a survival bias resulting from the absence of reporting zero-yielding trials, where only high-yielding crops survive extreme low temperatures.

Extended Data Fig. 10 Distribution of model accuracy from imputation error and noise.

There were substantial differences in the sensitivity of model predictions to imputation noise under both random-sample imputation (orange) and random forest imputation (blue) under forward-projection accuracy (N = 25 prediction models per boxplot; data presented as median values with boxes for IQR and whiskers for 95% CIs). Models with the lowest sensitivity to imputation-based variability were stratified sampling BCRFs and linear support vector machines (LSVMs).

Supplementary information

Supplementary Information

Supplementary Figs. 1–10 and Tables 1–7.

Reporting Summary

Supplementary Tables

Word file of Tables 1–7 (non-display items).

Supplementary Software

Supplementary Code 1.

Supplementary Software

Supplementary Code 2.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Newman, S.J., Furbank, R.T. Explainable machine learning models of major crop traits from satellite-monitored continent-wide field trial data. Nat. Plants 7, 1354–1363 (2021). https://doi.org/10.1038/s41477-021-01001-0

Download citation

Received: 30 November 2020
Accepted: 02 August 2021
Published: 04 October 2021
Issue Date: October 2021
DOI: https://doi.org/10.1038/s41477-021-01001-0