Utilising random forests in the modelling of Eragrostis curvula presence and absence in an Australian grassland system

Eragrostis curvula is an agronomically and ecologically undesirable perennial tussock grass dispersed across Australia. The objective of this study is to investigate relationships of ecologically relevant abiotic variables with the presence of E. curvula at a landscape scale in the Snowy Monaro region, Australia. Through vegetation surveys across 21 privately owned properties and freely available ancillary data on E. curvula presence, we used seven predictor variables, including Sentinel 2 NDVI reflectance, topography, distance from roads and watercourses and climate, to predict the presence or absence of E. curvula across its invaded range using a random forest (RF) algorithm. Assessment of performance metrics resulted in a pseudo-R squared of 0.96, a kappa of 0.97 and an R squared for out-of-bag samples of 0.67. Temperature had the largest influence on the model’s performance, followed by linear features such as highways and rivers. Highways’ high importance in the model may indicate that the presence or absence of E. curvula is related to the density of human transit, thus as a vector of E. curvula propagule dispersal. Further, humans’ tendency to reside adjacent to rivers may indicate that E. curvula’s presence or absence is related to human density and E. curvula’s potential to spread via water courses.

Eragrostis curvula is a species that has attracted government, public and scientific interest due to its impacts on ecological integrity and agricultural productivity [1][2][3] .E. curvula is a C4 perennial tussock grass in the family Poaceae.The Eragrostis genus has 300 species worldwide, with 66 present in Australia, 52 native, and 14 considered non-native species 4 .E. curvula was intentionally introduced into Australia as early as 1910 by the Australian Government as part of a larger scheme to introduce soil-stabilising plant species that may also offer a source of feed to grazing livestock 5,6 .In Australia, E. curvula has become an undesirable plant species in agricultural and ecological contexts, impacting species diversity, richness, and agricultural productivity 7,8 .However, evidence suggests that E. curvula can sustain agricultural enterprises but requires intensive management to be a moderate pasture grass, regarding sustaining livestock health and weight for productive agricultural output while still impacting native species diversity 5,6,9,10 .Since its introduction, E. curvula has spread and is now present in every state in Australia, including high densities in agricultural regions 7,8,11 .
Factors driving E. curvula's ability to distribute so widely in Australia and across the world include its high propagule pressure and phenotypic plasticity, allowing it to establish itself in a wide range of climatic conditions 6,10,[12][13][14] .The extent to which a species can alter its morphological or physiological traits to suit various landscape conditions defines the amount of phenotypic plasticity a species possesses [14][15][16] .Firn et al. 14 found E. curvula to have greater variability in six measured plant traits when growing conditions were altered compared to two other co-occurring native species (Eragrostis sororia and Aristida personata), indicating that E. curvula has greater phenotypic plasticity.However, this phenotypic plasticity has been shown to act as a double-edged sword; where when nutrients are applied to E. curvula, it will increase its resource use efficiency, thus absorbing nutrients desirable to foraging organisms and becoming a more desirable forage for herbivores and subsequent decrease in E. curvula abundance 14 .The findings by Firn et al. 14 support the ability of E. curvula to adapt its phenological development to a wide range of conditions but also offer unique insight into potential management strategies.More broadly, E. curvula's continent-wide distribution may result from its ability to adapt to various stressful and varied environmental conditions 16 .This phenotypic plasticity appears common across cultivars

Study area
The extent of the study area covers the New South Wales Local Government Area of Snowy Monaro Regional Council (hereafter SMRC), which has an approximate area of 1,516,000 hectares (top: −35.579312, right: 149.602741, bottom: −37.262980, left: 148.200678) (Fig. 1).The mean annual temperature within the SMRC between 1976 and 2005, derived from monthly mean temperature data, is 9.4 °C, with a standard deviation of 1.5 °C37 .The notable rivers of the region are the Murrumbidgee and Snowy, and many more sub-water ways connecting to the river systems.Two major highways run within the SMRC; the Monaro highway runs approximately north-south, and Snowy Mountains highway runs from the northwest, meeting up with the Monaro highway at Cooma, then continues east approximately 9 km south of Nimmitabel, which provides transport of livestock, fodder, residence and tourists.The SMRC ranges in elevation from a minimum of 213 m above sea level to a maximum of 2223 m above sea level 38,39 .The region's mean elevations are approximately 1000 m above sea level, with a standard deviation of 259.2 m 38,39 .The region's higher elevations are located towards the east along the portion of the Great Dividing Range commonly known as the Australian Alps and home to Mt Kosciuszko.

Field method
Field data collection occurred from September to December 2019 and again from August to October 2021 (Supplementary Information).The 2019 data collection method involved placing a transect in a location representing the vegetation community of interest and perpendicular to established roads of varying use.Transects were established in areas that contained either high or no/low density of E. curvula.Low-density sites are defined as not having any E. curvula present, or where E. curvula is present; it is not present in swards larger than 10 m 2 .High-density sites are defined as having E. curvula in dense swards covering the majority of the sampled site.The transects were 100 m long with 20, 50 × 50 cm quadrats stratified along the length, with GPS points recorded at each quadrat (Fig. 2).The number of replicates at each site depended on the property's area, with larger properties having more transects, with between 20 and 80 quadrats per property.
The 2021 data collection method differed from that of 2019 in that out of the 21 properties originally sampled, six were selected as sites having a particularly high or low density of E. curvula or sites that have areas of high E. curvula density in some locations within the property and low density in others.At these sites, three areas of at least five hectares were outlined as a polygon using QGIS software, and ten points were randomly chosen within each polygon, with no points being within 20 m of another (Fig. 2).At the location of each of the points, a 1 × 1-m quadrat was placed, and a vegetation survey was conducted.Copernicus Sentinel data 43 provide global satellite imagery at spatial resolutions of ten meters for bands 2 (blue), 3 (green), 4 (red) and 8 (near infrared).Bands 4 and 8 can be used for the calculation of NDVI, an index of the greenness of vegetation and a proxy of vegetation density using the formula: The calculation is performed on each pixel within the satellite imagery using the open software provided by QGIS.org 44 .This research utilises Copernicus Sentinel data 43 collected on 01/11/2019 on sentinel tiles T55HFA, T55HFV, T55HGA and T55HGV.This date was chosen as it falls within the sampling period for this study's first data collection period.Using the open software QGIS.org 44 , values for NDVI were extracted at each data point using the sample raster values function, resulting in each point used to train the model having a corresponding NDVI value recorded at that location.
Using the open software QGIS (version 3.16.9), the distance (in metres) from each survey location to the nearest road was determined 44 .Roads included all named roads within NSW and farm tracks used to navigate the interior of properties individually added for each property surveyed; in addition, the role that highways may have played as a vector of spread was included in the RF as a separate predictor 45 .Using QGIS (version 3.16.9), the distance from each survey location to the nearest hydroline was obtained using the distance to nearest hub (line to hub) function.The Hydroline data includes surface water features such as lakes, rivers, creeks, canal drains, spillways, races, tunnel-siphons and connectors 46 .The values recorded are to the nearest meter Hydroline was subset only to include rivers, and the same methodology used to calculate the distance to hydroline was used.Mean annual temperature (°C) and rainfall (mm) data for the period 1976-2005 at the spatial resolution of 1 Arc-second were obtained 37 .One arc second is approximately 30 m but varies due to the earth's irregular oblate spheroid, or geoid, shape.The mean temperature and rainfall values were extracted at each survey point using the sample raster values function.Eragrostis curvula has been suggested to be present in the study area for at least 70 years; as such, it has spread over the area under a changing climate as a result of anthropogenic climate change 47 .As such, this research has chosen to use mean annual temperature data and monthly mean precipitation from 1976 to 2005 as it would best reflect the climatic conditions present in the region during a large time period of its establishment and spread 5,47 .For each site, the elevation (m) above sea level (asl) was determined using a one Arc-second resolution digital elevation model based on data collected in February 2000 Gallant et al. 38 .

Random forest model
Random Forest creates many (e.g.500) bootstrap samples, and the data that does not appear in each bootstrap sample is known as the out-of-bag (OOB) sample 35,36 .The bootstrapped samples are passed through a classification tree with randomised binary splits at each node with a random subset of the provided predictor variables 35,36 .The OOB sample from each bootstrap sample is then run through a classification tree, with each OOBs sample classification being recorded, with the final classification of a given sample being the majority vote of all of the classification trees 35,36 .We developed an RFM with binary response variables (hereafter RF P/A) for classifying the presence/absence of E. curvula based on seven predictor variables.The predictor variables for the RF P/A are elevation, distance to road, distance to highway, distance to river, distance to any watercourse, Sentinel 2 NDVI, mean temperature over 40 years and mean rainfall over 40 years.All predictor variables are continuous.
The data set for the RF P/A comprises 1292 observations, a combination of in-field measurements using field methods described above, totalling 1096 data points, and E. curvula presence data gained from the Atlas of Living Australia 11 , totalling 196 data points.The Atlas of Living Australia 11 data comprises 196 presence data points.The data set has 612 instances of E. curvula absence and 680 instances of E. curvula presence.
The RF P/A hyperparameter tuning was conducted using a grid search, where the researcher compared a combination of different hyperparameters chosen to determine the optimal model.As the RF P/A is created for prediction, the metric used to tune the hyperparameters is the error rate or the number of correct predictions the model performs on a data set 48 .

Statistical analyses
All mapping outputs were created using open-source GIS software version 3.16.9-Hannoverproduced by QGIS.org 44 .All model calculations were performed in the open-source R software Version 4.2.1 produced by RStudio Team 49 .The model was trained and evaluated using the train function inside the R package Ranger.Metrics to assess model performance are AUC, R squared for the out-of-bag samples (R squared OOB), pseudo R squared, Cohen's Kappa, sensitivity, specificity and F1 score 48 .Predictor variables of elevation and NDVI were found to be highly correlated (> 0.65) with other predictor variables in the model and thus removed as they had the lowest importance of the correlated variables when included in the RF P/A.Further, the ecological justification of retaining the correlated predictor variables of mean annual rainfall and mean annual temperature relates to the ecology of E. curvula as species tolerant of dry conditions but potentially intolerant to certain temperature extremes 6,10,50 .Due to the fluctuating nature of performance metrics when conducting RFM, the optimal RF P/A underwent 30 repetitions, with the performance metric being recorded from each interaction and allowing for a measure of spread using standard deviation.Predictions were generated using the predict function from the inbuilt stats package in R.

Presence/absence random forest model performance
The RF P/A for E. curvula has an AUC of 0.9995, R squared OOB of 0.67, and a pseudo-R squared of 0.96 (Table 1).The relatively low standard deviation of the model after 30 repetitions indicates the model is performing consistently across the 30 repetitions.The model had a very good agreement regarding Kappa with a value of 0.97, other performance metrics of sensitivity, specificity and F1 scores of 0.98, 0.99 and 0.98, respectively (Table 2).However, as Lantz 48 notes, the term very good agreement is accompanied by subjectivity.Comparing the pseudo-R squared, AUC and Kappa to the R squared OOB indicate that the model performs with high accuracy when classifying data that it was trained on but lost accuracy when classifying completely new data.Regardless, a random model that arbitrarily assigns points as having E. curvula present or absent would be expected to have an AUC of 0.5 and an R-squared OOB of 0, assuming a binary response and equal samples in the presence and absence dataset.All predictor variables had some influence on the RF P/A, with the removal of any one predictor increasing the mean error when permuted (Fig. 3).Temperature, distance from highway and distance from river were the most important predictor variables when minimising error in the RF P/A, respectively.The relationship between E. curvula and the three most important predictor variables does not appear linear when all other predictor variables are held constant, reflecting the complex interactions among predictor variables in predicting the distribution of E. curvula (Fig. 4).Variation in the response curve of the three most important predictor variables has minimal deviation from the general trend (Fig. 4).

Discussion
Adopting a modelling approach to predicting the potential presence and absence of E. curvula has the potential to significantly benefit efforts to suppress its spread throughout the region by assisting targeted, proactive management strategies.Our model is empirically based on physical and ecological parameters offering further value in identifying potential mitigation strategies to reduce or contain its spread.The predictor variables used in the RF P/A were chosen to be a balance between easily accessible data available for public download and ecologically relevant variables that would likely impact the distribution of an invasive species 6,10 .As such, the observed influence of individual parameters (and their interactions) gives functional insight into potential mechanisms that may be used to constrain spread.Figure 4 highlights the complex non-linear relationship between temperature, distance from highway and distance from river and E. curvula presence.Roozbeh Valavi et al. 28 conclude that flexible models, such as RF, can effectively predict a range of response variables in spatially distinct areas, such as those found in the SMRC.Roozbeh Valavi et al. 28 define a flexible model as a modal that can adapt to simple and complex data, fitting closely to the data when needed but avoiding overfitting by not capturing random noise.Tables 1 and 2 highlight high model performance, with a notable decrease in R squared (out of bag) compared to the other performance metrics.This notable decrease reflects the RF P/A limitations in modelling the presence or absence of E. curvula beyond the data it has been trained on.However, the RF P/A highlights the model's explanatory power due to the high Kappa and F1 scores.Specifically, the RF P/A produces few false positives and fewer false negatives (Table 2).Few false negatives are preferred in this context, as under the precautionary principle, it would be considered better to assume the presence of an invasive species, check and be incorrect rather than believe that there are no invasive species in an area, risk being incorrect and the invasive species continuing to spread 51 .Using a similar workflow, De Simone et al. 52 obtained an overall accuracy of 94.76% and a kappa of 88% when using RF for species distribution modelling of the invasive perennial grass Brachypodium genuense in an Italian grassland system.De Simone et al. 52 find slope, elevation and solar radiation to be the most important predictor variable of B. genuense distribution, as opposed to temperature, distance from roads and distance from rivers in our research.Differentiating this research from that of De Simone et al. 52 is that the selection the predictor variables used in this research were chosen based on their ecological relevance to E. curvula's occurrence 6,10,53 .The predictor variables of De Simone et al. 52 were chosen based on topographic variables available using the Sentinel-2 image database.Regardless, both De Simone et al. 52 and this research highlight the utility of RF in developing accurate species distribution models for modelling invasive species within grassland ecosystems.

Temperature as a predictor of Eragrostis curvula
The predictor variable of temperature was the most important variable in the RF P/A.Removing temperature as a variable from the model had the most detrimental effect on the RF P/A performance regarding error rate.Ngoy and Shebitz 20 found a similar association between E. curvula and temperature, where the mean annual temperature was one of the most significant predictors of E. curvula across New Jersey in the United States of America.The ecological relevance of temperature to plant growth relates to the optimal conditions a plant requires for the most efficient metabolic processes 54,55 .
Temperature has been identified as an important predictor variable in the RF P/A, but a simple linear relationship cannot be assumed due to the nature of RFM.For example, the sample point located at longitude 148.57and latitude −36.66 has an average annual temperature of 10.6 °C and has E. curvula present, while another sample point located at longitude 149.20 and latitude −36.29 also has an average annual temperature of 10.6 °C but does not have E. curvula present.Temperature in isolation is, therefore, not a reliable predictor in the RF P/A.However, when considered in the context of the other predictor variables, it becomes important in minimising error and maximising the accuracy of the RF P/A. Figure 4 highlights this complex relationship with no clear trend in the temperature response curve when all other predictor variables are held constant.A peak between 11 and 12 °C may indicate that E. curvula prefers this annual mean temperature within the study region.Roberts et al. 6,10 found that temperatures between 17/7 °C for E. curvula seed collected from Maffra, Tenterfield, and Shepparton, Australia, germination time and survival rate were lower than at higher temperature ranges when investigating the germination biology for four different populations of E. curvula.Roberts et al. 6,10 observed that temperature did not directly influence E. curvula seeds obtained from Midvale, Australia.The finding of Roberts et al. 6,10 provides further evidence to support the phenotypic plasticity of E. curvula and its ability to adapt to a wide range of environmental conditions.Thus, making E. curvula a species well adapted to spatial dispersal across landscapes if given appropriate dispersal corridors such as roads and waterways.

Roads as a predictor of Eragrostis curvula
The distance to highway was the next most important predictor variable in the RF P/A.Similar to temperature, due to the nature of RFM, the model does not make generalisations regarding the relationship between E. curvula and distance to highway.However, the ecological context suggests that as the distance from frequently used roads increases, the cover of invasive species such as E. curvula is likely to decrease, a trend supported by Fig. 4. Figure 4B shows a pattern of forecasting higher predictions of E. curvula present when the distance from highways is low, and all other predictor variables are constant.The distance to highway acts as a proxy to human dispersal and, by extension, a vector for E. curvula dispersal.Distance from human dispersal corridors, such as roads, consistently appears to influence invasive plant presence and density 16,28,30,35,44,[56][57][58] .Gelbard and Belnap 59 found that as road use increases, the cover and richness of invasive species also increase adjacent to these roads and decrease as distance increases from the road into the interior of the surrounding area.Hansen and Clevenger 60 suggest that corridor edges, such as roads and railway lines, are more prone to invasion as they may act as microhabitats or microclimates suitable for invasion recruitment.These microclimates are suitable

Limitations of the predicted presence and absence of Eragrostis curvula
Due to the large number of seeds E. curvula produces and its high germination rate, there will likely be other E. curvula plants within at least 1 hectare of any given plant 6,10 .Further, sites close to each other are likely to have more similar abiotic conditions than sites further away which is the basis of spatial interpolation 69 .Therefore, sites, where E. curvula was not observed when conducting the vegetation survey cannot be ruled out as incapable of supporting E. curvula, particularly if an E. curvula plant was observed within close proximity.Further, due to the relative course resolution of predictor variables of the climatic variable of temperature and rainfall, compared to the sampling spatial resolution, finer microclimatic variations are generalised over a larger area, potentially overlooking important fine scale influence on E. curvula distribution 70,71 .Despite several limitations to data collection imposed by COVID-19 travel restrictions and the Black Summer bushfires, our RF P/A showed value in outperforming the neutral random model and therefore has great potential to assist in targeted efforts for monitoring and managing E. curuvla presence.However, these limitations potentially influenced the availability and quality of data used to train the model due to the reliance on ancillary data, impacting the reliability of predictions.
Further, due to the intentional decision to use open-source, free-available data as predictor variables, there are inconsistencies in the spatial resolution of the data.Mishra et al. 72 show that increasing spatial resolution can result in higher classification rates of land use and land cover classes.The inconsistencies in the spatial resolution of the predictor variables could overinflate the importance of certain predictor variables.However, there is limited research on this as a factor influencing bias.

Conclusion
The predictor variables of annual mean temperature, distance from highway and distance from river were identified as the most important variables predicting the presence of E. curvula in the study area.These predictor variables align with previously identified predictor variables of E. curvula and other invasive plants relating to hydrochory and zoochory 6,10 .The RF P/A indicates conditions in which the presence of E. curvula is supported based on a range of open-access landscape predictor variables.However, the model's performance metrics based on extrapolating outside of the data it has been trained on should be interpreted cautiously.As such, the outcomes of our findings should be used as a tool as part of a larger toolbox rather than a rule used to manage E. curvula as part of integrated management strategies.However, the model has inherent limitations in high-resolution accuracy due to the decision to use easily accessible predictor variables with differing spatial resolutions.As more high resolution becomes freely available, further predictive models will utilise this and likely develop more precise and accurate models.
Furthermore, the RF P/A should be tested on data it has not been trained on and compare ground-truth observational data with predictions generated by the RF P/A.Regardless of how accurate predictive models become, community communication is necessary so properties without E. curvula neighbouring properties with E. curvula can act accordingly to put management strategies in place to minimise the risk of further spread, and the outcome from the RF P/A can be utilised to inform those at most risk.The RF P/A produced may aid community members in understanding the possible extent of E. curvula based on their location in the landscape and highlight the areas predicted to be the most susceptible to E. curvula presence.This research highlights the necessity for more detailed mapping of the current distribution and density of E. curvula across the Snowy Monaro to prevent the spreading from high-density E. curvula areas to low-density E. curvula areas.If such a map existed in conjunction with the provided output from the RF P/A, then concentrated survey and management efforts could be focused on the most at-risk areas.

Figure 1 .
Figure 1.Map of the study locations within the Snowy Monaro Local Government Area.The blue crosses indicate areas surveyed that did not have Eragrostis curvula present, while the green crosses indicate areas where E. curvula was absent.Temperature (°C) is represented as the annual mean temperature between 1976 and 2005 42 .

Figure 2 .
Figure 2. Schematic of the two methods used to collect data for one of the 21 properties used in this study.(a) The random points within polygons using the polygon method.Each quadrat in (a) is at least 20 m apart from another quadrat.The site polygons are at least 5 hectares in area and were used to bound the random placement of the quadrates within.(b) The transect method with quadrate spaced 5 m apart along a 100-m transect.The transects in (b) run perpendicular to roads or tracks within the property.

Figure 3 .
Figure 3. Variable importance plot showing the increase in mean error computed on the out-of-bag data across trees when predictors are permuted during the development of the random forest.

Figure 4 .
Figure 4. Response curves of the random forest model showing the top three important predictor variables derived from the model, (a) mean annual temperature (°C) 1976-2005, (b) distance from highway (m) and (c) distance from river (m).The bold black line indicates the response of E. curvula presence as a result of changes in the x-axis predictor variable, with all other predictor variables held at their 50% quantile.The thin grey lines show the variation of the response of E. curvula to the x-axis predictor variable after 30 repetitions.

Table 1 .
Performance metrics used to assess random forest performance.

Table 2 .
Confusion matrix of the random forest binary classification model predicting presence (1) and absence (0).