Data-derived metrics describing the behaviour of field-based citizen scientists provide insights for project design and modelling bias

Around the world volunteers and non-professionals collect data as part of environmental citizen science projects, collecting wildlife observations, measures of water quality and much more. However, where projects allow flexibility in how, where, and when data are collected there will be variation in the behaviour of participants which results in biases in the datasets collected. We develop a method to quantify this behavioural variation, describing the key drivers and providing a tool to account for biases in models that use these data. We used a suite of metrics to describe the temporal and spatial behaviour of participants, as well as variation in the data they collected. These were applied to 5,268 users of the iRecord Butterflies mobile phone app, a multi-species environmental citizen science project. In contrast to previous studies, after removing transient participants (those active on few days and who contribute few records), we do not find evidence of clustering of participants; instead, participants fall along four continuous axes that describe variation in participants’ behaviour: recording intensity, spatial extent, recording potential and rarity recording. Our results support a move away from labelling participants as belonging to one behavioural group or another in favour of placing them along axes of participant behaviour that better represent the continuous variation between individuals. Understanding participant behaviour could support better use of the data, by accounting for biases in the data collection process.


-Effect of active day threshold
Some of the metrics we introduce cannot be reliably calculated for participants who have submitted only a small number of records. We therefore started our analysis by removing all participants of iRecord Butterflies who were active on 10 or fewer days over the four years covered by the dataset. An active day is defined as a day on which a reported butterfly was observed, even though the report may be submitted on a different day. How many active days are required to get an accurate estimate of these metrics will depend on a number of factors including the number of taxa in to group being recorded, and variation in behaviour over time. We selected 10 active days since broadly matches the 'dabblers' group identified by Boakes et al 2016 both in terms of observations made per individual (6.1), and proportion of participants allocated to this group (84%). S1 -Recorder metric estimates plotted against the number of active days. The red dotted horizontal line indicates 10 active days. Periodicity and periodicity variation were not calculated for recorders with fewer than 5 active days. Note that the vertical axis is on the log scale.
Here we show what metrics look like for participants with fewer than 10 active days (figure S1). When the number of active days, and by definition number of observations, is low we see a large increase in the number of extreme values. This makes intuitive sense since the sample size for making estimates is small. As a result metrics that are attempting to average, or identify variability, are more likely to have extreme values. For the metrics activity ratio (activity_ratio), rarity recording (median_diff_rarity), periodicity (periodicity) and periodicity variation (periodicity_variation) using a cut-off of 10 active days reduces the range of values produced by approximately 50%.
Additionally we tested the sensitivity of our findings to changes in the threshold used. We reanalysed our data using thresholds of 7 and 15 (approximately a third higher and lower than 10). We found that the results of the principal component and therefore the definition of our axes of recorder behaviour were unchanged. Figure S2 The results of principal components analysis using a threshold number of active days of 7, 10 and 15. Principal components analysis undertaken using the three sets of participant metrics: temporal, spatial and data content. Symbols represent the clusters identified via k-means clustering, but which have low support across all analyses. Note that the results for each threshold are essentially the same. Though the PCA axes in some cases are reversed this does not affect the interpretation of the results. The figures used under the heading of the 10 active day threshold are the same as those used in figure 3 in the manuscript.

-Principal components analysis of all metrics in a single analysis
Here we present the results of a principal components analysis of all metrics (temporal, spatial and data content), together in a single analysis. This used the same approach as described in the manuscript, other than the combination of all metrics in a single analysis. The top four axes in this combine analysis describe 82% of all the variation (table S1).  Table S1 -The results of a principal components analysis of all temporal, spatial and data content metrics. Since this PCA contains 10 metrics it can be hard to create a clear interpretation of the meaning of each of the principal components (figure S1). To aid comparison we examined the four principal components that describe the greatest variation with the four axes presented in the main body of the manuscript. If the same four interpretations for these four axes are found in both analyses it supports the conclusion that our choice to analyse the three sets of metrics in isolation did not affect the outcome of the analysis.  The first principal component is positively related to activity ratio, weekly activity, and the proportion of taxa recorded, and negatively related to periodicity and periodicity variation. This is very similar to the recording intensity axes identified in the PCA of temporal metrics (figure 3a), but with the addition of the data content metric 'proportion of taxa recorded'. The second principal component is driven by the three spatial metrics in the same configuration as in the PCA of spatial metrics in isolation, meaning that this can be interpreted as spatial extent in the same way as in the main body of the paper (figure 3b) The third principal component features all three of the data content metrics and the pattern of their loading is the same to the recording potential axis in the data-content-only PCA (figure 3c). The fourth axis is very strongly driven by only two data content metrics; rarity recording and single-species lists. This is the same as the result for the second principal component in the data-content-only analysis where, as here, proportion of taxa recorded has very little effect. The fourth axis in the combined analysis can there for be interpreted as rarity recording, as it is in the data-content-only analyses (figure 3c). All four axes in this combined analysis can be mapped clearly onto the four axes identified in the PCAs of each set of metrics in isolation. The drawback of this combined approach is that some principal components are being driven, by a lesser extent, by the other metrics. For example, in this combined analysis the first principal component has seven metrics with absolute loading values between 0.3 and 0.4. While the interpretation is similar to that in the temporal-metrics-only PCA presented in the body of the paper, the addition of other metrics with small, but not insignificant loading values, makes the interpretation less clear-cut. When using an absolute loading value cut-off of 0.35, only one metric is found in a principal component with metrics from another set (proportion of taxa recorded [data content metric] in principal component one [temporal metrics]). This generally supports our a prori hypothesis that metrics in one group (temporal, spatial, or data content) were not dependent on any other group.

-Estimating recorder metrics for spatial subsets
The metrics presented in main manuscript use all data available for their estimation. There are cases however where sub-setting the data prior to the estimation of metrics could be warranted. For example if you are calculating metrics for a citizen science project that spans Europe or the India then the species composition will vary significantly spatially and recorders are unlikely to have the ability to sample in all locations. As a consequence the rarity recording metric and the proportion of taxa recorded metric might not be suitable for your purpose.
To account for this we suggest spatial sub-setting and outline how this can be done here. In the case of Europe, we might decide to subset by country, since this will be the primary barrier to movement. In India where barriers are less discrete we might instead use buffers. We will demonstrate both methods here using our case study data from the UK.

Methods
The methods are straight forward. We simply subset the data prior to calculating metrics. In the case of country estimates we subset the data by country and then estimate the metrics for all the users with records in that country. If a recorder records in more than one country you will have estimates for each country they have recorded in. Using a buffer we draw a line a set distance around a recorders observations and use this to select data from all recorders that fall within the buffer and use this data to estimate the metrics

By Country
Here we will calculate the rarity recording metric by country for our example data.

plot(GB)
I have downloaded the shape files for the countries that make up Great Britain and will be using these for sub-setting. You can download similar boundaries from here: https://www.naturalearthdata.com/downloads/10m-cultural-vectors/ Next we convert our data into a spatial object.

plot(SP)
Now we loop though each country in my shape file (England, Scotland, Wales), and calculate the metrics for all recorders with records in those countries It is important to note that in this data set there are 75 recorders who have recorded in more than one country and as a result they appear more than once in our results.

# Empty object for all results
When we plot the data for the country estimates versus the global estimates we can see the difference this makes.

# Run the metric for all recorders globally
SR_all <-lapply(unique(cit_sci_data$recorder), FUN = speciesRarity, data = cit_sci_data, sp_col = 'species', recorder_col = 'recorder') # summarise as one The size of the buffer is very important and if you were to use this method we strongly suggest that you examine carefully the size of buffer you are going to use. If the buffer is too large then you will not account for the spatial variation that you are trying to account for. If the buffer is too small then there will be little other data within the buffer resulting in estimates that are primarily influenced by the recorders observations.
To demonstrate this we re-ran this analysis using 5 different buffer sizes # Test a number of different buffer sizes (km) buffer_sizes <-c(5, 10,30,50,100,200) for (