Introduction and background

Over the last few years we have witnessed a large growth in the capabilities and applications of artificial intelligence (AI) in citizen science (CS), a broad term referring to the active engagement of the general public in research tasks across several scientific fields, including astronomy and astrophysics, ecology and biodiversity, archeology, biology, and neuroimaging (Vohland et al., 2021). CS is an expanding field and a promising arena for the creation of human-machine systems with increasing computational abilities, as several CS projects generate large datasets that can be used as training materials for AI applications, such as machine-learning (ML) models and image and pattern recognition (Lintott and Reed, 2013). The integration of humans and AI applications can help process massive amounts of data more efficiently and accurately and monitor the results. Examples of AI applications in CS include image recognition, as in iNaturalist (Van Horn et al., 2018) and Snapshot Serengeti (Willi et al., 2019); image recognition and classification to map human proteins in cells and tissues, as in the Human Protein Atlas (Sullivan et al., 2018); and consensus algorithms to locate or demarcate objects, as in Galaxy Zoo (Willett et al., 2013) and the Koster Seafloor Observatory (Anton et al., 2021). The integration of humans and computational technologies opens up new ways of collaboration between the two, but it also raises questions about the distribution of tasks and about how humans and this type of technology can complement each other to expand their respective skills.

As AI grows “smarter”, people become increasingly concerned with being replaced in many domains of activity. The opportunities and risks that digital technologies, including AI, pose for work have been a long-standing topic of inquiry across research disciplines and in policy documents. In a recent report for the European Parliament, the Panel for the Future of Science and Technology (STOA) (2021) pointed out that technology impacts the distribution of tasks within jobs; just as technology may help to improve skills and raise the quality of work, it can also result in deskilling and in creating low-paid, low-autonomy work. Importantly, while technology can help to preserve work, it can also negatively impact the qualitative experience of work (Panel for the Future of Science and Technology (STOA), 2021). In this respect, during a discussion panel held at the 3rd European Citizen Science 2020 Conference and aimed at initiating a dialog on how citizen scientists interact and collaborate with algorithms, participants expressed concern about the possible negative impact of AI on the qualitative experience of participation in CS. As mentioned during that discussion, the current rapid progress in ML for image recognition and labeling, in particular the use of deep learning through convolutional neural networks (CNN) and generative adversarial networks, may be perceived as a threat to human engagement in CS. If computational technologies can confidently carry out the tasks required, citizen scientists may feel that there is no space for authentic engagement in the scientific process (Ponti et al., 2021). This concern points to the tension that arises when “designing a human-machine system serving the dual goals of carrying out research in the most efficient manner possible while empowering a broad community to authentically engage in this research” (Trouille et al., 2019, p. 1).

Why task distribution matters and aim of the paper

Task distribution between humans and machines has always been a crucial step in the design of human-machine systems and a main topic of research in human-automation interaction (e.g., Dearden et al., 2000; Hollnagel and Bye, 2000; Sheridan, 2000). Considered an “evergreen” topic, task allocation has been covered by a large body of literature in different fields, including cognitive engineering, human factors, and human-computer interaction, but it continues to be an important area for research on automation (Janssen et al., 2019). A prominent approach used for years to decide which tasks are better performed by machines or by humans has been the HABA-MABA (“Humans are better at, Machines are better at”) list first introduced by Fitts (1951). This list contains 11 “principles” recommending which functions are better performed by machines and should therefore be automated, and which functions should be assigned to humans. Although researchers differ in what they consider appropriate criteria for task allocation (Sheridan, 2000), the influence of Fitts’s principles persists today in the human factors literature. De Winter and Dodou (2014) conclude that Fitts’s list is still “an adequate approximation that captures the most important regularity of automation” (p. 8). Given researchers’ primary interest in distributing tasks between humans and machines so as to maximize the efficiency and speed with which a given goal is achieved (Tausch and Kluge, 2020), this conclusion would arguably hold. However, as Tausch and Kluge (2020) noted, we need more research on task distribution in order to make decisions that allow not only an optimal allocation but also a rewarding experience for humans. This aspect is important in CS projects, as they rely on volunteer engagement, and concerns have been raised over the potential of AI to disengage citizen scientists: the use of AI can result in a reduction in the range of possible volunteer contributions, or in the allocated tasks becoming either too simple or too complex (Trouille et al., 2019).

While task distribution to participants in citizen science projects has been studied by Wiggins and Crowston (2012), task distribution between experts, citizen scientists, and AI computational technologies (hereinafter also referred to interchangeably as “computational technologies”) does not appear to have been investigated. Therefore, we present a literature review to illustrate the current state of the distribution of tasks between humans and computational technologies in CS. We used an adapted version of the framework developed by Franzoni and Sauermann (2014) to analyze the results and highlight the differences in the nature of the task and the skills contributed by humans and by this type of machine to perform those tasks. Through the analysis, we answer the following questions:

1. What tasks do citizen scientists, experts, and computational technologies perform to achieve the goals of citizen science projects?

2. What type of skills do citizen scientists, experts, and computational technologies need to perform their tasks?

3. In which activities do citizen scientists, experts, and computational technologies, respectively, perform their tasks?

We now clarify the terms used in these questions. We use Theodorou and Dignum’s (2020) definition of AI computational technologies as systems “able to infer patterns and possibly draw conclusions from data; currently AI technologies are often based on machine learning and/or neural networking-based paradigms” (p. 1). This definition is appropriate here because our review focuses on technologies based on machine learning and/or neural networking-based paradigms. For a proper understanding of the term “task”, we refer instead to Hackman’s (1969, p. 113) definition of a task as a function assigned to an individual (or a group) by an external agent, or one that is self-generated. A task includes a set of instructions that specify which operations need to be performed by an agent on an input, and/or what goal is to be achieved. We use Hackman’s (1969) conceptualization of tasks as a behavior description, that is, a description of what an agent does to achieve a goal. This conceptualization applies to both humans and machines performing tasks; the emphasis is placed on the reported behavior of the agent performing the task. Regarding the activities included in our analysis, we discuss the tasks performed within activities of the research process such as data collection, data processing, and data analysis (McClure et al., 2020), as CS projects typically involve these activities. Finally, we use the term “expert” in a broad sense, to include not only professional scientists but also persons responsible for developing algorithms and running the projects.

The contribution of this paper is threefold: (1) it provides scholars studying CS and human computation with a synthesis of results offering descriptive insights into the distribution of tasks in CS; (2) it suggests broader implications for how we think about work in general and how we organize work involving “non-experts” and computational technologies; and (3) it points to important questions for future research.

The paper is organized as follows. We first describe the methodology used for collecting and assessing the reviewed papers. We then present the framework used for our analysis of the results. Building on this framework, in the Discussion section we propose a matrix to classify CS projects on the basis of the distribution of tasks between humans and computational technologies. We also reflect on the role of citizen scientists borrowing from a labor perspective. The last section presents conclusions from this study and points to future research.

Methods

Strategies for searching, screening, and assessing the literature

To answer our questions, we conducted an integrative literature review, a subcategory of systematic reviews (Torraco, 2016). Integrative reviews follow the principles of systematic reviews to ensure systematization and transparency in searches, screening, and assessment processes, but they allow more flexibility regarding selection and inclusion of literature. The strategy for searching, screening and assessing the literature followed the systematic approach of the PRISMA reporting checklist (Fig. 1).

Fig. 1 PRISMA flow diagram for paper selection.

The review corpus was sourced from the Web of Science, SCOPUS, and the Association for Computing Machinery (ACM) Digital Library—the single largest source of computer science literature. These three databases are well-established, multi-disciplinary research platforms covering a wide variety of peer-reviewed journals, and they are updated regularly. We used them because they constitute a baseline for the search of published peer-reviewed papers. However, we are aware that, in the case of citizen science publications, the true extent of the literature may be larger, as studies can be published in non-peer-reviewed sources that are not indexed in these three databases. We did not include preprints.

We employed two search procedures. In the first search procedure (Table 1), we searched for papers containing “citizen science” and “artificial intelligence” or “machine learning” in the title, abstract, and keyword sections. However, after a brief initial scanning of the resulting papers, we added several other search terms to include the most widely used computational technologies, such as supervised learning, unsupervised learning, reinforcement learning, reinforcement algorithm, deep learning, neural network(s), and transfer learning. The search was limited to papers written in English and published until July 2020. We collected a total of 170 papers across the three databases, 99 of which were unique. Table 1 summarizes the search terms used and the number of results per database during search procedure 1.

Table 1 Search procedure 1.

The initial examination of these papers revealed that the chosen search strategy did not fully cover papers focused on citizen science games, because some authors use game titles in the abstracts instead of referring to the games as ‘citizen science’. Therefore, a second search procedure was employed, using the same terms related to AI and ML as in the first procedure together with the titles of 36 citizen science games (Baert, 2019). The search was performed in two steps. First, the game titles ‘Neo’, ‘Turbulence’, and ‘The Cure’ were excluded, because they generated many false-positive results not related to citizen science games. In the second step, to locate the articles that discuss the ‘Neo’, ‘Turbulence’, and ‘The Cure’ games, an additional search was run using these titles together with ‘game’ as search terms. The search in the ACM Digital Library did not produce any results, so we excluded this database from the table. The searches were limited to articles written in English and published until July 2020. Table 2 contains the search strings employed and the number of results generated in both steps of the search procedure. The searches produced 28 results, 20 of which were unique; of these, 17 were not already covered by search procedure 1.

Table 2 Search procedure 2 (citizen science games).

Exclusion criteria

A single reviewer parsed all the retrieved papers based on their titles, abstracts, keywords, and, if necessary, full-text reading. Since we aimed to focus on papers reporting on projects integrating humans and computational technologies, as these were more likely to describe how tasks were distributed, we applied several selection criteria to filter out irrelevant papers. We excluded papers that provided no significant focus on, discussion of, or application of computational technologies in CS. Studies excluded from the review included, for example, a comparison of the results of citizen scientist classifications with the results of computational models trained on expert-produced data (e.g., Wardlaw et al., 2018), and an overview of projects on Zooniverse with some description of how ML is used in these projects (Blickhan et al., 2018).

Table 3 presents the exclusion criteria and the number of excluded papers. As a result of the selection process, 50 papers were selected for the review (the list of all the included and excluded papers with reasons for why they were excluded is in Supporting Information 1, available at https://zenodo.org/record/5336431#.YSyzFtOA5EJ).

Table 3 Exclusion criteria for procedure 1 and 2 and number of excluded papers.

Analysis and synthesis strategy

The process of searching for relevant studies, filtering, data extraction, and synthesis took place from April 1st to September 30th, 2020. To ensure consistency in the reporting, we used a spreadsheet for the 50 articles to note the author(s) and title of each article, the publication year, the source, and the research field and type of the CS project. In addition, for each article we annotated the aim of the article, the computational technologies used, and the tasks assigned to citizens and experts (these extracted data are in Supporting Information 1). These annotations provided a preliminary overview of the aspects relevant to addressing our research questions.

The framework

We used an adapted version of the conceptual framework developed by Franzoni and Sauermann (2014) to characterize citizen science projects with respect to two main dimensions: (a) the nature of the task outsourced to a crowd, and (b) the skills that crowd participants need to perform the task. We generalize this framework in order to map the tasks performed not only by the crowd but also by experts and AI computational technologies. We now describe the two dimensions in more detail.

Nature of the task

Franzoni and Sauermann describe this dimension in an aggregate sense at the level of the crowd, with each individual making distinct contributions to the overall task by performing specific subtasks. They subsume under “nature of the task” two related attributes: the complexity of the task and the task structure. In the context of CS projects, Franzoni and Sauermann define task complexity as the degree of interdependency between the individual subtasks that participants perform when contributing to a project. In the simplest case, subtasks are independent of each other: contributors can perform their tasks in parallel, and the task outputs can simply be pooled at the end to generate the final output. The authors give the example of Galaxy Zoo, where the correct classification of an image does not depend on the classifications of other images. However, when dealing with complex and interdependent tasks, the best solution to one subtask depends on other subtasks, so contributors must consider other contributions when working on their own subtask.

Franzoni and Sauermann define task structure as the degree to which the overall task outsourced to participants is well-structured, with clearly defined subtasks, or ill-structured, with specific subtasks not clearly defined from the start. Ill-structured tasks are said to provide fewer opportunities for the division of labor. The two attributes are typically highly correlated, and we follow Franzoni and Sauermann in using a single dimension ranging from independent/well-structured to interdependent/ill-structured.

Skills needed to perform the task

Franzoni and Sauermann distinguish three types of human skills: (a) common skills held in the general population, e.g., recognizing the shape of a plant; (b) specialized and advanced skills that are less common and not related to a specific scientific domain, e.g., specialized knowledge of certain technical tools, and (c) expert skills that are specifically related to a given scientific or technological domain, e.g., expert knowledge of mathematics.

This distinction is applicable to humans in general, including citizens and professional scientists. To be able to apply this framework to computational technologies as well, we expand the categorization to include different types of machine skills, in particular classification and prediction. We define them as the abilities to perform a task that computational technologies acquire through training. We characterize these skills as being generated through programs of action, consisting of goals and intentions, delegated by developers to technologies (Latour, 1994).

This framework forms a two-dimensional space that can categorize the tasks performed by citizen scientists, experts, and computational technologies, and the skills that humans and computational technologies need to perform their tasks. We link these two dimensions to three main activities in the research process in which citizen scientists, experts, and computational technologies perform their tasks: data collection, data processing, and data analysis. We define them here as follows. Data collection refers to the acquisition and/or recording of data. Data processing refers to actions aimed at organizing, transforming, validating, and filtering data into an appropriate output form for subsequent use (for example, as a training set for an ML model). Data classification and data validation are considered data processing actions. Data analysis refers to actions performed on data to describe facts, detect patterns, develop explanations, test hypotheses, and predict the distribution of certain items. Species modeling is considered here a data analysis action. We refer to data modeling as a process of analyzing data (for example, on species distribution) and their correlated variables (for example, bioclimatic variables) to identify areas where sensitive species are likely to be present or absent.

The choice of the two dimensions deserves consideration. Our argument is that the nature of the task and the skills required are fundamental, not just for mapping the distribution of tasks, but also for understanding how tasks are allocated to citizens versus experts versus computational technologies. By mapping an array of papers, we expect to observe certain characteristics. For example, we can expect that tasks performed by computational technologies will fall toward the more interdependent end of the task dimension, but it would also be interesting to see whether these technologies mostly perform well-structured tasks, and at what level of interdependence, while humans also take on ill-structured tasks.

Results

The results section is divided into two parts. First, we provide a descriptive overview of the dataset, including some basic characteristics of the reviewed publications and the fields of the reported citizen science projects. Second, we organize the results around the three review questions of the study, concerned with: the nature of the tasks performed by citizens, experts, and computational technologies; the skills needed by both humans and computational technologies to perform their tasks; and the activities in which citizens, experts, and computational technologies, respectively, perform their tasks. The types of skills were mostly inferred from the descriptions of the tasks, as the reviewed papers generally do not include explicit statements about this dimension. (Supporting Information 2 provides an overview of the tasks and the related references.)

Overview of the dataset

The reviewed papers were published between 2011 and 2020, with the majority (35 out of 50) published between 2018 and 2020 (Supporting Information 1 contains the list of the 50 papers). The increasing interest in combining AI computational technologies and citizen science is also evident from the growing diversity of research fields with which the described CS projects are associated. The review covered a considerable variety of citizen science projects (n = 42), with some papers reporting on data from several projects. The three areas that attracted the most attention across the whole timespan are astronomy and astrophysics (e.g., Galaxy Zoo, Gravity Spy, and Supernova Hunters), biology (e.g., EteRNA, EyeWire, and Project Discovery), and ecology and biodiversity (e.g., eBird, Bat Detective). However, starting from 2017, we observe a larger variation that includes archeology (e.g., Heritage Quest and field expeditions), neuroimaging (Braindr.us), seismology (the MyShake app), and environmental issues (recruiting volunteers to measure the quality of air or water). Table 4 shows the distribution of the reviewed papers per research area from 2011 to 2020.

Table 4 Distribution of the reviewed papers per research area from 2011 to 2020.

Citizen scientists: nature of the tasks performed, skills needed, and activities

The two main categories of tasks performed by citizen scientists are collecting data and classifying observations. Other tasks include generating new taxonomies, validating the algorithm classification results, solving in-game puzzles, and going through training.

Table 5 provides an overview of the main tasks performed by citizen scientists and the skills they need.

Table 5 Tasks performed by citizen scientists and skills needed across the activities—examples from the literature.

Data collection

Data collection refers to a set of tasks widely assigned to citizen scientists in the areas of ecology, biodiversity, and environmental monitoring. Delegating the collection of data to volunteers allows researchers to map geographical distributions of species and spatial variation with unprecedented scope and detail, which is especially relevant when monitoring by researchers alone is not feasible or efficient enough. The most common types of data contributed by volunteers are photos of plants or animals, accompanied by some contextual information (such as location and date/time of observation) and sometimes by a description (e.g., Derville et al., 2018; Capinha, 2019). Less common types are videos and audio recordings (e.g., Zilli et al., 2014; Hardison et al., 2019). These observations were often accompanied by a species classification, as citizen scientists were asked to submit observations of a particular species (e.g., Jackson et al., 2015). Alternatively, volunteers submitted observations that they classified with the help of an instrument, e.g., the eBird app (Curry et al., 2018), where the mobile app's suggestions were used but no photo was attached. Several papers also reported on citizens sending a specimen to researchers, e.g., bee trap nests (Kerkow et al., 2020; Everaars et al., 2011).

Another type of task is relatively passive data collection that does not require analysis on the part of the citizens. Lim et al. (2019) and Adams et al. (2020) reported on projects aimed at sampling air quality: volunteers were equipped with AirBeam sensors and asked to sample several routes by walking or cycling along them. Winter et al. (2019) presented an Android app that identifies and classifies charged particles detected by camera image sensors. The only task outsourced to the citizens in this case was installing the app.

Overall, the reported tasks involve a low level of complexity, as taking good photos of animals or plants, or collecting quality data on air pollution, does not depend on the data collected by other volunteers. Regarding the background skills needed to perform these tasks, citizen scientists seem to need general/common skills of the kind required in routine tasks. However, they may require some training, which was sometimes done face-to-face when citizens were asked to collect specific types of data in the field (Hardison et al., 2019). In other cases, this training occurred mainly online, as citizens went through guidelines prepared by project authors (Keshavan et al., 2019). The training could also be guided, facilitated, and assessed using ML algorithms (Zevin et al., 2017). While it is likely that all citizen scientists go through some kind of training, not all of the reviewed papers included such information.

Data processing

The second most common set of tasks performed by citizen scientists relates to image analysis and includes classifying images into predefined categories, describing objects by choosing all relevant categories from a predefined list, and identifying and counting objects. The research fields setting up citizen science projects to outsource these tasks to volunteers include astronomy and astrophysics, ecology and biodiversity, archeology, biology, and neuroimaging. The tasks are performed in web interfaces: the majority of the projects run on the Zooniverse platform, but there are also separate initiatives such as the Braindr.us website (Keshavan et al., 2019) and Project Discovery, implemented in the EVE Online game (Sullivan et al., 2018). Allocating classification tasks to citizen scientists is often motivated by the extremely large size of currently available datasets, which makes expert classification unfeasible. The projects leverage the human ability for pattern recognition and benefit from the scope of citizen science projects. The resulting classifications constitute training datasets for computational analysis.

Citizen scientists classify objects from images into predefined categories. This can be a binary classification task, e.g., citizens decided whether a supernova candidate was a real or a ‘bogus’ detection (Wright et al., 2017; Wright et al., 2019). Alternatively, there could be a larger number of categories. For example, four studies reported on the Gravity Spy project, where users were presented with spectrograms and asked to classify glitches into predefined categories according to their morphology (Bahaadini et al., 2018; Crowston et al., 2020; Jackson et al., 2020; Zevin et al., 2017). Another task performed by citizen scientists involved describing an object in an image using a set of predefined characteristics. Examples include describing circumstellar debris disk candidates (Nguyen et al., 2018), classifying protein localization patterns in microscopy images (Sullivan et al., 2018), and the morphological classification of galaxies (Jiménez et al., 2020; Kuminski et al., 2014; Shamir et al., 2016). Lastly, the projects benefiting from citizen scientists identifying and counting objects asked citizens to identify and locate animals of particular species (Bowley et al., 2019; Torney et al., 2019), mark potential archeological sites (Lambers et al., 2019), and identify and locate Moon craters (Tar et al., 2017) and interstellar bubbles in images (Beaumont et al., 2014; Xu and Offner, 2017). Kim et al. (2014) reported on the EyeWire game project, where players contribute to mapping the 3D structures of retinal neurons by coloring the area that belongs to one neuron, while avoiding coloring other neurons, on a 2D slice image.

These types of task involve a low level of complexity, as the correct classification of an object does not depend on the classifications of other objects. To perform this type of classification, citizen scientists need common skills, such as identifying and counting objects. More specialized skills, such as good observation skills, can be required to perform tasks related to generating new taxonomies of objects. Coughlin et al. (2019) reported that Gravity Spy volunteers not only classified spectrograms into already known classes of glitches but also suggested new classes, aided by computational clustering of morphologically similar objects.
Citizen scientists also validated algorithm classification results (Kress et al., 2018) or object detection results (Lambers et al., 2019). An example is the Leafsnap project, where citizens submitted photos of leaves and, if the shape-matching algorithm did not classify the plant with a high enough probability, were offered several options to choose from (Kress et al., 2018). In the field of archeology, citizen scientists together with heritage managers and/or academic researchers participated in field expeditions to validate archeological objects detected by algorithms in remotely sensed data (Lambers et al., 2019).

Data analysis

In the reviewed papers on games, citizens can perform tasks that differ substantially from the tasks performed in other projects (such as classification). An example is reported by Koodli et al. (2019) and Lee et al. (2014), who discussed the EteRNA project, where players solve two-dimensional puzzles to design sequences that can fold into a target RNA structure. Lee et al. reported that EteRNA volunteer players outperformed previous algorithms in discovering RNA design rules.

Experts: nature of the tasks performed, skills needed, and activities

The tasks performed by experts are the most varied. They include collecting and processing the original data before it is presented to volunteers or algorithms, creating gold-standard datasets, processing and curating the data collected or classified by citizen scientists, and preparing the training datasets for ML models. Several tasks are related to recruiting, training, and supporting volunteers. Finally, researchers are involved in the evaluation and validation of results. It is important to note that some tasks performed by researchers may not be discussed in detail in the papers, either because they occur naturally in every project or because they are not relevant to the discussion. Therefore, this section outlines only those tasks that are discussed in sufficient detail.

Table 6 provides some examples of the main tasks performed by experts and the skills they need.

Table 6 Tasks performed by experts and skills needed across the activities—examples from the literature.

Data collection

Several studies on biodiversity reported on researchers collecting observation data of species occurrence in the field (Derville et al., 2018; Jackson et al., 2015; Zilli et al., 2014). Researchers also obtained pre-classified data from external sources, such as the records of ladybirds sourced from the UK Biological Records Centre (Terry et al., 2020). These observations together with observational data collected by citizen scientists were further used to train and test computational technologies. When ML methods were used to predict species distribution or environmental conditions (e.g., coral bleaching), researchers were also responsible for sourcing data related to the characteristics of the environment. Examples of such data were mean temperature and precipitation (Capinha, 2019; Jackson et al., 2015), and geospatial data including roads and types of land usage (Lim et al., 2019).

Researchers involved in the development of citizen science projects were responsible for recruiting, training, and supporting volunteers. In those projects where volunteers were asked to collect data in a specific location (e.g., air quality measurements along certain routes, or coral bleaching measures on specific beaches), researchers recruited volunteers and performed face-to-face training (Adams et al., 2020; Hardison et al., 2019; Kumagai et al., 2018). When citizen participation was not bound to a particular space, volunteers received written guidelines (Bowley et al., 2019; Torney et al., 2019; Wright et al., 2017). Supporting user motivation and engagement was another task performed by researchers. Examples include ensuring that volunteers were involved in real classification tasks that led to the advancement of the project (Crowston et al., 2020). In projects that required volunteers to collect observations in the field, researchers followed up on citizens’ contributions (Jackson et al., 2015; Kerkow et al., 2020), and provided online support and feedback (Lambers et al., 2019).

The tasks performed by experts in data collection generally require specialized skills, for example to train citizens to use bespoke technologies, as in the case of sampling a toxic microalga (Hardison et al., 2019), or to source data on environmental characteristics (e.g., Jackson et al., 2015).

Data processing

The data provided by citizen scientists, be it observations or classifications, were processed and curated by researchers. Observations provided by citizen scientists, such as cicada call recordings or ladybird records, were classified by field experts before being used by an ML algorithm (Terry et al., 2020; Zilli et al., 2014). In some projects, original data obtained from cameras or sensors were preprocessed by researchers before being presented to citizen scientists. For example, the audio recordings from bat observations were split into short sound clips and converted to spectrograms for the Bat Detective project (Mac Aodha et al., 2018), while in the Serengeti Wildebeest Count project, images from trap cameras were filtered to remove the empty ones and thereby reduce the number of images for citizen scientists to classify (Torney et al., 2019).

Other related tasks included processing citizen scientist contributions (e.g., returned bee nests) for further analysis (Everaars et al., 2011; Kerkow et al., 2020); deciding on the number of volunteer votes required before the final classification label for an image was generated and used to train or test ML algorithms (Sullivan et al., 2018; Wright et al., 2019; Wright et al., 2017); and choosing a limited amount of volunteer-produced data for training an algorithm (Koodli et al., 2019). Preparing the training dataset for ML also included tasks such as generating pseudo-absences when the information provided by volunteers only indicates observed presences (Jackson et al., 2015), generating synthetic observations of bubbles in dust emission to improve ML classification (Xu and Offner, 2017), and augmenting the training dataset by transforming existing images to increase the accuracy of ML classification (Xu and Offner, 2017). Performing the initial training of algorithms, or calibrating and fine-tuning machine-learning performance, involves a high level of complexity, because these tasks depend on the classifications of data done by citizens or experts.
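To make the pseudo-absence step concrete, the sketch below shows one common way it might be implemented: sampling random background points within the study area while keeping a minimum distance from known presences. This is a generic illustration, not the procedure used by Jackson et al. (2015); the coordinates, bounding box, and buffer distance are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical presence records reported by volunteers (longitude, latitude).
presences = np.array([
    [-125.3, 49.7],
    [-125.9, 49.4],
    [-126.4, 50.1],
])

# Hypothetical study-area bounding box: (lon_min, lon_max, lat_min, lat_max).
bbox = (-128.0, -123.0, 48.5, 51.0)
min_distance_deg = 0.2   # keep pseudo-absences away from known presences


def generate_pseudo_absences(presences, bbox, n_points, min_dist):
    """Sample random background points, discarding those too close to presences."""
    lon_min, lon_max, lat_min, lat_max = bbox
    accepted = []
    while len(accepted) < n_points:
        candidate = rng.uniform([lon_min, lat_min], [lon_max, lat_max])
        # Euclidean distance in degrees is a crude proxy; fine for a sketch.
        distances = np.linalg.norm(presences - candidate, axis=1)
        if distances.min() > min_dist:
            accepted.append(candidate)
    return np.array(accepted)


pseudo_absences = generate_pseudo_absences(presences, bbox, n_points=100,
                                           min_dist=min_distance_deg)
# Presences get label 1, pseudo-absences label 0; together they form the
# training set for a presence/absence model.
labels = np.concatenate([np.ones(len(presences)), np.zeros(len(pseudo_absences))])
```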

Expert classifications constitute the so-called “gold standard”: a high-quality dataset regarded as the most accurate and reliable of its kind, which can be used to measure the accuracy and reliability of algorithm results. The development of a gold standard can be considered a high-complexity task, as it usually relies on multiple experts agreeing on classifying certain items with a high degree of certainty. Expert classifications were used to perform the initial training of the algorithm (Crowston et al., 2020; Jackson et al., 2020), to calibrate and fine-tune machine-learning performance (Beaumont et al., 2014; Jiménez et al., 2020), or to provide the testing set for computational classification methods (Crowston et al., 2020; Tar et al., 2017). Expert classifications were also included in the guidelines for volunteers (Keshavan et al., 2019), used to assess the accuracy of citizen scientists’ classifications and give feedback to volunteers (Jackson et al., 2020; Zevin et al., 2017), and used to weight each citizen scientist’s vote in the final label based on how much their labels corresponded to the gold-standard set (Keshavan et al., 2019).
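As an illustration of how a gold-standard set can be used to weight volunteer votes, the sketch below derives a per-volunteer weight from agreement with expert labels and then aggregates weighted votes into a final label. This is a simplified scheme with invented data, not the exact procedure of Keshavan et al. (2019).

```python
from collections import defaultdict

# Hypothetical gold-standard labels assigned by experts (image id -> label).
gold = {"img1": "cat_A", "img2": "cat_B", "img3": "cat_A"}

# Hypothetical volunteer classifications: (volunteer id, image id, label).
votes = [
    ("v1", "img1", "cat_A"), ("v1", "img2", "cat_B"), ("v1", "img4", "cat_A"),
    ("v2", "img1", "cat_B"), ("v2", "img3", "cat_A"), ("v2", "img4", "cat_B"),
]

# Step 1: weight each volunteer by agreement with the gold standard.
agreement = defaultdict(lambda: [0, 0])  # volunteer -> [correct, seen]
for volunteer, image, label in votes:
    if image in gold:
        agreement[volunteer][1] += 1
        if label == gold[image]:
            agreement[volunteer][0] += 1
weights = {v: (c / n if n else 0.5) for v, (c, n) in agreement.items()}

# Step 2: aggregate weighted votes on images without expert labels.
scores = defaultdict(lambda: defaultdict(float))  # image -> label -> score
for volunteer, image, label in votes:
    if image not in gold:
        scores[image][label] += weights.get(volunteer, 0.5)

final_labels = {img: max(lbls, key=lbls.get) for img, lbls in scores.items()}
print(weights, final_labels)
```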

Data analysis

Experts were involved in the evaluation of results generated by ML using citizen data. A low error level demonstrated the viability of involving citizen scientists to produce training data for ML. For example, researchers evaluated the predictive accuracy of species distribution models based on the automated identification of citizen observations using CNN (Botella et al., 2018), and the climatic niche of invasive mosquitoes using a support vector machine (Kerkow et al., 2020). Bowley et al. (2019) reported on comparing the results of ML training using citizen data and using expert classifications. Other authors such as Curry et al. (2018) and Jackson et al. (2015) reported on comparing the performance of different ML methods to predict species distribution using citizen data.

Furthermore, the results of ML classifications were compared with manual classifications done by field experts (Nguyen et al., 2018; Pearse et al., 2018; Wright et al., 2017). A similar approach was reported by Kress et al. (2018) in relation to Leafsnap, where a deep learning algorithm was used to define the contours of a leaf and visual recognition software was employed to find an ordered set of possible matches for it in the database. However, Leafsnap participants needed to confirm the classification suggestions made by the algorithm. Therefore, in this project, the validation of accuracy referred to the results from both citizen scientists and ML models. Unique among these tasks were the validation procedures reported by Lambers et al. (2019), where experts and citizen scientists together validated new potential archeological objects identified using ML by going into the field.

AI computational technologies: nature of the tasks performed, skills needed, and activities

In the reviewed papers, there is a variety of computational technologies using machine learning and/or neural network-based paradigms. Interested readers can find more details about the types of technologies and their reported use in Supporting Information 1 (Annotated review articles). These technologies rely on several common ML methods, such as classification, regression, transfer learning, deep learning, and clustering. For definitions of these methods, we refer readers to Castañón (2019).

The skills developed by computational technologies can be grouped into two main categories: recognition and prediction. Recognition refers to the classification and detection of objects in images, but also to clustering to sort data into specific groups. Classification and object detection are the most common tasks performed by these technologies across a variety of projects in the fields of ecology and biodiversity, and astronomy and astrophysics. Classification and object detection use various ML algorithms (e.g., the Brut algorithm based on random forests, CNNs), which are often based on a supervised paradigm and consist of two main steps: training a classifier with samples considered “gold-standard” (expert classifications), volunteer consensus data (“ground truth”), or a combination of both, and testing the effectiveness of the classifier on other samples, ensuring that none of the samples in the test set are also used for training.
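To make this two-step workflow concrete, the following minimal sketch (in Python, using scikit-learn) trains a random forest on a placeholder feature matrix and evaluates it on a held-out test set. The random forest, the synthetic features, and the labels are illustrative assumptions; the reviewed projects use a variety of classifiers and data sources.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Placeholder feature matrix (e.g., image features) and labels coming from
# expert "gold-standard" classifications and/or volunteer consensus.
X = rng.normal(size=(1000, 32))
y = rng.integers(0, 3, size=1000)   # three hypothetical classes

# Hold out a test set so that no training sample is used for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)                            # step 1: train on labeled samples
print(accuracy_score(y_test, clf.predict(X_test)))   # step 2: test on unseen samples
```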

Prediction refers to making predictions of future outcomes from given data, or to reducing the errors of a model. Prediction tasks include predicting environmental conditions (e.g., air quality, or variations in the data); addressing biases in the original data or in citizen scientists’ classification and detection results; improving performance by learning from citizen scientist contributions; modeling species’ geographical distributions; and learning from player moves in a game. Table 7 provides some examples of the main tasks performed by these technologies and the skills they need.

Table 7 Tasks performed by AI computational technologies and skills needed across the activities—examples from the literature.

Data processing

Examples of classification and object detection include a convolutional neural network and a residual neural network trained on both citizen scientist and expert labels to classify galaxy images (Jiménez et al., 2020), and a deep learning algorithm trained on Serengeti Wildebeest Count project data and used for counting wildlife in aerial survey images (Torney et al., 2019). It was argued that, with a limited number of citizen scientists and increasingly large databases of images, computational technologies offer a way to scale up data processing, overcome the analysis ‘bottleneck’ problem, and relieve some of the burden on researchers and citizen scientists, who would only have to classify enough images to train ML models rather than the whole dataset (Torney et al., 2019; Wright et al., 2017).

Clustering is another task performed by ML (Coughlin et al., 2019; Wright et al., 2019). Coughlin et al. (2019) reported on DIRECT, a transfer learning algorithm; transfer learning is an ML method that reuses a model previously developed for a different task. The aim of DIRECT was to facilitate the discovery of new glitches by citizen scientists in the Gravity Spy project. Owing to the sheer volume of available images, it is extremely difficult for volunteers to identify new classes by finding a sufficient number of similar objects that do not belong to any of the known classes. Thus, DIRECT clustered similar images together and offered these sets to volunteers to make their judgment. Wright et al. (2019) reported on using Deep Embedded Clustering (DEC), a method that learns feature representations and cluster assignments, to produce an initial grouping of similar images. In the Supernova Hunters project, grouped images were shown to citizen scientists, who had to mark all of the objects belonging to a given class. Citizen scientists’ labels were then fed back to the DEC algorithm to make the clustering purer. Compared to the standard image-by-image presentation, Wright et al. found that the DEC model reduced the volunteer effort needed to label a new dataset to about 18% of that required by the standard approach for gathering labels.
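As a rough illustration of this human-machine loop, the sketch below groups placeholder image embeddings with ordinary k-means and collects the images in each cluster so that they could be shown to volunteers together. It deliberately simplifies the approach reported by Wright et al. (2019): DEC learns the feature representation and cluster assignments jointly, which plain k-means does not.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Placeholder feature vectors, e.g., embeddings extracted from images by a CNN.
image_ids = [f"img_{i}" for i in range(500)]
features = rng.normal(size=(500, 64))

# Group similar images so volunteers can judge a whole cluster at a time.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=1).fit(features)

clusters = {}
for image_id, cluster_id in zip(image_ids, kmeans.labels_):
    clusters.setdefault(cluster_id, []).append(image_id)

# Each cluster would be shown to volunteers, whose labels could then be fed
# back to refine the clustering, as in the human-machine loop described above.
for cluster_id, members in sorted(clusters.items()):
    print(cluster_id, len(members))
```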

Lambers et al. (2019) also reported on using contributions from citizen scientists to improve the results of multi-class archeological object detection based on CNNs. In their project, volunteers participated in field expeditions to validate archeological objects detected by the algorithm, and the results were used to tune the algorithm’s object detection. In other fields, computational technologies such as random forest classification (Thornhill et al., 2017), a stacked ensemble model (Lim et al., 2019), and a generalized linear model (Kumagai et al., 2018) have been applied to data collected by citizen scientists, combined with environmental or urban data collected by scientists, to predict water quality, air quality, and coral bleaching, respectively. ML models have also been used to model species distributions in biodiversity research, such as the White-tailed Ptarmigan distribution over Vancouver Island (Jackson et al., 2015) or the Asian bush mosquito distribution area (Kerkow et al., 2020). The datasets used for training were usually combined from different sources: observations collected and reported by citizen scientists, and sometimes by experts as well, and environmental or climate data extracted by the researchers. In another project, Adams et al. (2020) used an automated ML process to adjust AirBeam sensor measurements, which showed errors during times of high humidity. They employed a temporal adjustment algorithm to mitigate biases or errors in the data collected and/or classified by citizen scientists.
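The general pattern of combining citizen observations with researcher-supplied environmental variables to predict occurrence can be sketched as follows. This is a generic illustration with synthetic values, and a random forest stands in for the various models used in the reviewed studies; it does not reproduce the specific procedures of Thornhill et al. (2017), Jackson et al. (2015), or Kerkow et al. (2020).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)

# Hypothetical training table: citizen-reported presences and generated
# pseudo-absences, joined by researchers with environmental covariates.
train = pd.DataFrame({
    "mean_temp":     rng.normal(10, 3, 400),
    "precipitation": rng.normal(900, 200, 400),
    "elevation":     rng.normal(500, 150, 400),
    "present":       rng.integers(0, 2, 400),   # 1 = observed, 0 = pseudo-absence
})

model = RandomForestClassifier(n_estimators=300, random_state=2)
model.fit(train[["mean_temp", "precipitation", "elevation"]], train["present"])

# Hypothetical grid of environmental conditions covering the study area;
# the model predicts a probability of occurrence for every grid cell.
grid = pd.DataFrame({
    "mean_temp":     rng.normal(10, 3, 2500),
    "precipitation": rng.normal(900, 200, 2500),
    "elevation":     rng.normal(500, 150, 2500),
})
occurrence_prob = model.predict_proba(grid)[:, 1]
print(occurrence_prob[:5])
```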

In the area of games, the papers on EteRNA emphasize the development of algorithmic approaches that learn from player actions. Lee et al. (2014) developed and trained the EteRNABot algorithm, which incorporates machine-learning regression with five selected and cross-validated design rules discovered by players, and is used to predict the design of RNA structures. An EternaBrain convolutional neural network (CNN) trained on expert moves was described by Koodli et al. (2019). Based on the test results, the algorithm achieved accuracy levels of 51% in base prediction and 34% in location prediction, indicating that top players’ moves were sufficiently stereotyped to allow a neural network to predict moves with a level of accuracy much higher than chance (p. 2).

Data analysis

A concern addressed through the use of AI computational technologies is that citizen scientists’ participation in data collection may not be uniformly distributed in space and can be skewed toward capturing observations rather than absences. For example, Derville et al. (2018) compared five species distribution models to see how they account for the sampling bias present in nonsystematic citizen science observations of humpback whales. Other papers discussed employing transfer learning when only a small training dataset was available. An example is Willi et al. (2019), who developed a model based on data from the Snapshot Serengeti citizen science project and then applied it in another project, where only smaller datasets were available, to improve accuracy.

Another reason for applying AI computational technologies is that volunteers may misclassify data owing to their varying levels of expertise and their proneness to human error. This issue is addressed in several ways. Tar et al. (2017) evaluated the false-positive contamination in the Moon Zoo project by utilizing predictive error modeling. Keshavan et al. (2019) compared the citizen science ratings to the gold standard created by experts. In the Galaxy Zoo project, Shamir et al. (2016) proposed using a pattern recognition algorithm to evaluate the consistency of annotations made by individual volunteers.

To measure the expertise of eBird volunteer observers, Kelling et al. (2012) used a probabilistic machine-learning approach. In their study, they used an occupancy-detection experience model to measure the probability of a given species being detected at a given site, and to distinguish expert observers from novice observers, who are more likely to misidentify common bird species. Researchers used this approach to provide volunteers with feedback on their observation accuracy and to improve a training dataset for an ML algorithm. In addition to eBird, Gravity Spy is one of the few projects that used feedback and training to evaluate citizen contributions (Crowston et al., 2020; Jackson et al., 2020; Zevin et al., 2017). Volunteers were guided through several training levels with the ML system: they were first shown glitches belonging to two classes with a high level of ML-determined probability, and, as they learned to classify them, the number of classes was increased and images with lower ML confidence scores were offered.
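The idea of leveling volunteers up with the help of ML confidence scores can be sketched as follows; the level definitions, class names, and thresholds are invented for illustration and do not reproduce the actual Gravity Spy implementation.

```python
# Hypothetical training levels: which glitch classes a volunteer has unlocked
# and the minimum ML confidence required for images shown at that level.
LEVELS = {
    1: {"classes": {"Blip", "Whistle"},                         "min_confidence": 0.90},
    2: {"classes": {"Blip", "Whistle", "Koi_Fish", "Scratchy"}, "min_confidence": 0.70},
    3: {"classes": None,                                        "min_confidence": 0.0},  # all classes
}


def eligible_for_level(image, level):
    """Decide whether an ML-scored image should be shown to volunteers at a level."""
    spec = LEVELS[level]
    class_ok = spec["classes"] is None or image["ml_label"] in spec["classes"]
    confidence_ok = image["ml_confidence"] >= spec["min_confidence"]
    return class_ok and confidence_ok


# Hypothetical queue of images with ML-predicted labels and confidence scores.
queue = [
    {"id": "g1", "ml_label": "Blip",     "ml_confidence": 0.97},
    {"id": "g2", "ml_label": "Koi_Fish", "ml_confidence": 0.75},
    {"id": "g3", "ml_label": "Scratchy", "ml_confidence": 0.40},
]

for level in LEVELS:
    shown = [img["id"] for img in queue if eligible_for_level(img, level)]
    print(f"level {level}: {shown}")
```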

A project profile matrix based on distribution of work: discussion

The interdependence of AI computational technologies and domain experts

The results of the review guided our development of a matrix to classify CS projects on the basis of our adaptation of Franzoni and Sauermann’s (2014) framework. Table 8 presents a summary of examples. Since ill-structured tasks were not found in the reported projects, this sub-dimension is not included in the matrix.

Table 8 Project profile matrix—examples.

We have plotted some of these examples on two axes in Fig. 2. The horizontal axis represents the nature of the task and is broken into three levels of interdependence, while the vertical axis represents the skill requirements and is also broken into three levels.

Fig. 2 Nature of the task outsourced to humans and AI computational technologies.

While task complexity is largely defined by Franzoni and Sauermann (2014) from a social perspective, the degree of complexity of tasks changes in CS projects using computational technologies (i.e., experts perform tasks with a high level of complexity, while citizens and technologies can be assigned tasks with a lower level of complexity). In our review we found that a large share of the CS projects involve citizens in performing tasks that tend to be of low complexity, well-structured, and requiring only skills that are common in the general population. Classifying species or marking potential archeological sites, for example, can involve a large number of individuals working in parallel, independently. At a medium level of complexity are moderately complex and relatively well-structured tasks, such as the two-dimensional puzzles that citizens solve in the EteRNA game. In this task, players work individually or in groups to explore different solutions collaboratively. Although the game can be played without background scientific knowledge, success lies in visualizing and manipulating the design interface to create two-dimensional structures that can include complex patterns such as lattices, knots, and switches (Lee et al., 2014). Moreover, players seem to adapt their puzzle strategies based on the results of the laboratory experiments (Lee et al., 2014). The case suggests some interdependence, as the results reached by individual players can be aggregated into a single outcome, which can be referred to as additive/pooled coordination (Nakatsu et al., 2014), and then reused or adapted by other players.

Unlike citizen scientists, experts primarily work on well-structured, medium- and high-level tasks that require expert skills in specific domains. For example, trained neuroimaging experts cooperatively created a gold-standard dataset to be used by citizens for “amplifying” expert decisions (Keshavan et al., 2019). In this task, expert workers are highly interdependent and are expected to consider what each is doing. Interdependence is demonstrated again by the involvement of trained life scientists who collaborate to compare the predictive accuracy of ML species distribution models. The tasks performed by experts and reported in the reviewed literature did not require common skills. Now let us consider the tasks that computational technologies perform. Unsurprisingly, we see that these tasks are on the right side of the diagram. The results of our study indicate that these technologies are capable of performing mostly well-structured, high-level tasks. These tasks appear interdependent in a sequential or reciprocal manner. Sequential interdependence takes place when the output of one task serves as the input to another (Haeussler and Sauermann, 2015). This seems to be the case when computational technologies mitigate errors in the data provided by citizens. Reciprocal interdependence refers to tasks that depend on each other and can require mutual adjustment (Haeussler and Sauermann, 2015). For example, when performing tasks like clustering and classifying images, or predicting environmental conditions, these technologies need to build on human work in a reciprocal fashion, as they must be trained on specific datasets to develop a predictive model and then deploy it. Experts then need to check and validate the results produced by the algorithmic model and adjust the model when necessary.

The skills of computational technologies make them a scalable complement to citizens and researchers, for example by structuring large amounts of unfiltered data into information, or by estimating the probability of the occurrence of an event based on input data. However, to assume that machine learning and other computational technologies can replace humans entirely in citizen science is to downplay their currently limited autonomy and “smartness”, as they still require the intervention of experts and engaged citizens. The distribution of tasks emerging from the review indicates that experts work both on and with computational technologies. For example, they work on them by training models; but once models are trained, they still require a human expert in the loop to work with them, to interpret their predictions and possibly refine them to obtain the most accurate results for unseen and unknown data (Budda et al., 2021).

Having examined the tasks performed by computational technologies and the rationale by which functions are allocated to them in CS projects, we can infer mechanisms that make certain tasks more suitable for existing computational methods. In line with such mechanisms, Brynjolfsson and Mitchell (2017) set out criteria to identify tasks that are likely to be suitable for ML under the currently dominant paradigm, particularly supervised learning. Brynjolfsson and Mitchell’s criteria include: (a) the task consists of learning a function that maps well-defined inputs to well-defined outputs, as in the classification of images and the prediction of the likelihood of events; (b) the task provides clear feedback, and goals and metrics for performance are clearly defined; when training data are labeled according to gold standards, for example, ML is particularly powerful in achieving its set goals; (c) the task involves learning empirical associations in data, where ML excels, rather than long chains of reasoning or complex planning that require common sense or background knowledge unknown to the computer; (d) the task tolerates errors of the learned system, as most ML algorithms derive their solutions statistically and probabilistically, so it is seldom possible to train them to total accuracy: even the best object recognition computer systems make errors; and (e) the task is one where the inability of ML systems to explain why or how they made a certain decision is not critical. Brynjolfsson and Mitchell give the example of systems capable of diagnosing types of cancer as well as or better than expert doctors, but unable to explain why or how they came up with the diagnosis.

However, ML will continue to advance, and other methods will become more suitable for different tasks. The division of cognitive work between humans and computational technologies will keep shifting, challenging the ontological boundaries between them. Hence, we should be careful not to essentialize the qualities of humans and machines, both of which are constantly evolving, and whose lists of what each is “good at” (whether relative or absolute) are constantly changing. The processing power and sophistication of algorithms have already increased to previously unimaginable levels and, for example, some computer programs now outperform humans in abstract games (Gibney, 2016) or at image recognition (Johnson, 2015). However, some scholars might argue that these are rather narrow domains, which cannot compare to the complexities of human cognitive, emotional, and social abilities (e.g., Dignum, 2019).

Scientists and AI computational technologies: will the role of citizens become unnecessary?

A large share of the CS projects involve citizens in performing tasks in contributory projects that are “designed by scientists and for which members of the public primarily contribute data” (Shirk et al., 2012). The results of our review indicate a trend toward task polarization, with citizens performing well-structured, low-complexity tasks requiring primarily common skills, and experts performing well-structured, higher-complexity tasks requiring training and specialization. As technology races ahead, however, both types of task seem susceptible to computerization, with both citizens and experts being reallocated to tasks that are less susceptible to computerization, i.e., tasks requiring creative and social intelligence. In this regard, the differentiation between task and skill made by Autor et al. (2003) is useful: a task denotes a unit of activity performed at work that produces output, while skill refers to the human capabilities required to perform a task. The Routine-Biased Technological Change (RBTC) approach (Arntz et al., 2016) builds on this differentiation and analyzes tasks along a routine versus nonroutine axis. Following this approach, a job’s substitutability is determined by the number of routine tasks it requires, rather than by the level of skills it needs (Arntz et al., 2016). Routine tasks can be performed either manually or cognitively, while nonroutine tasks, also known as abstract tasks, involve problem-solving, intuition, and creativity. Routine tasks that follow a well-defined practice can be more easily codified and performed automatically by algorithms. In the reviewed CS projects, routine tasks, such as collecting data or counting and demarcating objects in images, seem to be prevalent—although not exclusively—in citizens’ contributions. Even the classification of objects following authoritative taxonomies can be considered a routine task that can be codified and performed by algorithms. Almost any task in CS projects that relies on pattern recognition is susceptible to automation, as long as adequate data are collected for training algorithms (Frey and Osborne, 2013).

Citizens are more likely to be involved in nonroutine tasks when playing games. Foldit is exemplary: players can use, write, and adapt recipes to manage the increasing complexity of the game (Cooper et al., 2011). Recipes are computer programs that allow players to interact automatically with a protein and either repeat simple routines consistently or perform a series of complex routines that keep running in the background indefinitely. Although recipes automate a number of simple, time-consuming, and repetitive manual actions, they have not yet replaced the skills that citizens learn over time, through training and intensive play, and that are needed to perform nonroutine tasks in the game (Ponti et al., 2018). When it comes to designing RNA sequences that fold into particular shapes, computational technologies have proved to be second to humans, even after being endowed with insights from the human folders (Lee et al., 2014).
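To give a flavor of the kind of simple, repetitive routine that recipes automate, the following is a purely hypothetical sketch and not Foldit's actual scripting interface: a loop that repeatedly applies a small random tweak to a state and keeps only changes that improve a score. The strategic, nonroutine decisions about when and how to deploy such routines remain with the player.

```python
# Hypothetical recipe-style automation: repeat a simple routine, keep improvements.
import random


def local_tweak(state):
    """Hypothetical stand-in for a small, routine adjustment to the current state."""
    return [x + random.uniform(-0.1, 0.1) for x in state]


def score(state):
    """Hypothetical stand-in for the game's scoring function (higher is better)."""
    return -sum(x * x for x in state)


def run_recipe(state, iterations=10_000):
    """Repeat the routine many times, keeping only changes that improve the score.
    Deciding when to run which recipe, and on what, is left to the player."""
    best, best_score = state, score(state)
    for _ in range(iterations):
        candidate = local_tweak(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


final_state, final_score = run_recipe([1.0, -2.0, 0.5])
print(f"final score: {final_score:.3f}")
```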

Regarding the use of skills, authors in the automation literature (e.g., Brynjolfsson and McAfee, 2016; Eurofound, 2018; Goos et al., 2019) commonly assert that technological developments have upgraded the skill requirements of occupations by complementing the skills of highly skilled, highly educated professionals. At the same time, these technologies have lowered the demand for lower-skilled, less educated workers, as the tasks they perform are more susceptible to replacement by technologies. In the context of CS, though, it is not clear whether the use of algorithms could result in the disappearance of "low skill" roles, nor is it clear what we mean by low-skilled or unskilled work in this specific area. Even when a task is considered low-skill, it does not follow that computational technologies can perform it easily.

Conclusion and future research

A growing number of CS projects combine human effort and computational technologies. Yet the distribution of tasks among experts, citizen scientists, and these technologies does not seem to have received much attention. Even though task allocation has long been a central topic in the literature, we are unaware of previous studies examining it in the context of citizen science.

We summarized the results of an integrative review to illustrate the current state of the distribution of tasks between humans and computational technologies in CS. We used an adapted version of the framework developed by Franzoni and Sauermann (2014) to analyze the results and highlight differences in the nature of the tasks and in the skills that humans and computational technologies contribute to perform them. We hope that this framework may provide a useful tool for analyzing task allocation not just in citizen science but also in other contexts where knowledge production involves a crowd and computational technologies. The framework may also be useful not only for "mapping" projects but also for inferring underlying causal mechanisms, such as why computational technologies seem better suited to certain tasks. An important next step would be to learn why certain CS projects cannot be performed entirely algorithmically and still require human contribution, while others are already suited for full automation (Rafner et al., 2021).

Although we aimed to include all relevant papers in this review, some may have been missed because we left out preprints, reports, and other types of non-peer-reviewed literature. Future research could consider using a larger number of databases, publication types, and publication languages in order to widen the scope of the review. Furthermore, our sampling strategy has two limitations. First, we did not search catalogs of CS projects (e.g., Scistarter). Searching publication databases only retrieved projects that have been written about in articles. This strategy may have introduced certain biases: for example, we may not have captured projects run by non-scientists, since they may not care to publish a paper about their project. We can also assume that most research papers report on successful rather than unsuccessful projects, meaning we have mostly been exposed to divisions of labor that worked rather than divisions that did not. Second, we acknowledge that using Baert's (2019) game list has resulted in selection bias. To our knowledge, there is no complete sampling frame listing all existing citizen science games from which to draw a representative sample. Therefore, we opted for a convenience sample drawn from the game list, in addition to the articles selected for this review. More specifically, it is a purposive sample, as we relied on our own judgment when choosing which games to include (Fricker, 2012).

Overall, the projects reported in the literature put an emphasis on process optimization. The use of computational technologies presents opportunities in citizen science to analyze massive datasets with greater speed, accuracy, and efficiency, and to advance scientific discovery. However, as mentioned earlier in this paper, concerns have been raised over the potential risks of disengaging citizen scientists by reducing the range of their possible contributions or making them either too simple or too complex (Trouille et al., 2019; Leach et al., 2020). If citizens come to think that the only thing they are good at can be replaced by machine learning, they may feel left out and useless (Ponti et al., 2021).

Citizen science projects are not ordinary workplaces. Unpaid citizens volunteer their time and effort; deriving personal meaning and value from performing a task is therefore important to sustaining engagement. Arguably, if the organizers of CS projects focus primarily on efficiency and productivity goals, they may replace citizens as much as possible and only make tasks as meaningful as needed to keep volunteers engaged. In contrast, if organizers also want to achieve "democratization" goals, they will use computational technologies more for the benefit of human engagement and may even retain citizens on tasks that those technologies could perform more efficiently. Currently, the difference may not matter much, because computational technologies are not yet capable of replacing humans entirely. However, this difference can become critical once AI is more powerful: organizers will then have to decide whether to maximize efficiency by replacing citizens, or to maximize engagement by keeping citizens in the loop and using computational technologies to make tasks more interesting and meaningful for people. Future research could look more closely at whether the use of computational technologies benefits a variety of citizens or only certain groups (e.g., those with more technical expertise), and whether certain cultural domains are better supported than others.

We are aware that one size does not fit all. A task that is boring for one person can be a joy for another, and some volunteers may prefer to engage their brains and choose more complex tasks. Nevertheless, taking into account the meaningful roles that citizen scientists can play alongside experts and computational technologies, for example in the form of additional data validation and other types of human-based analysis that strengthen analytical rigor or broaden the diversity of analytical lenses, remains an unavoidable issue in the design of task assignment. It may be useful to complement the focus on efficiency and speed with greater attention to other goals, such as volunteer learning and development. This issue becomes particularly salient when we consider the division of labor in the context of the democratization of science, diversity, and inclusion, long-standing challenges in citizen science (Cooper et al., 2021).