An open source machine learning framework for efficient and transparent systematic reviews

To help researchers conduct a systematic review or meta-analysis as efficiently and transparently as possible, we designed a tool to accelerate the step of screening titles and abstracts. For many tasks—including but not limited to systematic reviews and meta-analyses—the scientific literature needs to be checked systematically. Scholars and practitioners currently screen thousands of studies by hand to determine which studies to include in their review or meta-analysis. This is error prone and inefficient because of extremely imbalanced data: only a fraction of the screened studies is relevant. The future of systematic reviewing will be an interaction with machine learning algorithms to deal with the enormous increase of available text. We therefore developed an open source machine learning-aided pipeline applying active learning: ASReview. We demonstrate by means of simulation studies that active learning can yield far more efficient reviewing than manual reviewing while providing high quality. Furthermore, we describe the options of the free and open source research software and present the results from user experience tests. We invite the community to contribute to open source projects such as our own that provide measurable and reproducible improvements over current practice. It is a challenging task for any research field to screen the literature and determine what needs to be included in a systematic review in a transparent way. A new open source machine learning framework called ASReview, which employs active learning and offers a range of machine learning models, can check the literature efficiently and systemically.


Main Text
With the emergence of online publishing, the number of scientific papers on many topics is skyrocketing. 1 All these textual data present opportunities to scholars and practitioners, while simultaneously confronting them with new challenges. To develop comprehensive overviews of the relevant topics, scholars often develop systematic reviews and meta-analyses. 2 The process entails several explicit and, ideally, reproducible steps, including identifying all likely relevant publications in a standardized way, extracting data from eligible studies, and synthesizing the results. Systematic reviews differ from traditional literature reviews in that they are more replicable and transparent. 3,4 Given that screening the entire research literature on a given topic is too labor intensive, scholars often develop quite narrow searches. Developing a search strategy for a systematic review is an iterative process aimed at balancing recall and precision; 8,9 that is, including as many potentially relevant studies as possible, while at the same time limiting the total number of studies retrieved.
Often, the vast number of publications in the field of study may lead to a relatively precise search, with the risk of missing relevant studies. The process of systematic reviewing is error-prone and extremely time intensive. 10 In fact, if a field's literature is growing faster than the amount of time available for systematic reviews, adequate manual review of this field becomes impossible. 11 [...] 12-14 Machine learning offers approaches to overcome the manual and time-consuming screening of large numbers of studies by prioritizing relevant studies using active learning. 15 [...] 17,18 In most so-called Human-in-the-Loop (HITL) 19 machine learning applications, the interaction between the machine learning algorithm and the human is used to train a model with a minimum number of labeling tasks. Unique to systematic reviewing is that not only should all relevant records (i.e., titles and abstracts) be seen by a researcher, but an extremely diverse range of concepts also needs to be learned, thereby requiring flexibility in the modeling approach as well as careful error evaluation. 11 In the case of systematic reviewing, the algorithm(s) are interactively optimized for finding the most relevant records, instead of finding the most accurate model. Therefore, the term Researcher-In-The-Loop (RITL) was introduced 20 as a special case of HITL with three unique components: (1) the primary output of the process is a selection of the records, not a trained machine learning model; (2) all records in the relevant selection are seen by a human at the end of the process 21; and (3) the use case requires a reproducible workflow and complete transparency. 22
Existing tools implementing such an active learning cycle for systematic reviewing are described in Table 1; see the Appendix for an overview of all the software we considered (note that this list was based on a review of software tools 12). However, existing tools have two main drawbacks. First, many are closed source applications with black box algorithms. This is problematic as transparency and data ownership are essential in the era of open science 22. Second, to our knowledge, existing tools lack the necessary flexibility to deal with the large range of possible concepts to be learned by a screening machine. For example, in systematic reviews, the optimal type of classifier will depend on variable parameters, such as the proportion of relevant publications in the initial search and the complexity of the inclusion criteria used by the researcher. 23 For this reason, any successful system must allow for a wide range of classifier types. Benchmark testing is crucial to understand the real-world performance of any ML-aided system, but currently, such benchmark options are mostly lacking.

<< TABLE 1 >>
In this paper, we present an open source ML-aided pipeline with active learning for systematic reviews called ASReview. The goal of ASReview is to help scholars and practitioners get an overview of the most relevant records for their work as efficiently as possible, while being transparent in the process. The open, free and ready-to-use software ASReview addresses all concerns mentioned above: it is open source, uses active learning, and allows multiple ML models. It also has a benchmark mode, which is especially useful for comparing and designing algorithms. Furthermore, it is intended to be easily extensible, allowing third parties to add modules that enhance the pipeline. Although we focus this paper on systematic reviews, ASReview can handle any text source.
In what follows, we first present the pipeline for manual versus ML-aided systematic reviews. Subsequently, we present how ASReview has been set up, and how ASReview can be used in different workflows by presenting several real-world use cases. Then, we present the results of simulations that benchmark performance, and present the results of a series of user-experience tests. Last, we discuss future directions.

Pipeline for manual and ML-aided systematic reviews
Traditionally, the pipeline of a systematic review without active learning starts with researchers doing a comprehensive search in multiple databases 24, using free text words as well as controlled vocabulary to retrieve potentially relevant references. The researcher then typically verifies that key papers they expect to find are indeed included in the search results. The researcher downloads a file with records containing the text to be screened into a reference manager. In the case of systematic reviewing, it contains the titles and abstracts of potentially relevant references, and potentially other metadata such as authors, journal, and DOI. Ideally, two or more researchers then screen the records' titles and abstracts based on eligibility criteria established beforehand. 4 After all records have been screened, the full texts of the potentially relevant records are read to determine which will ultimately be included in the review. Most records are excluded in the title and abstract phase. Typically, only a small fraction of the records belong to the relevant class, making title and abstract screening an important bottleneck in the systematic reviewing process 25. For instance, a recent study analyzed 10,115 records and excluded 9,847 after title and abstract screening, a drop of more than 95%. 26 Therefore, ASReview focuses on this labor-intensive step.
The research pipeline of ASReview is depicted in Figure 1. The researcher starts with a search exactly as described above, and subsequently uploads a file containing the records (i.e., metadata containing the text of the titles and abstracts) into the software. Prior knowledge is then selected, which is used to train the first model and to present the first record to the researcher.
Because screening is a binary classification problem, the reviewer must select at least one key record to include and at least one to exclude based on background knowledge. More prior knowledge may result in improved efficiency of the active learning process.

<< FIGURE 1>>
Based on the prior knowledge, a machine learning classifier is trained to predict study relevance (labels) from a representation of the record containing text (feature space). In order to prevent "authority bias" in the inclusions, we have purposefully chosen not to include an author name or citation network representation in the feature space. In the active learning cycle, the software presents one new record to be screened and labeled (1 = "relevant" vs. 0 = "irrelevant") by the user. The user's binary label is subsequently used to train a new model, after which a new record is presented to the user. This cycle continues until a certain user-specified stopping criterion has been reached. The user now has a file with (1) records labeled as either relevant or irrelevant and (2) unlabeled records ordered from most to least probable to be relevant as predicted by the current model. This setup helps to move through a large database much quicker than in the manual process, while, at the same time, the decision process remains transparent.
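The cycle described above can be sketched in a few lines of Python. This is a minimal illustration and not the ASReview implementation: the toy records, the function names (`train`, `prob_relevant`, `screen`), the whitespace tokenizer and the fixed query budget used as stopping criterion are all assumptions made for the example. It uses a small multinomial naive Bayes model and always presents the record that the current model considers most likely to be relevant.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; deliberately simplistic.
    return text.lower().split()


def train(texts, labels):
    """Fit a tiny multinomial naive Bayes model with add-one smoothing.

    Both classes (0 and 1) must be present, which is why the prior
    knowledge needs at least one relevant and one irrelevant record.
    """
    token_counts = {0: Counter(), 1: Counter()}
    doc_counts = Counter(labels)
    for text, label in zip(texts, labels):
        token_counts[label].update(tokenize(text))
    vocab = set(token_counts[0]) | set(token_counts[1])
    return token_counts, doc_counts, vocab


def prob_relevant(model, text):
    """Posterior probability that a record is relevant (label 1)."""
    token_counts, doc_counts, vocab = model
    n_docs = doc_counts[0] + doc_counts[1]
    log_post = {}
    for label in (0, 1):
        n_tokens = sum(token_counts[label].values())
        score = math.log(doc_counts[label] / n_docs)
        for token in tokenize(text):
            if token in vocab:  # tokens never seen in training are ignored
                count = token_counts[label][token]
                score += math.log((count + 1) / (n_tokens + len(vocab)))
        log_post[label] = score
    top = max(log_post.values())
    weight = {k: math.exp(v - top) for k, v in log_post.items()}
    return weight[1] / (weight[0] + weight[1])


def screen(records, ask_reviewer, prior_ids, max_queries):
    """Train, present the top-ranked record, store the label, retrain."""
    labels = {i: ask_reviewer(i) for i in prior_ids}
    presented = []
    for _ in range(max_queries):  # stopping criterion: a fixed query budget
        pool = [i for i in range(len(records)) if i not in labels]
        if not pool:
            break
        seen = sorted(labels)
        model = train([records[i] for i in seen], [labels[i] for i in seen])
        best = max(pool, key=lambda i: prob_relevant(model, records[i]))
        labels[best] = ask_reviewer(best)
        presented.append(best)
    return labels, presented


# Hypothetical toy data; `truth` plays the role of the human reviewer.
records = [
    "active learning for abstract screening",
    "soil chemistry of alpine meadows",
    "machine learning for screening abstracts in reviews",
    "alpine soil erosion and meadow chemistry",
    "active learning reduces abstract screening workload",
    "meadow plants and soil nutrients",
]
truth = [1, 0, 1, 0, 1, 0]
labels, presented = screen(records, lambda i: truth[i], prior_ids=[0, 1], max_queries=2)
print(presented)
```

With one relevant and one irrelevant record as prior knowledge, the two records presented first in this toy example are both relevant ones, which is the behavior the cycle is designed to produce.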

Software implementation: ASReview
The source code 27 of ASReview is available open source under an Apache-2.0 license, including documentation 28. Compiled and packaged versions of the software are available on the Python Package Index 29 or Docker Hub 30. The free and ready-to-use software ASReview implements an 'oracle', a 'simulation' and an 'exploration' mode. The oracle mode is used to perform a systematic review with interaction by the user. The simulation mode is used for simulation of the ASReview performance on existing systematic reviews. The exploration mode can be used for teaching purposes and includes several pre-loaded labeled datasets.
The oracle mode presents records to the researcher, and the researcher classifies these. Multiple file formats are supported: (1) RIS files, which are used by digital libraries such as IEEE Xplore, Scopus and ScienceDirect; the citation managers Mendeley, RefWorks, Zotero, and EndNote support the RIS format as well. (2) Tabular datasets with the extensions .csv, .xlsx, and .xls. CSV files should be comma separated and UTF-8 encoded. For CSV files, the software accepts a set of predetermined labels in line with the ones used in RIS files. Each record in the dataset should hold metadata on, for example, a paper. Mandatory metadata is text, which can for example be titles or abstracts from scientific papers. If available, both are used to train the model, but at least one is needed. An advanced option is available that splits the titles and abstracts in the feature extraction step and weights the two feature matrices independently (for TF-IDF only). Other metadata such as author, date, URL, DOI, and keywords are optional but not used for training the models. When using ASReview in simulation or exploration mode, an additional binary variable indicating historical labeling decisions is required. This column, which is automatically detected, can also be used in the oracle mode as background knowledge for prior selection of relevant papers before entering the active learning cycle. If it is not available, the user has to select at least one relevant record, which can be identified by searching the pool of records. Also, at least one irrelevant record should be identified; the software allows the user to search for specific records or presents random records, which are most likely to be irrelevant due to the extremely imbalanced data.
The software has a simple yet extensible default model: a Naive Bayes classifier, TF-IDF feature extraction, the Dynamic Resampling balance strategy 31, and certainty-based sampling 17,32 for the query strategy. These defaults were chosen based on their consistently high performance in benchmark experiments across several datasets 31. Moreover, the low computation time of these default settings makes them attractive in applications, given that the software should be able to run locally. Users can change the settings, shown in Table 2, and technical details are described in our documentation 28. Users can also add their own classifiers, feature extraction techniques, query strategies and balance strategies.
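To make the input requirements concrete, the following sketch reads a comma-separated, UTF-8 dataset with Python's standard `csv` module. It is not the ASReview reader: the column names `title`, `abstract` and `included`, the function name `load_records` and the use of the row number as record identifier are assumptions chosen for illustration. It enforces the rule that each record needs a title or an abstract, and treats the label column as optional.

```python
import csv
import io


def load_records(handle, label_column="included"):
    """Read records from an open CSV file handle.

    Column names are hypothetical; the label column may be empty or absent.
    """
    records = []
    for row_number, row in enumerate(csv.DictReader(handle)):
        title = (row.get("title") or "").strip()
        abstract = (row.get("abstract") or "").strip()
        if not title and not abstract:
            raise ValueError(f"row {row_number}: a title or an abstract is required")
        raw_label = (row.get(label_column) or "").strip()
        label = int(raw_label) if raw_label in ("0", "1") else None
        records.append({
            "record_id": row_number,
            # Title and abstract are combined when both are available.
            "text": " ".join(part for part in (title, abstract) if part),
            "label": label,
        })
    return records


# Hypothetical example file, held in memory for the demonstration.
example_csv = io.StringIO(
    "title,abstract,included\n"
    "Active learning for screening,We study prioritization.,1\n"
    "Soil chemistry,,0\n"
    ",An abstract without a title,\n"
)
loaded = load_records(example_csv)
for record in loaded:
    print(record)
```

Records with a label of 1 or 0 could serve as prior knowledge or as ground truth in a simulation; records with `None` are the unlabeled pool.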

<< TABLE 2 >>
ASReview has a number of implemented features (see Table 2). First, there are several classifiers available: (1) naive Bayes, (2) support vector machines, (3) logistic regression, (4) neural networks, (5) random forests, (6) LSTM-base, which consists of an embedding layer, an LSTM layer with one output, a dense layer, and a single sigmoid output node, and (7) LSTM-pool, which consists of an embedding layer, an LSTM layer with many outputs, a max pooling layer, and a single sigmoid output node. Feature extraction techniques available are Doc2Vec, 33 embedding with IDF or TF-IDF 34 (the default is unigram, with the option to run n-grams, while other parameters are set to the defaults of Scikit-learn 35), and sBERT. 36 The available query strategies for the active learning part are (1) random selection, ignoring model-assigned probabilities, (2) uncertainty-based sampling, which chooses the most uncertain record according to the model (i.e., closest to 0.5 probability), (3) certainty-based sampling ("Max" in ASReview), which chooses the record most likely to be included according to the model, and (4) mixed sampling, which uses a combination of random and certainty-based sampling.
There are several balance strategies that rebalance and reorder the training data. This is necessary because the data is typically extremely imbalanced, and therefore we have implemented the following balance strategies: (1) full sampling, which uses all the labeled records, (2) undersampling the irrelevant records, so that the included and excluded records are in some particular ratio (closer to one), and (3) "Dynamic Resampling", a novel method similar to undersampling in that it decreases the imbalance of the training data 31. However, in Dynamic Resampling, the number of irrelevant records is decreased, whereas the number of relevant records is increased by duplication such that the total number of records in the training data remains the same. The ratio between relevant and irrelevant records is not fixed over interactions, but dynamically updated, depending on the number of labeled records, the total number of records and the ratio between relevant and irrelevant records. Details on all the described algorithms can be found in the code and documentation referred to above.
By default, ASReview converts the records' texts into a document-term matrix; terms are converted to lowercase and no stop words are removed (but this can be changed).
Because the document-term matrix is identical in each iteration of the active learning cycle, it is generated in advance of model training and stored in the (active learning) state file. The indexed records can easily be requested from the document-term matrix in the state file. Internally, records are identified by their row number in the input dataset. In "oracle mode", the record that is selected to be classified is retrieved from the state file, and the record text and other metadata (such as title and abstract) are retrieved from the original dataset (from file or computer memory).
ASReview can run on your local computer, or on a (self-hosted) local or remote server. Data (all records and their labels) remain on the user's computer. Data ownership and confidentiality are crucial, and no data is processed or used in any way by third parties. This stands in contrast to some of the existing systems, as shown in the last column of Table 1.
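The four query strategies differ only in how they pick the next record from the model's predicted probabilities. The sketch below expresses each as a small function over a mapping from record identifier to predicted probability of relevance. The function names and the `random_fraction` parameter of the mixed strategy are illustrative assumptions and are not taken from the ASReview code.

```python
import random


def query_random(probs, rng):
    """Pick any unlabeled record, ignoring the model's probabilities."""
    return rng.choice(sorted(probs))


def query_uncertainty(probs):
    """Pick the record whose predicted probability is closest to 0.5."""
    return min(probs, key=lambda record_id: abs(probs[record_id] - 0.5))


def query_certainty(probs):
    """Pick the record most likely to be relevant ("Max")."""
    return max(probs, key=probs.get)


def query_mixed(probs, rng, random_fraction=0.05):
    """Mostly certainty-based, with an occasional random pick.

    The 5% default is an arbitrary choice made for this illustration.
    """
    if rng.random() < random_fraction:
        return query_random(probs, rng)
    return query_certainty(probs)


# Hypothetical predicted probabilities for three unlabeled records.
probs = {10: 0.92, 11: 0.48, 12: 0.07}
rng = random.Random(42)
print(query_uncertainty(probs), query_certainty(probs), query_mixed(probs, rng))
```

Certainty-based sampling optimizes for finding relevant records quickly, whereas uncertainty-based sampling optimizes for model improvement; this mirrors the distinction between selecting records and training the most accurate model.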
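The rebalancing ideas described above can be illustrated as follows. The first function undersamples the irrelevant records to a chosen ratio. The second keeps the training set at its original size by duplicating relevant records and dropping irrelevant ones. Note that the second function is a simplification with a fixed target share, whereas Dynamic Resampling updates the ratio dynamically; the function names and parameters are hypothetical.

```python
import random


def split_by_label(labels):
    """Return sorted ids of relevant (1) and irrelevant (0) records."""
    relevant = sorted(i for i, y in labels.items() if y == 1)
    irrelevant = sorted(i for i, y in labels.items() if y == 0)
    if not relevant or not irrelevant:
        raise ValueError("need at least one relevant and one irrelevant record")
    return relevant, irrelevant


def undersample(labels, irrelevant_per_relevant, rng):
    """Keep every relevant record and a random subset of irrelevant ones."""
    relevant, irrelevant = split_by_label(labels)
    wanted = round(irrelevant_per_relevant * len(relevant))
    n_keep = min(len(irrelevant), max(1, wanted))
    return relevant + rng.sample(irrelevant, n_keep)


def resample_same_size(labels, relevant_share, rng):
    """Duplicate relevant records and drop irrelevant ones, keeping the size.

    Simplified: the target share is fixed here, not dynamically updated.
    """
    relevant, irrelevant = split_by_label(labels)
    total = len(labels)
    target = max(len(relevant), round(relevant_share * total))
    n_relevant = min(total - 1, target)  # always keep one irrelevant record
    duplicates = [rng.choice(relevant) for _ in range(n_relevant - len(relevant))]
    kept_irrelevant = rng.sample(irrelevant, total - n_relevant)
    return relevant + duplicates + kept_irrelevant


# Hypothetical labeled set: 2 relevant and 18 irrelevant records.
labels = {i: (1 if i < 2 else 0) for i in range(20)}
rng = random.Random(0)
under = undersample(labels, irrelevant_per_relevant=3, rng=rng)
same_size = resample_same_size(labels, relevant_share=0.5, rng=rng)
print(len(under), len(same_size))
```

Both functions return a list of record identifiers, possibly with repeats, that would be used to select rows of the feature matrix for training.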
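A document-term matrix of this kind can be built once and then indexed by row number in every iteration. The sketch below is a pure-Python approximation, not ASReview's feature extraction: it lowercases the text, keeps stop words, and applies a smoothed IDF weighting with L2 normalization that loosely follows scikit-learn's default TF-IDF settings. The tokenizer pattern and the function name `build_matrix` are assumptions.

```python
import math
import re
from collections import Counter

# Tokens of two or more word characters; an assumption for this sketch.
TOKEN = re.compile(r"\b\w\w+\b")


def build_matrix(texts):
    """Return (vocabulary, matrix) with one TF-IDF row per input text."""
    docs = [Counter(TOKEN.findall(text.lower())) for text in texts]
    vocab = sorted(set().union(*docs))
    doc_freq = Counter(token for doc in docs for token in doc)
    n_docs = len(docs)
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    matrix = []
    for doc in docs:
        row = [doc[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        matrix.append([v / norm for v in row])
    return vocab, matrix


texts = ["The Review of reviews", "Active learning, the basics"]
vocab, matrix = build_matrix(texts)  # computed once, before the first model

record_id = 0  # records are addressed by their row number
features = matrix[record_id]
print(vocab)
```

Because no stop words are removed, a common word such as "the" stays in the vocabulary, but its weight is lower than that of a term occurring in fewer records.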

Real world use-cases and high-level function descriptions
Below we highlight a number of real-world use cases and high-level function descriptions for using the pipeline of ASReview.
ASReview can be integrated in classic systematic reviews or meta-analyses. Such reviews or meta-analyses entail several explicit and reproducible steps, as outlined in the PRISMA guidelines. 4 Scholars identify all likely relevant publications in a standardized way, screen retrieved publications to select eligible studies based on defined eligibility criteria, extract data from eligible studies and synthesize the results. ASReview fits in this process, particularly in the abstract screening phase. ASReview does not replace the initial step of collecting all potentially relevant studies. As such, results from ASReview depend on the quality of the initial search process, including selection of databases 24 and construction of comprehensive searches using keywords and controlled vocabulary. However, ASReview can be used to broaden the scope of the search, by keyword expansion or by omitting limitations in the search query, resulting in a higher number of initial papers to limit the risk of missing relevant papers during the search part (i.e., more focus on recall instead of precision). Also, when analyzing very large literature streams, many reviewers nowadays move towards meta-reviews, that is, systematic reviews of systematic reviews. 37 This can be problematic, as the various reviews included could use different eligibility criteria and therefore are not always directly comparable. Because of the efficiency of ASReview, scholars using the tool could conduct the study by analyzing the papers directly instead of using the systematic reviews. Furthermore, ASReview supports the rapid update of a systematic review. The included papers from the initial review are used to train the machine learning model before screening of the updated set of papers starts. This allows the researcher to quickly screen the updated set of papers based on decisions made in the initial run.
As an example case, let us look at the current literature on COVID-19 and the coronavirus. An enormous number of papers are being published on this topic. It is very time consuming to manually find relevant papers, for example to develop treatment guidelines. This is especially problematic as urgent overviews are required. Medical guidelines rely on comprehensive systematic reviews, but the medical literature is growing at breakneck pace, and the quality of the research is not universally adequate for summarization into policy. 38 [...] and other coronavirus research (e.g., SARS and MERS) from PubMed Central, the WHO COVID-19 database of publications, the preprint servers bioRxiv and medRxiv and papers contributed by specific publishers 40. The CORD-19 dataset is updated daily by the Allen Institute for AI and is also updated daily in the plug-in. (2) In addition to the full dataset, we automatically construct a daily subset of the database with studies published after December 1, 2019, to search for relevant papers published during the COVID-19 crisis. (3) A separate dataset of COVID-19 related preprints, containing metadata of preprints from over 15 preprint servers across disciplines, published since January 1, 2020. 41 The preprint dataset is updated weekly by the maintainers and then automatically updated in ASReview as well. As this dataset is not readily available to researchers through regular search engines (e.g., PubMed), its inclusion in ASReview provides added value to researchers interested in COVID-19 research, especially if they want a quick way to screen preprints specifically.

Simulation study
To evaluate the performance of ASReview on a labeled dataset, users can employ the simulation mode. As an example, we ran simulations based on four labeled datasets with version 0.7.2 of ASReview. All scripts to reproduce the results in this paper can be found on Zenodo (doi:10.5281/zenodo.4024122) 42 and the results are available at OSF (doi:10.17605/OSF.IO/2JKD6) 43.

<< FIGURE 2 >>
Datasets. First, we analyzed the performance for a study systematically describing studies that performed viral Metagenomic Next-Generation Sequencing (mNGS) in common livestock such as cattle, small ruminants, poultry, and pigs. 44 Studies were retrieved from Embase (n = 1,806), Medline (n = 1,384), Cochrane Central (n = 1), Web of Science (n = 977), and Google Scholar (n = 200, the top relevant references). After deduplication, this led to 2,481 studies obtained in the initial search, of which 120 were included (4.84%).
A second simulation study was performed on the results for a systematic review of studies on fault prediction in software engineering 45. Studies were obtained from ACM Digital Library, IEEE Xplore and the ISI Web of Science. Additionally, a snowballing strategy and a manual search were conducted, yielding a total of 8,911 publications, of which 104 were included in the systematic review (1.2%).
A third simulation study was performed on a review of longitudinal studies that applied unsupervised machine learning techniques to longitudinal data of self-reported symptoms of posttraumatic stress assessed after trauma exposure 46,47. In total, 5,782 studies were obtained by searching PubMed, Embase, PsycINFO, and Scopus, and through a snowballing strategy in which both the references and the citations of the included papers were screened. Thirty-eight studies were included in the review (0.66%).
A fourth simulation study was performed on the results for a systematic review on the efficacy of angiotensin-converting enzyme (ACE) inhibitors, from a study collecting various systematic review datasets from the medical sciences 15. The collection is a subset of 2,544 publications from the TREC 2004 Genomics Track document corpus 48. This is a static subset of all MEDLINE records from 1994 through 2003, which allows for replicability of results. Forty-one publications were included in the review (1.6%).
Performance Metrics. We evaluated the four datasets using three performance metrics. First, we assess the "Work Saved over Sampling" (WSS). WSS is the percentage reduction in the number of records needed to screen that is achieved by using the program instead of screening records at random. WSS is measured at a given level of recall of relevant records, for example 95%, indicating the work reduction in screening effort at the cost of failing to detect 5% of the relevant records. For some researchers it is essential that all relevant literature on the topic is retrieved; this entails that the recall should be 100% (i.e., WSS@100%). Note that to be sure to detect 100% of relevant records, all records need to be screened, therefore leading to no time savings. We also propose the proportion of Relevant References Found after having screened the first 10% of the records (RRF10%). This is a useful metric for getting a quick overview of the relevant literature.
Results. For every dataset, 15 runs were performed with one random inclusion and one random exclusion (see Figure 2). The classical review performance with randomly found inclusions is shown by the dashed line. The average work saved over sampling at 95% recall for ASReview is 83% and ranges from 67% to 92%. Hence, 95% of the eligible studies will be found after screening only 8% to 33% of the studies. Furthermore, the number of relevant abstracts found after reading 10% of the abstracts ranges from 70% to 100%. In short, our software would have saved many hours of work.
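The inclusion rates quoted for the four datasets follow directly from the reported counts, and a short script makes the degree of class imbalance easy to verify. The short dataset labels used as dictionary keys are our own shorthand, not names used by the software.

```python
# (number of included records, total number of records) as reported above
datasets = {
    "viral mNGS in livestock": (120, 2481),
    "software fault prediction": (104, 8911),
    "posttraumatic stress trajectories": (38, 5782),
    "ACE inhibitors": (41, 2544),
}

inclusion_rates = {
    name: 100 * included / total
    for name, (included, total) in datasets.items()
}

for name, rate in inclusion_rates.items():
    print(f"{name}: {rate:.2f}% relevant")
```

In all four cases fewer than 5% of the records are relevant, which is the imbalance that the balance strategies and the active learning cycle are meant to address.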
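Both metrics can be computed from the order in which records were screened. The sketch below assumes the common formulation WSS@r = r - n_r/N, where n_r is the number of records screened when recall r is first reached and N is the total number of records; random screening is taken to need a fraction r of the records on average. The function names and the example ranking are hypothetical, and the scripts used for the paper may differ in detail.

```python
import math


def records_needed(ranked_labels, recall):
    """Number of records screened when the target recall is first reached."""
    n_relevant = sum(ranked_labels)
    target = math.ceil(recall * n_relevant - 1e-9)  # guard against float noise
    found = 0
    for n_screened, label in enumerate(ranked_labels, start=1):
        found += label
        if found >= target:
            return n_screened
    return len(ranked_labels)


def wss(ranked_labels, recall=0.95):
    """Work Saved over Sampling at the given recall, as a fraction."""
    return recall - records_needed(ranked_labels, recall) / len(ranked_labels)


def rrf(ranked_labels, fraction=0.10):
    """Share of relevant records found in the first `fraction` of records."""
    n_screened = math.ceil(fraction * len(ranked_labels) - 1e-9)
    return sum(ranked_labels[:n_screened]) / sum(ranked_labels)


# Hypothetical screening order of 100 records with 10 relevant ones:
# relevant records appear at positions 1-8, 15 and 20.
ranked = [1] * 8 + [0] * 6 + [1] + [0] * 4 + [1] + [0] * 80

print(f"WSS@95%:  {wss(ranked, 0.95):.2f}")
print(f"WSS@100%: {wss(ranked, 1.00):.2f}")
print(f"RRF10%:   {rrf(ranked, 0.10):.2f}")
```

In this example all relevant records are found after screening 20 of the 100 records, so 75 percentage points of screening effort are saved at 95% recall compared with random order.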

Usability Testing (UX-Testing)
We conducted a series of user experience tests to learn from end users how they experience the software and implement it in their workflow. The study was approved by the Ethics Committee of the Faculty of Social and Behavioral Sciences of Utrecht University (ID 20-104).
Unstructured Interviews. The first user experience (UX) test, carried out in December 2019, was conducted with an academic research team in a substantive research field (public administration and organizational science) that has conducted various systematic reviews and meta-analyses. It was composed of three university professors (ranging from assistant to full) and three PhD candidates. In one 3.5-hour session, the participants used the software and provided feedback via unstructured interviews and group discussions. The goal was to provide feedback on installing the software and testing the performance on their own data. After these sessions we prioritized the feedback in a meeting with the ASReview team, which resulted in the release of v0.4 49 and v0.6 50. An overview of all releases can be found on GitHub 27.
A second UX-test was conducted with four experienced researchers developing medical guidelines based on classical systematic reviews, and two experienced reviewers working at a pharmaceutical non-profit organization who work on updating reviews with new data. In four sessions, held in February-March 2020, these users tested the software following our testing protocol. After each session we implemented the feedback provided by the experts and asked them to review the software again. The main feedback was about how to upload datasets and select prior papers. Their feedback resulted in the release of v0.7 51 and v0.9 52. To analyze the notes, thematic analysis was used, which is a method to analyze data by dividing the information into subjects that all have a different meaning 54, using the software Nvivo 12 55. When something went wrong, the text was coded as "showstopper". When something did not go smoothly, the text was coded as "doubtful". When something went well, the subject was coded as "superb". The features the participants requested for future versions of the ASReview tool were discussed with the lead engineer of the ASReview team and were submitted to GitHub as issues or feature requests.

Systematic UX-Test.
The answers to the quantitative questions can be found at the Open Science Framework 56. The participants (N = 11) rated the tool with a grade of 7.9 (SD = 0.9) on a scale from one to ten (Table 2). The inexperienced users on average rated the tool with an 8.0 (SD = 1.1, N = 6). The experienced users on average rated the tool with a 7.8 (SD = 0.9, N = 5). The participants described the usability test with words such as "helpful", "accessible", "fun", "clear" and "obvious".
The UX-tests resulted in the new releases v0.10 57, v0.10.1 58 and the major release v0.11 59, which is a major revision of the GUI. The documentation has been upgraded to make installing and launching ASReview more straightforward. We made setting up the project, selecting a dataset and finding prior knowledge more intuitive and flexible. In addition, we added a project dashboard with information on your progress and advanced settings.

Conclusion
To help researchers conduct a systematic review or meta-analysis as efficiently and transparently as possible, we designed a system to accelerate the step of screening titles and abstracts. Our software, ASReview, comes with state-of-the-art systems across a wide range of real-world systematic reviewing applications. Based on our experiments, ASReview provides defaults on its parameters which exhibited good performance on average across the applications we examined. However, we stress that in practical applications, these defaults should be carefully examined; for this purpose, the software provides a simulation mode to users. We encourage users and developers to perform further evaluation of the proposed approach in their application, and to take advantage of the project's open source nature by contributing additional developments.
Drawbacks of ML-based screening systems, including our own, remain. First, while the active learning step greatly reduces the number of papers that must be screened, it also prevents a straightforward evaluation of the system's error rates without further onerous labeling. Providing users with an accurate estimate of the system's error rate in the application at hand is therefore a pressing open problem. Second, while, as argued above, the use of such systems is not limited in principle to reviewing, to our knowledge no empirical benchmarks of actual performance in these other situations yet exist. Third, ML-based screening systems automate the screening step only; while the screening step is time-consuming and a good target for automation, it is just one part of a much larger process, including the initial search, data extraction, coding for risk of bias, summarizing results, etc. While some other work, similar to our own, has looked at (semi-)automating some of these steps in isolation 60,61, to our knowledge the field is still far removed from an integrated system that would truly automate the review process while guaranteeing the quality of the produced evidence synthesis. Integrating the various tools that are currently under development to aid the systematic reviewing pipeline is therefore a worthwhile topic for future development.
Possible future research could also focus on the performance of identifying full text articles with different document lengths and domain-specific terminologies, or even other types of text, such as newspaper articles and court cases. When the selection of prior knowledge is not possible based on expert knowledge, alternative methods could be explored. For example, unsupervised learning or pseudo-labeling algorithms could be used to improve training 62,63. In addition, as the NLP community pushes forward the state of the art in feature extraction methods, these are easily added to our system as well. In all cases, performance benefits should be carefully evaluated using benchmarks for the task at hand. To this end, common benchmark challenges should be constructed that allow for an even comparison of the various tools now available. To facilitate such a benchmark, we have constructed a repository of publicly available systematic reviewing datasets 64.
The future of systematic reviewing will be an interaction with machine learning algorithms to deal with the enormous increase of available text. We invite the community to contribute to open source projects such as our own, as well as to common benchmark challenges, so that we can provide measurable and reproducible improvement over current practice.
GitHub 65, and includes documentation (doi:10.5281/zenodo.4287120) 28. The answers to the quantitative questions of the UX-test can be found at the Open Science Framework (OSF.IO/7PQNM) 56.
Active learning starts after: One label.

Retraining: Asynchronous
Stopping: Is currently left to the reviewer.
Software does not have access to user data, because the program runs locally.
Stopping: Is left to the reviewer.
No terms and conditions available.
The Colandr team was contacted and they confirmed that the user can remove data at any time. In the future, user data will be used to improve Colandr, but only if permission is granted by the project owner.
FASTREAD 68
Classifier: SVM.

Model inputs: Title and abstract.
Feature extraction: TF-IDF.

Label options: Relevant; irrelevant
Query strategy: Uncertainty sampling; Certainty sampling. Users are allowed to switch between active learning types after 30 inclusions.
Balance strategy: Mix of weighting and aggressive undersampling.
Active learning starts after: One relevant abstract is retrieved (through querying random abstracts).

Stopping:
The number of relevant abstracts is estimated by semi-supervised learning.
Software does not have access to user data, because the program runs locally.
Model inputs: User-provided key terms and citation (title and abstract).Query strategy: Rayyan predicts a relevancy of a citation on a 5-star scale.The user can order citations by their predicted relevancy.
Active learning starts after: Unknown.
Retraining: Unknown; 'as the user is labelling citations'.
Stopping: When there are no more citations to be labeled or when the model can no longer be improved.

Rayyan Terms of
Such reviews must entail adequate protocols with explicit and reproducible steps, including identifying all potentially relevant papers, extracting data from eligible studies, assessing potential for bias, and synthesizing the results into medical guidelines. Researchers need to screen (tens of) thousands of COVID-19 related studies by hand to find relevant papers to include in their overview. Using ASReview, this can be done far more efficiently: in the first step, researchers select key papers that match their (COVID-19) research question; this starts the active learning cycle, after which the COVID-19 papers most relevant to their research question are presented next. Therefore, a plug-in was developed for ASReview 39 containing three databases which are updated automatically whenever a new version is released by the owners of the data: (1) The Cord19 database, developed by the Allen Institute for AI, with over all publications on COVID-19
UX-Test. In May 2020, we conducted a systematic UX-test. Two groups of users were distinguished: an inexperienced group and experienced users who had already used ASReview. Due to the COVID-19 lockdown, the usability tests were conducted via video calling, with one person giving instructions to the participant and one person observing, an approach called human-moderated remote testing 53. During the tests, one person (SH) asked the questions and helped the participant with the tasks, while the other person (MH), a user experience professional at the IT department of Utrecht University, observed and made notes.
Open Source Community. Finally, the ASReview development team receives continuous feedback from the open science community about, among other things, the user experience. In every new release we implement features listed by our users. Recurring UX-tests are done to keep up with the needs of users and improve the value of the tool.
system uses active learning to train an ML model that predicts relevance from texts using a limited number of labeled examples. The classifier, feature extraction technique, balance strategy, and active learning query strategy are flexible. We provide an open source software implementation,

Figure 2. Results of the simulation study for the results for a study systematically review studies

Table 1. In this table we provide an overview of those tools that implemented active learning and describe what machine learning algorithms have been implemented, which active learning features are available and information about privacy policy. As a starting point we used the systematic review 12 describing ML-aided software tools for systematic reviewing. In Table A1 in the Appendix we provide an overview of all tools found by Harrison et al. and indicate which tools implemented machine learning and/or active learning and are open source. Note that we added FASTREAD, RobotAnalyst and ASReview to the overview, which were not described by Harrison et al.
Service: 3.1: "Rayyan, may use any User data and information to evaluate and improve its performance and expand its services." 3.4: "This Agreement is governed by the laws of the State of Qatar. By accessing this Rayyan website you consent to these terms and conditions and to the exclusive jurisdiction of the Qatar courts in all disputes arising out of such access." 9.2.2: "Rayyan does not own User Content. The User retains the copyright of their Content. …"

Table 2. Implemented classifiers, feature extraction techniques, query strategies and balance strategies available in ASReview. Note that not all combinations are possible. For example, the naive Bayes classifier cannot handle a feature matrix with negative values, so this classifier cannot be combined with Doc2Vec; the LSTM-base and LSTM-pool classifiers work exclusively with embeddingLSTM feature extraction, and vice versa. Technical details are described in our documentation 28.
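The two compatibility constraints named in the Table 2 caption can be illustrated with a minimal sketch. The function name, the set name, and the string labels below are hypothetical and chosen only for illustration; they are not the ASReview API, and the check covers only the two restrictions stated in the caption, not any others that the software may enforce.

```python
# Hypothetical labels and helper; NOT the ASReview API.
# Only the two constraints stated in the Table 2 caption are encoded.

LSTM_CLASSIFIERS = {"lstm-base", "lstm-pool"}


def is_valid_combination(classifier: str, feature_extraction: str) -> bool:
    """Return False for the combinations the caption rules out."""
    # Naive Bayes cannot handle a feature matrix with negative values,
    # so it cannot be paired with Doc2Vec features.
    if classifier == "naive-bayes" and feature_extraction == "doc2vec":
        return False

    # LSTM classifiers and embedding-LSTM features work only with each
    # other: either both are used or neither is.
    uses_lstm = classifier in LSTM_CLASSIFIERS
    uses_embedding = feature_extraction == "embedding-lstm"
    if uses_lstm != uses_embedding:
        return False

    return True


if __name__ == "__main__":
    for clf, fe in [
        ("naive-bayes", "tfidf"),
        ("naive-bayes", "doc2vec"),
        ("lstm-pool", "embedding-lstm"),
        ("lstm-base", "tfidf"),
    ]:
        print(clf, fe, is_valid_combination(clf, fe))
```

Under these assumptions, a check of this kind could run before a simulation starts, so that an unsupported pairing is rejected up front instead of failing during model training.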