Improving Wikipedia verifiability with AI

Verifiability is a core content policy of Wikipedia: claims need to be backed by citations. Maintaining and improving the quality of Wikipedia references is an important challenge, and there is a pressing need for better tools to assist humans in this effort. We show that the process of improving references can be tackled with the help of artificial intelligence (AI) powered by an information retrieval system and a language model. This neural-network-based system, which we call SIDE, can identify Wikipedia citations that are unlikely to support their claims, and subsequently recommend better ones from the web. We train this model on existing Wikipedia references, therefore learning from the contributions and combined wisdom of thousands of Wikipedia editors. Using crowdsourcing, we observe that for the top 10% of citations our system deems most likely to be unverifiable, humans prefer the system's suggested alternatives over the originally cited reference 70% of the time. To validate the applicability of our system, we built a demo to engage with the English-speaking Wikipedia community, and find that for the same top 10% of claims SIDE deems most likely unverifiable, SIDE's first citation recommendation is preferred twice as often as the existing Wikipedia citation. Our results indicate that an AI-based system could be used, in tandem with humans, to improve the verifiability of Wikipedia. The immense number of Wikipedia articles makes it challenging for volunteers to ensure that cited sources support the claims they are attached to. Petroni et al. use an information-retrieval model to assist Wikipedia users in improving verifiability.


Introduction
Wikipedia is one of the most visited websites on the web (Ranking, 2022) and, with half a trillion page views per year (Wikimedia, 2022), constitutes one of the most important knowledge sources today. As such, it is critical that any knowledge on Wikipedia is verifiable: Wikipedia users should be able to look up and confirm claims made on Wikipedia using reliable external sources (Verifiability, 2022). To facilitate this, articles provide inline citations that point to background material supporting the claim. Readers who challenge Wikipedia claims can follow these pointers and verify the information themselves (Piccardi et al., 2020; Lewoniewski et al., 2020; Kaffee & Elsahar, 2021). However, in practice this process can fail: a citation might either not entail the challenged claim, or its source might be questionable. Such claims may still be true, but a careful reader cannot easily verify them with the information at hand in the cited source. Under the assumption that a Wikipedia claim is true, its verification is hence a two-stage process: 1) check the consistency of the existing source; 2) if that fails, search for new evidence, primarily online. Defined as above, verification of Wikipedia claims requires deep understanding of language and mastery of online search. To what extent can machines learn this behaviour? This question is important from the perspective of progress in fundamental AI. For example, verification requires the ability to detect logical entailment in natural language and to convert claims and their context into the best search term for finding evidence, two long-standing problems that have primarily been investigated in somewhat synthetic settings (Bowman et al., 2015; Wang et al., 2018; Camburu et al., 2018; Nie et al., 2019; Pérez-Rosas et al., 2017; Thorne et al., 2018; Thorne & Vlachos, 2018). It is equally important from a practical perspective. A machine verifier can assist Wikipedia editors both by flagging which citations might trigger failed verifications and by suggesting replacement citations in cases where the current ones do not support their respective claims. This can be significant: searching for potential evidence and carefully reading the search results require time and high cognitive effort. Integrating an AI assistant into this process could help to reduce both.
In this work we develop Side, an AI-based Wikipedia citation verifier. Side finds claims on Wikipedia that likely cannot be verified given the current citation and, for those, scans a web snapshot for an alternative. Its behaviour is learnt from Wikipedia itself: using a carefully curated corpus of Wikipedia claims and their current citations, we train a) a retriever component that converts claims and their contexts into symbolic and neural search queries optimised to find candidate citations in a web-scale corpus; and b) a verification model that ranks existing and retrieved citations according to how likely they are to verify a given claim.
We evaluate our model using both automatic metrics and human annotations. To measure the accuracy of our system automatically, we check how well Side recovers existing Wikipedia citations in high-quality articles, as defined by the Wikipedia featured article class. We find that in nearly 50% of the cases, Side returns as its top result exactly the source that is used in Wikipedia. Notably, this does not mean the other 50% are wrong; they are simply not what Wikipedia currently uses as a source.
We also test Side's ability to act as a citation assistant. In a user study, we present existing Wikipedia citations next to the ones that Side produces. Users then assess to what extent the presented citations support the claim, and which citation, Side's or Wikipedia's, would be better for verification. Overall, more than 60% of the time users prefer Side's citations over Wikipedia's, and this percentage grows above 80% for cases in which Side assigns a very low verification score to the Wikipedia citation.

System Architecture
In Figure 1, we provide a high-level overview of Side, showing an example of the decision flow given a Wikipedia claim. In the following, we briefly describe all major components of the system and how they interact with one another. Note that we use the term claim to refer to the sentence (or clause) preceding a Wikipedia citation; any given sentence can contain a multitude of logical claims, and a claim's meaning might depend on its context. The cited documents are represented as a list of passages, i.e., chunks of text with a fixed number of words.
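To make the passage representation concrete, the following minimal sketch chunks a cited document into fixed-length word windows (the 100-word length and the function name are illustrative assumptions; the exact chunk size is not specified here):

```python
def chunk_document(text: str, passage_len: int = 100) -> list[str]:
    """Split a cited document into fixed-length word chunks.

    The 100-word passage length is an assumption for illustration;
    Side represents each cited document as such a list of passages.
    """
    words = text.split()
    return [
        " ".join(words[i:i + passage_len])
        for i in range(0, len(words), passage_len)
    ]
```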

The Retrieval Engine
Given a claim tagged as failing verification by a human editor, or flagged by our verification engine, Side needs to retrieve a list of documents that support the claim. A human verifier would do so by 1) synthesizing a search query based on the claim's context; and 2) executing this query against a search engine. Fundamentally, Side learns to do the same, using both sparse and dense retrieval sub-systems that we explain in more detail below. The claim's context is represented using the sentences preceding the citation, as well as the section title and the title of the enclosing Wikipedia article. We use Sphere (Piktus et al., 2021), a corpus and search infrastructure for web-scale data, as a source of candidate web pages. Classic sparse and neural dense approaches are known to have complementary strengths (Mao et al., 2020), and hence we merge their results to produce the final list of recommended evidence.
The sparse retrieval sub-system uses a seq2seq model (Lewis et al., 2019; Mao et al., 2020) to translate the citation context into query text, and then matches the resulting query, a sparse bag-of-words vector, against a BM25 index (Robertson et al., 1995; Baeza-Yates et al., 1999; Manning et al., 2008; Robertson & Zaragoza, 2009; Lin et al., 2021) of Sphere. We train the seq2seq model using data from Wikipedia itself: the target queries are set to be the web page titles of existing Wikipedia citations. In practice, we enrich the generated queries with the sentence preceding the citation and the Wikipedia article title. The dense retrieval sub-system is a neural network which learns from Wikipedia data to encode the citation context into a dense query vector (Wu et al., 2019; Karpukhin et al., 2020; Maillard et al., 2021; Oğuz et al., 2021; Luan et al., 2021). This vector is then matched against the vector encodings of all passages in Sphere, and the closest ones are returned. The context and passage encoders are trained such that the context and passage vectors of existing Wikipedia citation and evidence pairs are maximally similar (Karpukhin et al., 2020).
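A minimal sketch of how the two candidate lists could be merged before verification (the interleaving strategy and the function signature are illustrative assumptions; the text above only states that sparse and dense results are combined):

```python
def merge_results(sparse_hits: list[str], dense_hits: list[str],
                  k_each: int = 100) -> list[str]:
    """Interleave the top-k_each documents from the sparse (BM25) and
    dense retrievers, dropping duplicates while preserving rank.
    The interleaving order is an illustrative choice."""
    merged, seen = [], set()
    for s, d in zip(sparse_hits[:k_each], dense_hits[:k_each]):
        for doc in (s, d):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged
```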

The Verification Engine
Given a claim and a possible evidence document, either existing on Wikipedia or proposed by the retrieval engine, a human would carefully evaluate to what extent the claim is supported by the provided evidence. This is the role played by our verification engine, a neural network that takes the claim and a document as input and predicts how well the document supports the claim. For efficiency reasons, it operates at the passage level and calculates the verification score of a document as the maximum over its per-passage scores. The verification scores are calculated by a fine-tuned BERT (Devlin et al., 2019) transformer that takes the concatenated claim and passage as input. This architecture is akin to prior work on textual entailment in natural language inference (MacCartney & Manning, 2008), i.e., testing whether a particular premise supports or contradicts a hypothesis.
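As an illustration, the per-passage scoring with a max-pooled document score could be sketched as follows (the checkpoint name and the single-logit scoring head are placeholder assumptions, not the released Side model):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint; the actual Side verifier is a fine-tuned BERT.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1)  # single verification score

def verification_score(claim: str, passages: list[str]) -> float:
    """Document score = max over per-passage claim/passage scores."""
    scores = []
    with torch.no_grad():
        for passage in passages:
            inputs = tokenizer(claim, passage, return_tensors="pt",
                               truncation=True, max_length=512)
            scores.append(model(**inputs).logits.squeeze().item())
    return max(scores)
```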
The verification engine is optimised to rank claim-document pairs in order of verifiability rather than to make verified versus failed-verification decisions. This is motivated by the way we envision Side being used in practice: we want to prioritise existing claims for humans to check, starting with those least likely to be supported by their current evidence, and to highlight recommended evidence for a given claim, starting with the documents most likely to support it. To train the verification engine, we use a training objective that rewards models for ranking existing Wikipedia evidence higher than evidence returned by our retrieval engine. Since some existing Wikipedia evidence is of poor quality, a core motivation behind this work, this training signal can be noisy; nevertheless, we found that, on average, it still provides a meaningful signal. We test this empirically in the next section.
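One plausible instantiation of such a ranking objective is a margin loss that pushes the existing Wikipedia evidence above retrieved candidates (the margin value and the pairwise formulation are assumptions for illustration):

```python
import torch

def ranking_loss(pos_score: torch.Tensor,
                 neg_scores: torch.Tensor,
                 margin: float = 1.0) -> torch.Tensor:
    """Reward the model when the existing Wikipedia evidence (pos_score)
    ranks above every retrieved candidate (neg_scores).
    The margin value is an illustrative assumption."""
    return torch.clamp(margin - (pos_score - neg_scores), min=0).mean()
```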

Evaluation and results
Evaluating the performance of our system is challenging because we cannot be certain that existing citations are always accurate, and because of the lack of annotations for citations that fail verification. Therefore, we first evaluate the components of our system in isolation by addressing the following two questions: 1) given a Wikipedia claim, can our retrieval solutions surface the existing citation source from more than 100M web articles? 2) Is our verification engine able to assign low scores to citations marked as failing verification in Wikipedia? After investigating these two questions, we conduct a large-scale human annotation campaign to evaluate the overall system.

Experimental Data and Setting
We collect WAFER, a large-scale dataset of English Wikipedia inline citations (≈3.8M instances; see Table 3 for statistics), which are aligned to a snapshot of the web to obtain the full textual content of the cited sources. Each instance in WAFER contains metadata from the claim's article, the text around the citation within the article (with a marker indicating the citation position), and metadata of the cited source, including its title and full textual content (see Figure 5 for an example). We create a cross-validation split at the article level, not the citation level, to avoid potential leakage from test into training data.
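For concreteness, a WAFER instance can be pictured roughly as follows (field names are illustrative, not the released schema; see Figure 5 for a real example):

```python
from dataclasses import dataclass

@dataclass
class WaferInstance:
    # Metadata from the claim's Wikipedia article.
    article_title: str
    section_title: str
    # Text surrounding the citation, with a marker (e.g. "[CIT]")
    # indicating where the citation occurs.
    context: str
    # Metadata and full textual content of the cited source.
    source_title: str
    source_text: str
```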
Both the Wikipedia snapshot we consider (i.e., from KILT (Petroni et al., 2021)) and the web snapshot (i.e., a CCNet (Wenzek et al., 2019) dump from Sphere (Piktus et al., 2021), which contains 134M web articles split into 906.3M passages) are from August 2019. We use Sphere's web snapshot as the corpus for retrieval. Aligning the citations in the Wikipedia snapshot with Sphere's web snapshot yields ≈250k retrievable citations. From those we sample ≈4.5k each for testing and development, making all the cited documents in our test and dev sets retrievable from the Sphere corpus. To increase the size of the training data, we match the remaining unaligned citations in the Wikipedia snapshot against several other Common Crawl snapshots from 2017 to 2019, collecting an additional ≈3.5M citations which are not retrievable from the Sphere corpus but which can be used for training models.
We distinguish two types of Wikipedia articles: featured articles (articles, 2022) and non-featured articles. Featured articles are a tiny fraction (i.e., 0.09%) of articles chosen by Wikipedia's editors as examples of high quality. Given their limited number, we use featured articles only for evaluation (≈16% of test and dev citations). The remaining instances of the evaluation data are sampled from non-featured articles, which can vary in quality in terms of writing or verifiability. We do not include in these datasets citations marked with a failed verification template (verification, 2022), which indicates that the source does not support what is claimed in the Wikipedia article. We set these citations aside in dedicated dev and test sets (i.e., fail-dev and fail-test) in order to evaluate the ability of models to detect citations that fail verification.
We use popular retrieval metrics to measure how highly the gold cited document is ranked in the retrieved results. As our retrieval is passage-based, the highest-ranked passage of a document determines its rank. We consider precision-at-1 (P@1), that is, the percentage of evaluation instances in which the originally cited document was ranked in the first position among all retrieved documents. Additionally, we use success-rate-at-k (SR@k), sometimes also referred to as HITS@k, which is the percentage of cases in which the originally cited document was amongst the top-k documents. We also use the precision-recall curve, which measures performance in terms of precision when recall is fixed at a certain level.
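These metrics can be computed directly from the ranked document lists, as in this minimal sketch (assuming a single gold document per instance):

```python
def precision_at_1(ranked: list[list[str]], gold: list[str]) -> float:
    """Fraction of instances whose gold document is ranked first."""
    return sum(r[0] == g for r, g in zip(ranked, gold)) / len(gold)

def success_rate_at_k(ranked: list[list[str]], gold: list[str],
                      k: int) -> float:
    """Fraction of instances whose gold document appears in the top k."""
    return sum(g in r[:k] for r, g in zip(ranked, gold)) / len(gold)
```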

Retrieval evaluation
We report our results in Figure 2. We note that the sparse retrieval solution outperforms the dense approach for retrieval from the web, which is consistent with previous observations (Piktus et al., 2021). However, we obtain our best overall SR@200 by combining 100 results from each, given that they are highly complementary (Mao et al., 2020) (see Figure 2a); this ensemble is what we use to retrieve passages to feed into the verification engine component. Notably, the verification engine component surfaces the original citation document in the highest-ranked position nearly 50% of the time (see Figure 2b). However, these numbers have to be interpreted in the context of our background corpus: despite containing ≈900M passages from more than 100M documents, it can only approximate a real-world scenario where evidence is to be sought on the open web.
In general, retrieving evidence for claims in featured articles is more challenging than for other claims in Wikipedia: for example, we observe a large difference of -7.0%/-10.4% P@1 (for dense/sparse retrieval) between featured and non-featured articles. One hypothesis is that there exists an intrinsic popularity bias associated with featured content. Featured content might often correlate with popular topics, which in turn means that more sources on the web contain relevant information. In contrast, claims in more niche articles have much less coverage on the web, and therefore the originally cited source is easier to surface. Another factor is that featured articles are typically edited much more frequently, which is how they achieved their high quality; this in turn could lead to greater deviation from the original phrasing of the cited source. Assuming that dense retrievers are better at recognising paraphrases, we would expect a smaller performance drop between non-featured and featured articles for dense than for sparse retrieval, which is indeed the case.
The verification engine considerably boosts the accuracy of the retrieval component and almost closes the gap for featured articles, suggesting a greater ability to identify evidence, even among a large set of relevant documents. This performance can be explained by its capacity for fine-grained language comprehension: the model can directly compare the contents of the two texts using a cross-attention mechanism, overcoming the representational decomposability gap suffered by the retrievers (Seo et al., 2019). Another relevant factor is that simple, helpful indicators, such as quoted phrases from the cited source, seem easier to detect with token-level comparison.

Detecting Failed Verification
Our goal in this analysis is to measure to what degree the score of the verification engine can be used to detect whether a citation fails verification. To this end, we rank the union of test citations in featured articles and test failed-verification citations. An ideal system would place all failed verifications at the bottom of the ranked list and featured citations at the top. To compute the rank, we consider two different instantiations of the verification engine, operating either at the passage or at the document level. As many failed citations include a link to an over-generic URL (e.g., a generic newspaper website instead of a specific page covering the claim), we include a simple baseline based on the depth of a source URL (i.e., the number of elements in the URL path). In the passage-level solution, we independently compute a score for each passage in a document with the verification engine and rank citations according to the maximum score. For the document-level approach, we feed as much text as possible from the source document (i.e., on average the first 2 or 3 passages) as input to a seq2seq model (Lewis et al., 2019) and use the prediction score for the ranking.
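The URL-depth baseline amounts to counting path elements, as in the following sketch (normalization details are an assumption):

```python
from urllib.parse import urlparse

def url_depth(url: str) -> int:
    """Number of non-empty elements in the URL path, e.g.
    https://news.example.com/ -> 0 (generic, more likely to fail
    verification) versus .../2019/03/story-title -> 3 (specific)."""
    path = urlparse(url).path
    return len([p for p in path.split("/") if p])
```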
The resulting precision-recall curve is shown in Figure 2c. Overall, the passage-level verification engine performs very well; if we only consider a conservative recall of 15%, for instance, ≈90% of the lowest-ranked citations are indeed failed verifications. Notably, these results are achieved without any explicit supervision on failed-verification instances, given that the verification engine is trained only on positive examples. A document-level approach leads to worse results (i.e., ≈80% precision at 15% recall), mainly due to the impossibility of considering the whole document (given the model's architectural constraints on maximum input size). Considering URL depth turns out to be a remarkably solid baseline. To further investigate this aspect, we study the distribution of URL depths in our data (see Figure 7) and find that citations in featured articles tend to be deep (i.e., very specific URLs) while citations marked as failed verification are usually shallow (i.e., very generic URLs).

Evaluation of the final system
To test the performance of our final system, we perform a two-stage human assessment: (1) a large-scale crowd annotation campaign, followed by (2) a smaller-scale fine-grained evaluation. First, we select claims in the test set for which Side outputs a citation source with a higher score than what is currently on Wikipedia. We then ask crowd annotators to express their preference as to which of the two (i.e., Side's suggested citation or Wikipedia's) better supports a given claim. Additionally, we ask them to assess whether a source contains enough evidence to support the claim, partial evidence (meaning that only parts of the claim are supported by the source), or no evidence whatsoever. To keep the annotation load tractable, we use our verification engine component to select a single passage from each source, making sure to consider overlapping passages for Wikipedia sources so as to avoid cutting evidentiary sentences.
Results are reported in Figure 3. We note that both the preference for Side's suggested source (Figure 3a) and the evidence annotations for Wikipedia citations (Figure 3b) are proportional to the ranker score associated with the existing Wikipedia citation: the lower the score, the more preferences for Side and the less evidence found within Wikipedia. These results suggest that the ranker score might be a valid proxy for the presence (or absence) of evidence in a citation, and might help surface cases that require attention from Wikipedia editors. To gauge the noise introduced by automatically selecting a single passage for each source, we conduct a control study on more than 500 sources where we ask annotators whether they prefer the selected passage (i.e., the top-scored one) over a random passage from the same source. We find that in over 80% of the cases annotators prefer the selected passage, with an Inter-Annotator Agreement of 0.27 (Fleiss' κ). To validate crowd annotator accuracy, we additionally annotate more than 100 cases where evidence was not found in the Wikipedia citations. We find (see Table 1 for the complete picture) that sometimes the evidence is in the source but not within the crawled text (e.g., multimedia content); other times, it is spread across multiple passages (which the current system cannot detect, but which we plan to tackle in future work). Overall, more than 40% of the time no evidence can be found in the reference to verify a Wikipedia claim.

Finally, we build a demo of Side and engage with the English-speaking Wikipedia community, asking users whether they would use the citation already present on Wikipedia, the top-1 citation suggested by Side, or neither of the two to verify a given claim. We do not reveal the source of a citation in the user interface (i.e., Wikipedia or Side), select claim-citation pairs on Wikipedia that are likely to fail verification (i.e., with a verifier score below 0), and allow access to the full text of each citation (instead of a single passage). The results (see Figure 4) reveal that Side can indeed select claim-citation pairs that fail verification: users selected the Wikipedia citation in only 16% of cases, compared with the 65% of cases where either Side's recommendation or neither of the two was preferred. Moreover, 26% of the time Side provides a top-1 recommendation that is judged appropriate by the community. We additionally conduct a sign test between Side and Wikipedia preferences, resulting in a P value of 0.0178 and a two-tail P value of 0.0357. So far, 43 authenticated Wikipedia users have participated in our study, for a total of 106 annotations, with an average of 1.8 annotations per claim. We plan to keep collecting annotations through our demo and to update these figures in future iterations of the paper.

Related Work
There is a large, passionate and engaged community that actively cares about, studies, and works to improve the verifiability of information on Wikipedia. The WikiProject Reliability (Reliability, 2022), for instance, provides a set of tools, resources and reports aimed at improving the reliability of Wikipedia articles. One of these tools is Citation Hunt (Gonçalves, 2022), which allows humans to check Wikipedia claims that have been flagged as not being backed by a reliable source and to propose a better citation. We believe the technology presented in this paper can be integrated with similar tools to surface more unverified claims and suggest potential alternative citations for a human to validate.
Text-based classifiers able to detect claims needing citations (Redi et al., 2019; Chou et al., 2020) have received a lot of attention from both the scientific and the Wikimedia communities. We believe Side can be combined with such tools to recommend to Wikipedia editors a set of potential sources for claims needing a citation. Several studies have also been conducted on user interactions with citations (Piccardi et al., 2020; Lewoniewski et al., 2020; Piccardi et al., 2021; Kaffee & Elsahar, 2021; Zagorova et al., 2022) that are tangential to our work. A number of papers approach citation recommendation for Wikipedia from different angles, from recommending citations from linked articles (Jana et al., 2018) to citation span detection (Fetahu et al., 2017), amongst other efforts. More broadly, citation retrieval and paper/source recommendation have also received attention in the scientific literature domain for many decades (McNee et al., 2002; Ren et al., 2014; Bhagavatula et al., 2018; Chou et al., 2020), albeit with less of a focus on the verifiability of existing citations, and with citations drawn from much smaller and less diverse sources than the open web; see Färber & Jatowt (2020) for a recent comprehensive review.
Several works have investigated the ability of AI to generate missing Wikipedia articles from scratch (Liu et al., 2018; Prabhumoye et al., 2019; Fan & Gardent, 2022; Kaffee et al., 2022). There exist AI tools, such as Scribe (2022), that help editors bootstrap Wikipedia articles for underrepresented languages using these technologies. The Side engine can complement these systems by providing suggestions of supporting evidence from the web to back the article generation.
Finally, there exists a large body of research focused on fact-checking Wikipedia claims (Thorne & Vlachos, 2018; Thorne et al., 2018, 2019; Schuster et al., 2021; Trokhymovych & Saez-Trumper, 2022). However, most of the available resources are synthetically created to evaluate AI systems in a controlled environment. We believe that using real-world supervision (e.g., from Wikipedia citations) could be key to unlocking broader applicability for these systems.

Discussion
We introduce Side, an AI-based system for improving the quality and verifiability of Wikipedia citations. Building on recent advances in natural language processing, we demonstrate that machines can help humans find better citations, a task requiring understanding of language and mastery of online search. While previous works (Bowman et al., 2015; Wang et al., 2018; Camburu et al., 2018; Nie et al., 2019; Pérez-Rosas et al., 2017; Thorne et al., 2018; Thorne & Vlachos, 2018) have shown the ability of large neural networks to perform well on natural language understanding tasks, these results were mostly obtained for well-specified tasks, on synthetic datasets specifically created for evaluating AI systems. Here we show similar results in a real-world scenario, implying noisier data and a more loosely defined task.
While our results are promising, and we believe our system could already be used to improve Wikipedia, there are a variety of future research directions that could be pursued. For instance, we only considered references corresponding to web pages, but Wikipedia also cites books, scientific articles and other kinds of documents. These include modalities other than text, such as images and videos. To fully assess the quality of Wikipedia references, Side would need to become multi-modal. Second, our system currently only supports the English language, while Wikipedia exists in more than two hundred languages. Making Side multilingual raises interesting research questions, such as the capability of performing cross-lingual citation improvements: given a claim in one language, if the system cannot find good evidence in that particular language, can it find references in other languages?
Finally, our work currently assumes that Wikipedia claims are verifiable, and only improves the quality of the references for existing claims. A natural extension of our work would be to detect claims that are not verifiable and flag them for review by human editors. This comes with challenges, as one way to show that a claim is unverifiable is to find contradicting evidence. Unfortunately, Wikipedia currently does not contain such information, and thus training AI-based systems to perform this task is not straightforward. However, we believe that Side could be a first step towards surfacing unverifiable claims: if Side cannot find good evidence for a claim, the claim might be impossible to verify. We report one example of such a claim in the Appendix (Table 2), showing that a lack of good evidence from Side could be an indication of unverifiability.
We release all data, code and models described in this paper. We hope that this work could be used in a broader context than just Wikipedia, for example helping humans to perform fact-checking. More generally, we believe that this work could lead to more trustworthy information online.
Dense retriever DPR is a method that learns to embed queries and documents as low-dimensional dense vectors. The basic building block of DPR is a BERT-like neural encoder, which consumes a sequence of tokens and predicts one dense vector. DPR consists of two such neural encoders, one for the query and one for a document's passage. DPR is then trained on a dataset with instances consisting of (query, correct document) tuples. The training objective is to maximize the inner product between the query vector and the passage vectors of a correct document, and to minimize the inner product for incorrect documents. In contrast to BM25, DPR can learn which parts of the text are likely the important elements. Another advantage is that DPR is typically stronger at retrieving passages containing rephrased versions of the claim.
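A minimal sketch of the DPR similarity and its standard training objective, using other in-batch passages as the incorrect documents (a common DPR setup, assumed here for illustration):

```python
import torch
import torch.nn.functional as F

def dpr_scores(query_vecs: torch.Tensor,
               passage_vecs: torch.Tensor) -> torch.Tensor:
    """Inner-product similarity between query and passage embeddings.
    query_vecs: (batch, dim); passage_vecs: (batch, dim), where
    passage i is the correct document for query i."""
    return query_vecs @ passage_vecs.T  # (batch, batch)

def dpr_loss(query_vecs: torch.Tensor,
             passage_vecs: torch.Tensor) -> torch.Tensor:
    """Maximize the inner product with the correct passage while
    treating the other in-batch passages as negatives."""
    scores = dpr_scores(query_vecs, passage_vecs)
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
```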
Training Many components of our system, such as the dense retriever and the verification engine, are based on neural networks requiring examples to be trained. We propose to leverage the scale of Wikipedia, and its millions of existing citations, to build a training set for our models. It should be noted that the obtained data is noisy, as existing citations might fail verification, and determining whether it can be used to train our system is an interesting research question. Moreover, our system processes references at the passage level, while our training data corresponds to pairs of claims and documents. Thus, we train the retriever and the verification engine using an expectation-maximization algorithm, modeling the passage containing the evidence as a latent variable. Finally, our data only contains positive examples of claims and references. A standard solution for training retrievers is to mine negative examples, and we follow this approach here. While this works well for training retrievers, it is unclear how well this supervision works for training the verification engine, and in particular for determining whether an existing reference fails verification for a particular claim. Indeed, the problem of ranking a set of candidate documents for a particular claim is different from ranking existing pairs of documents and claims.
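One way to realize this expectation-maximization training is a hard-EM loop in which the current model's top-scoring passage is treated as the latent evidence; the sketch below assumes illustrative `model.score` and `model.loss` interfaces rather than the actual implementation:

```python
def hard_em_step(model, claim: str, passages: list[str], optimizer):
    """One hard-EM update: the E-step picks the passage the current
    model scores highest as the latent evidence; the M-step takes a
    gradient step treating that passage as the positive example."""
    # E-step: select the most likely evidence passage.
    best = max(passages, key=lambda p: model.score(claim, p))
    # M-step: train on the (claim, selected passage) pair.
    loss = model.loss(claim, best)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```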

Figure 1: The decision flow of Side, from a claim on Wikipedia to a suggestion for a new citation: (1) the claim is sent to the Sphere retrieval engine, which produces a list of potential candidate documents from the Sphere corpus; (2) the verification engine ranks the candidate documents and the original citation with respect to the claim; (3) if the original citation is not ranked above the candidate documents, a new citation from the retrieved candidates is suggested. Note that the score of the verification engine can be indicative of a potential failed verification, as in the reported example.
Figure 2: Automatic evaluation of Side components on the WAFER test set. (a) Percentage of times our retrievers surface the gold source among the top-200 results, for citations in featured and other Wikipedia articles; the verification engine bar (green) combines sparse and dense retrievers, 100 passages each. (b) Accuracy in surfacing the gold source in the first position, for citations in featured and other articles; the verification engine (green bar) takes as input a combination of 100 passages from the sparse and 100 from the dense retriever and reranks them. (c) Precision versus recall in detecting citations marked as failed verification against citations in featured articles; we compare a passage-level versus a document-level approach for the verification engine, and a baseline using the depth of the cited URL.
Figure 3: Crowd annotator evaluation for 2016 claims in the WAFER test set for which Side produces a citation with a higher evidence score than the existing Wikipedia citation. We collect 5 annotations per claim and report majority-voting results, bucketed according to the verification engine score associated with the existing Wikipedia citation (bucket size reported on top). (a) Annotators' preference for citations suggested by Side versus those on Wikipedia for a given claim, without knowing their identity; Fleiss' κ Inter-Annotator Agreement = 0.2. (b) Evidence annotations for Wikipedia citations: (1) enough to verify the claim; (2) the claim is only partially verified; (3) no evidence; Fleiss' κ Inter-Annotator Agreement = 0.11.

Figure 4: Annotations of Wikipedia authenticated users via our demo. P value = 0.0178.

Figure 5: Example citation from the WAFER dataset.
Figure 6: Crowd annotator evaluation for 2016 claims in the WAFER test set for which Side produces a citation with a higher evidence score than the existing Wikipedia citation. We collect 5 annotations per claim and report majority-voting results, bucketed according to the evidence ranker score (bucket size reported on top). (a) Crowd annotators' preference for citations suggested by our system versus those present on Wikipedia for a given claim; Fleiss' κ Inter-Annotator Agreement = 0. (b) Evidence annotations for Side citations: (1) enough to verify the claim; (2) the claim is only partially verified; (3) no evidence; Fleiss' κ Inter-Annotator Agreement = 0.09.