Predictive values, uncertainty, and interpretation of serology tests for the novel coronavirus

Antibody testing in the coronavirus era is frequently promoted, but the statistics underlying test validation have come under increased scrutiny in recent weeks. We provide calculations, interpretations, and plots of positive and negative predictive values under a variety of scenarios. Prevalence, sensitivity, and specificity are estimated within ranges of values from researchers and antibody test manufacturers. Illustrative examples are highlighted, and interactive plots are provided in the Supplementary Information. Implications are discussed for society overall and across diverse locations with different levels of disease burden. Specifically, the proportion of positive serology tests that are false can range from 3% to 88% across places with different proportions of infected people in the population, while the false negative rate is typically under 10%.


Background
This section outlines key statistical definitions related to serology testing. Readers familiar with testing characteristics may skip to "Results". Definitions are included in the Supplementary Information. Additional details on these concepts and examples relevant to serology may be found elsewhere [25][26][27][28].
Two properties of serology tests quantify how well the tests perform on samples with known antibody status in a laboratory. Sensitivity is the probability that a serology test correctly classifies a sample with antibodies for SARS-CoV-2. According to the FDA, the sensitivity of a test refers to its "ability to identify those with antibodies to SARS-CoV-2" and "can be estimated by determining whether or not it is able to detect antibodies in blood samples from patients who have been confirmed to have COVID-19 with a nucleic acid amplification test." Specificity is the probability that a test correctly classifies uninfected samples as lacking antibodies for SARS-CoV-2. Similarly, the specificity of a test refers to its "ability to identify those without antibodies to SARS-CoV-2".
Prevalence estimates. Due to the lack of available diagnostic tests in the US, official counts of cases are likely undercounted 29,30. Additionally, people with asymptomatic infections are unlikely to seek medical care or diagnostic testing and are likely excluded from official counts. Consequently, reliable prevalence estimates are limited. Prevalence, which affects predictive value estimates, can be considered unknown and varies over time.
Emerging research is beginning to estimate population prevalence. Over the period from March 31 to April 7, by one estimate 31 , the prevalence by state ranged from about 0.4% in Alaska, Hawaii, Kentucky, and West Virginia to 8.5% in New York, with a median prevalence of 0.9%. Another estimate 30 from April 11 found infection proportions within states spanning from 0.1% in rural states to 7.0% in New York, and an overall US prevalence of 1.2%. A third group proposes a method 29 with estimates that could indicate a prevalence of up to 10% of the population as of April 4, 2020. A collection of case studies is highlighted in "Application to specific locations". In brief, prevalence estimates for specific cities, states, and counties during the early stage of the pandemic in the spring of 2020 ranged from less than 1% to about 30% in especially hard-hit areas, such as Boston and New York City. According to more recent estimates as of September 2020 by the US Centers for Disease Control and Prevention 32 , most states had seroprevalence estimates ranging from 1% to 10%. As of the writing of this paper, only four states have estimates below 1% and five have estimates exceeding 10%.
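The quantities defined above combine via Bayes' theorem into the predictive values analyzed in "Results". Writing Se for sensitivity, Sp for specificity, and p for prevalence (this shorthand is ours; formal definitions appear in the Supplementary Information):

PPV = Se * p / (Se * p + (1 - Sp) * (1 - p))
NPV = Sp * (1 - p) / (Sp * (1 - p) + (1 - Se) * p)

The false positive rate (FPR) among positive results is 1 - PPV, and the false negative rate (FNR) among negative results is 1 - NPV.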
General interpretation in the context of antibodies testing. In this section we investigate ranges of plausible values of sensitivity and specificity for available antibody tests and of prevalence for relevant areas, and we compare the resulting predictive values. Predictive values are examined above minima of 80% sensitivity and 90% specificity, reflecting the reported values for FDA-authorized serology tests described in "Analysis of individual antibodies tests". In general, the false negative rate is low and the false positive rate is highly variable for available serology tests. Figure 1 plots NPV for these specificity and sensitivity values and for prevalence ranging from 1 to 30%, the range currently reported elsewhere as discussed in "Prevalence estimates" and "Application to specific locations". Under these scenarios, the minimum NPV was 91.3%, indicating that the false negative rate was less than 10% in all scenarios. Thus, NPV should be high, and FNR should be low, for all serology tests within similar ranges of sensitivity, specificity, and prevalence. In other words, negative serology tests have a high likelihood of accurately reflecting a lack of antibodies in the general population of non-infected individuals.
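The NPV values behind Fig. 1 follow directly from Bayes' theorem. A minimal sketch in Python (the paper's own computations used R and MKmisc; the function below is ours):

```python
def npv(sens: float, spec: float, prev: float) -> float:
    """Negative predictive value: P(no antibodies | negative test)."""
    true_neg = spec * (1 - prev)     # uninfected and correctly negative
    false_neg = (1 - sens) * prev    # infected but testing negative
    return true_neg / (true_neg + false_neg)

# Worst case over the stated ranges: 80% sensitivity, 90% specificity,
# 30% prevalence -- the minimum NPV reported in the text (91.3%).
worst_npv = npv(sens=0.80, spec=0.90, prev=0.30)
```

Because NPV increases with specificity and sensitivity and decreases with prevalence, the minimum over the plotted ranges occurs at this corner.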
By contrast, Fig. 2 shows a corresponding plot spanning a wide range of plausible PPV values. PPV increases with prevalence and is low when antibodies are rare in the population. At a fixed prevalence, PPV rises quickly with specificity. Higher sensitivity also improves PPV, although less sharply than specificity does at a given prevalence. For example, an area with 10% prevalence would have 47.1% PPV for a test with 80% sensitivity and 90% specificity, 66.7% PPV for a test with 90% sensitivity and 95% specificity, and 91.3% PPV for a test with 95% sensitivity and 99% specificity. Equivalently, the false positive rates would be 52.9%, 33.3%, and 8.7%. In areas with a 30% infection rate, the same tests would yield respective PPVs of 77.4%, 88.5%, and 97.6%, and false positive rates of 22.6%, 11.5%, and 2.4%. Yet, if the prevalence is 1%, then PPV could reach 49.0% for 95% sensitivity and 99% specificity or fall as low as 7.5%, indicating that only 7.5% of people with positive serology tests in fact possess antibodies. Put another way, for tests with 80% sensitivity and 90% specificity in locations with 1% prevalence, about 93% of people with positive serology tests would be expected to lack antibodies for SARS-CoV-2!
Figure 1. NPV for prevalence (0-0.3), specificity (0.9-1), and sensitivity (0.8-1). Sensitivity is denoted by color in the bar on the right, with lighter colors denoting lower sensitivity and darker colors denoting higher sensitivity. All parameters are reported as decimals ranging from 0 to 1. NPV exceeded 0.9 for all input parameters.
Figure 2. PPV for prevalence (0-0.3), specificity (0.9-1), and sensitivity (0.8-1). Sensitivity is denoted by color, with lighter colors denoting lower sensitivity and darker colors denoting higher sensitivity. All parameters are reported as decimals ranging from 0 to 1. PPV varied widely based on different parameters, ranging from 0.07 to 1.
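The worked PPV figures above follow from the same Bayes calculation; a minimal Python sketch (our function, not the paper's R code):

```python
def ppv(sens: float, spec: float, prev: float) -> float:
    """Positive predictive value: P(antibodies | positive test)."""
    true_pos = sens * prev
    false_pos = (1 - spec) * (1 - prev)
    return true_pos / (true_pos + false_pos)

# The three hypothetical tests discussed in the text at 10% prevalence:
ppv_a = ppv(0.80, 0.90, 0.10)    # ~0.471
ppv_b = ppv(0.90, 0.95, 0.10)    # ~0.667
ppv_c = ppv(0.95, 0.99, 0.10)    # ~0.913
# At 1% prevalence even a weak-to-moderate test degrades sharply:
ppv_low = ppv(0.80, 0.90, 0.01)  # ~0.075, i.e. ~93% of positives are false
```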
www.nature.com/scientificreports/
Analysis of individual antibodies tests. Figures 3 and 4 highlight the relationship between PPV and prevalence for each of the 11 serology tests granted an EUA by the US FDA in the spring of 2020 that retained approval in November 2020. Similar figures for the remaining 50 tests are provided in the Supplementary Information. The figures illustrate the known relationship that PPV should be lower in populations with lower prevalence 28 , and that PPV increases more rapidly with increasing specificity than with increasing sensitivity.
At low prevalence, such as 1%, many tests exhibit low PPV and high rates of false positives (Table 1). Given 1% prevalence, the upper limits of the false positive rates exceed 60% for 45 (74%) tests and exceed 50% for 49 (80%) tests. Thus, in regions with 1% prevalence, scenarios in which 50% or more of positive serology tests correspond to people lacking SARS-CoV-2 antibodies are within the realm of reasonable expectations. In addition, some tests with estimated specificity below 97% have poor estimated false positive rates even though their reported precision is high. Tests by Cellex, Megna, Biohit, JiangSu, and TBG are estimated to have about 80% false positives at 1% prevalence, with even the lower bounds on their FPR exceeding 66%. In such low prevalence populations, on average, anywhere between 6 and 9 out of 10 positive tests by these companies are expected to be false.
Results for all tests improve with prevalence, but overall false positive rates remain concerning. Although only nine point estimates for false positive rates exceed 20% assuming 10% prevalence, which is higher than most currently estimated infection rates [30][31][32], the upper bound on the false positive rate exceeds 20% for 39 out of 61 tests. Moreover, only 15 (25%) tests from six companies-Abbott (3 of 4, excluding Alinity IgG), Ortho-Clinical (IgG and Pan-Ig), Siemens (all 6 varieties), Beckman Coulter (IgG and IgM), BioMerieux (IgG), and Roche-have upper bounds on false positive rates below 10% when the prevalence is 10%. In other words, 46 out of 61 tests could reasonably produce at least 1 false positive out of 10 positive tests if the regional prevalence is 10%.
Estimates or confidence bounds of exactly 100%, reported for many tests, are shown for completeness. These estimates should be interpreted with caution, as the studies establishing sensitivity and specificity were done on small samples 33 .
Table 1 notes: Interval estimates for the false positive rates are the minimum and maximum values of all possible false positive rates calculated at the specified prevalence level for the corresponding test, based on all possible estimates of sensitivity and specificity within the 95% confidence intervals. *n refers to the number of samples used to calculate the pretest probabilities. The number on the left refers to the number of samples infected with SARS-CoV-2 that were tested to estimate sensitivity. The number on the right refers to the number of control samples used to estimate specificity. Test types are abbreviated as follows: "G" denotes IgG, "M" denotes IgM, "C" denotes combined IgG/IgM, and "P" denotes Pan-IgG.
The Roche test 36 indeed has high PPV and low false positive rates even at 1% prevalence, with an upper limit for the false positive rate of 25.2%. The Roche test is the only test available by the end of May 2020 that could have reasonably claimed to expect more true positive results than false positive results in populations with low prevalence (1% or less). However, it would have still been reasonable to expect that up to one quarter of positive test results could have come from patients lacking antibodies to SARS-CoV-2. Given the sample size 33 of 29 for calculating sensitivity, the associated uncertainty could be compounded in the estimates of PPV and FPR, rendering the upper bounds critical to measure and understand. Even 6 months later, only one test (Siemens Atellica IgG) has improved on this figure, with the smallest upper confidence limit for FPR of 24.5% at 1% prevalence and a corresponding FPR point estimate of about 9%.
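The interval estimates described above can be sketched by evaluating the FPR at the extremes of the confidence intervals. A Python illustration; the CI endpoints below are our assumptions, chosen to be consistent with the 25.2% upper limit quoted for the Roche test (sensitivity lower bound 88.1% from n = 29; specificity CI of 99.7%-99.9%), not values taken from the EUA documents themselves:

```python
def fpr(sens: float, spec: float, prev: float) -> float:
    """False positive rate among positive results (1 - PPV)."""
    fp = (1 - spec) * (1 - prev)
    tp = sens * prev
    return fp / (fp + tp)

def fpr_interval(sens_ci, spec_ci, prev):
    """FPR is decreasing in both sensitivity and specificity, so the
    interval endpoints occur at opposite corners of the confidence box."""
    best = fpr(sens_ci[1], spec_ci[1], prev)   # high sens, high spec
    worst = fpr(sens_ci[0], spec_ci[0], prev)  # low sens, low spec
    return best, worst

# Assumed (illustrative) 95% CIs at 1% prevalence:
lo, hi = fpr_interval(sens_ci=(0.881, 1.0), spec_ci=(0.997, 0.999), prev=0.01)
```

Even with near-perfect specificity, the worst-corner FPR at 1% prevalence stays around one quarter, which is why the upper bounds dominate the interpretation.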

Application to specific locations
In this section, we evaluate PPV and FPR for areas with three local studies of seroprevalence: in California, New York, and Boston. When rigorous prevalence estimates are unavailable, we use the proportion of positive tests as a surrogate for prevalence for the purpose of estimating the rates of false positives in these studies. Rationale for and limitations of this approach, along with a sensitivity analysis, are discussed in the Supplementary Information.
Santa Clara County, California. Predictive values in the Santa Clara study nearly spanned the entire range of probabilities. Using prevalence values reflecting prior estimates ranging from 0% to 5%, the PPV in Santa Clara County at the time of the study ranges from 0% to 96% (Fig. 5). Prevalence near the low but nonzero end of the updated estimates (e.g. 0.2%) 38 would place PPV near the low end of that range.
New York. A statewide serosurvey 42 reported regional prevalence estimates of up to 19.9% in New York City. We estimate the PPV for all combinations of values reported in all of these sources, assuming the same serology tests were used in both studies. Figure 6 shows the range of PPV based on each of these possible values of sensitivity, specificity, and prevalence. New York City and Long Island had the highest prevalence and highest PPV, ranging from 74% to 98% and 60% to 97% in all scenarios. Statewide and other areas are in the middle: 60%-96% for Rockland, 65%-97% statewide. By contrast, PPV can be as low as 30% for the rest of the state, assuming a prevalence of 3.6%, if the specificity is 93%, or as high as 87% if the sensitivity is at the upper limit of the confidence interval reported in the EUA 33 .
Even worse, areas with low prevalence 42 such as the North Country (1.2%), Central NY (1.9%), and the Capital District (2.2%) had PPV estimates ranging from 12% to 69%, 18% to 78%, and 21% to 81%, respectively. In other words, the false positive rate in New York ranged from 2% to 88% depending on the region, the assumed prevalence under consideration, and the uncertainty in the sensitivity and specificity.

Chelsea, Massachusetts. Researchers at Massachusetts General Hospital conducted a seroprevalence study in the city of Chelsea 22 and reported that 31.5% of participants tested positive. The researchers on the Chelsea study reported specificity exceeding 99.5%. However, the manufacturer, BioMedomics, which is not among the labs with an EUA in "Analysis of individual antibodies tests", reports a sensitivity of 88.66% and specificity of 90.63% 43 .
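Given the manufacturer-reported values, the Chelsea arithmetic can be reproduced with a short Bayes calculation (a Python sketch; the paper's own analysis used R):

```python
def ppv(sens: float, spec: float, prev: float) -> float:
    """Positive predictive value via Bayes' theorem."""
    tp = sens * prev
    fp = (1 - spec) * (1 - prev)
    return tp / (tp + fp)

# BioMedomics-reported performance, with the observed positivity rate
# used as a stand-in for prevalence (see Supplementary Information).
chelsea_ppv = ppv(sens=0.8866, spec=0.9063, prev=0.315)  # ~0.81

n_positive = 63
expected_true = n_positive * chelsea_ppv         # ~51 true positives
expected_false = n_positive * (1 - chelsea_ppv)  # ~12 false positives
```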
Assuming a prevalence of 31.5% and the specificity and sensitivity values reported by BioMedomics 43 , PPV was only about 81%. This means that in this sample of 63 positives, about 51 are expected to be true positives and about 12 are expected to be false positives. Thus, the prevalence estimate in Chelsea reported in the press based on this sample alone is likely to be too high. In addition, participants were recruited as a convenience sample of people outside on a particular street, which may not be representative of the general population of Chelsea.
Figure 5. PPV for the antibodies test in the Santa Clara County study 37 . To reflect variation within and between references [37][38][39], prevalence values are included from 0% to 5% by 0.2%.
Discussion
Widespread serology testing could help mitigate the negative mental and physical effects of social isolation as well as economic effects to individuals and society overall. Fortunately, the false negative rate was under 10% in all scenarios. Unfortunately, the false positive rate can be shockingly high. Based on the prevalence estimated throughout the US and serology studies in California, New York, and Boston, the FPR of antibody test results ranges from 2% to 88%. Point estimates for tests with an EUA 44 reached 86%, and upper limits reached 93%, when the prevalence is 1%. Tests with low PPV and high FPR can be dangerous by giving patients with positive tests a false sense of security. Ironically, these people may then increase their risk of contracting Covid-19 if they relax their use of protective measures, such as mask wearing and social distancing.
The timing of the test may impact the result, as discussed in the Supplementary Information. Briefly, seroconversion is the process during which antibodies developed after infection with Covid-19 become detectable in the blood; the duration of seroconversion could complicate the interpretation of serology test results. After infection, patients took about 3-40 days to develop detectable antibodies 45,46 . Typically, after 14 days, most patients will have developed antibodies. If testing occurs within 14 days of infection, the effective sensitivity of the tests will be lower.
The number and implications of false positives are growing as large organizations encourage widespread serology testing. Quest Diagnostics is offering tests by Abbott, Ortho-Clinical, and Euroimmun for purchase 47 .
The tests can have up to 84% FPR in locations with 1% prevalence. Even at 10% prevalence, over one-third of positives could be false. OneBlood, a non-profit blood donation and distribution organization, is encouraging large-scale blood donation and then testing samples for antibodies using the Ortho-Clinical total test 48 . As discussed in "Analysis of individual antibodies tests", the Ortho-Clinical test could have a false positive rate of over 50% at 1% prevalence, or nearly 10% if the prevalence is 10%. Further, OneBlood is planning to use blood samples testing positive for antibodies as convalescent plasma. In fact, on August 23, the FDA granted an EUA for convalescent plasma in patients hospitalized with Covid-19 49 . As shown in this paper, single serology tests of the general population in low prevalence areas could yield a large number of false positives, which could inadvertently harm patients. Using all samples that test positive could mean that large numbers of samples lacking antibodies are treated as convalescent plasma, becoming inadvertent placebos for patients actively struggling with Covid-19.
More recently, with the impending approval of vaccine candidates, there is a need to determine the prioritization of groups throughout the dose distribution process. In some discussions 50 , patients who previously contracted Covid-19 would be considered lower priority, receiving the vaccine only after the uninfected public. Given the false positive rates of serology tests discussed in this paper, it would be imprudent to determine past disease status by widespread serology testing alone. In that case, uninfected individuals who falsely test positive on serology tests would not receive the vaccine in a timely manner and would have to wait longer for the opportunity to vaccinate and protect themselves from the virus. We recommend against using single serology tests to screen for prior infection. Moreover, the length of protection from reinfection is unknown for individuals with prior exposure either to the disease itself or via vaccination.
One recommendation for individuals who test positive is to consider a follow-up antibody test 51 . For instance, if the FPR is 50%, then assuming independent tests, the probability of two false positives drops to 25%. The FDA provides a calculator for the PPV of individual and combined tests 33 . Follow-up tests are common for other diseases with low-PPV screening methods, such as mammograms for breast cancer 52 . For instance, one could use a highly sensitive test (sensitivity/specificity of, say, 99% and 90%, respectively) as the initial test, followed by a highly specific test (sensitivity/specificity of 90% and 99%, respectively) as the second test. Then the FPR (i.e., 1 - PPV) would drop from about 91% to 10% in a place with prevalence of 1%. In a place with high prevalence of 10%, the FPR could drop from 47.6% to 1%. Another idea is to test all contacts for antibodies and use their results as evidence to support or refute the original serology test. Pursuing contacts of additional seropositive individuals may increase contact tracing and testing, which can either hinder the growth of future outbreaks or divert scarce resources from higher-risk contacts 25 . Moreover, increased testing brings cost and feasibility concerns.
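The two-test strategy can be sketched as a sequential Bayes update, in which the posterior probability after the first positive result becomes the prior for the confirmatory test (a Python sketch under the independence assumption stated above; function names are ours):

```python
def ppv(sens: float, spec: float, prev: float) -> float:
    """Positive predictive value via Bayes' theorem."""
    tp = sens * prev
    fp = (1 - spec) * (1 - prev)
    return tp / (tp + fp)

def two_stage_ppv(test1, test2, prev):
    """PPV after an independent confirmatory test: the posterior from
    the first positive result is the prior for the second test."""
    return ppv(*test2, ppv(*test1, prev))

screen = (0.99, 0.90)   # highly sensitive initial test (sens, spec)
confirm = (0.90, 0.99)  # highly specific follow-up test (sens, spec)

# At 10% prevalence: ~47.6% single-test FPR falls to ~1% after confirmation.
fpr_single = 1 - ppv(*screen, 0.10)
fpr_combined = 1 - two_stage_ppv(screen, confirm, 0.10)

# At 1% prevalence: ~91% single-test FPR falls to ~10% after confirmation.
fpr_single_low = 1 - ppv(*screen, 0.01)
fpr_combined_low = 1 - two_stage_ppv(screen, confirm, 0.01)
```

Ordering the sensitive test first keeps the combined false negative rate low, while the specific confirmatory test drives down the false positive rate.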
There are some additional limitations of our paper. For instance, some of the information may become outdated quickly. The number of tests operating under an FDA EUA will likely increase over time. For instance, Abbott Alinity was added to the original 12 tests made available by the FDA under EUA in early May 2020 while the first draft of this paper was being written. At the time, the FDA noted that at least 160 serology tests were available before the FDA increased its oversight 53 . This implies that some of the antibody tests on the market might have even lower sensitivity or specificity than those included in this paper and therefore even higher FPR. Indeed, between the original paper submission in June 2020 and this revision in December 2020, two tests, AutoBio and ChemBio, had their FDA EUAs revoked 54,55 . Notably, values for BioRad and Ortho-Clinical changed, and Diasorin added an IgM test along with its earlier IgG test. Similarly, if an infected patient takes the test before antibodies have developed, then the sensitivity will be lower. An extended discussion is included in the Supplementary Information. At the same time, prevalence may increase over time, at least for some cities or towns. A reference to prevalence estimates calculated by the CDC and current as of the writing of this revision is provided 32 . Importantly, neither the results nor the interpretation for serology tests generalize to diagnostic tests. The Supplementary Information provides a brief discussion.
In conclusion, serology tests for the novel coronavirus generally have low false negative rates and highly volatile false positive rates. While false positive rates decrease with increasing prevalence, current prevalence estimates remain low in most areas of the US as of the writing of this manuscript. With increasing serology testing and likely increased reporting of testing results, it is critical to understand these values and interpret test results properly. We hope that this context and interpretation can aid doctors, patients, researchers, and policy makers in informed decision making, which may even save lives.

Methods
We collected reported sensitivity and specificity values of serology tests with EUA approval by the US FDA. Prevalence estimates were also collected to determine an appropriate range for the plots. These values were combined to produce estimates of PPV and NPV for a variety of input parameter values. We then homed in on specific FDA tests and areas that have conducted serology tests to provide estimates and uncertainty for PPV and false positive rates. All programs utilized R version 3.6.1 56 . PPV and NPV were calculated using the package MKmisc 57 . Plots were created with the packages plotly 58 and ggplot2 59 . Figures 1 and 2, as higher dimensional plots, were designed to allow interactive visualization. Code to generate the plots and view them in an interactive mode may be downloaded from our github repository at https://github.com/nbrownst/AntibodiesPredictiveValues.
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.