Introduction

Region 4 Stork (R4S) is a collaborative project for laboratory quality improvement of newborn screening by tandem mass spectrometry (MS/MS).1 It began in 2004 and overlapped in time with the delineation of the recommended uniform screening panel.2 Ten years later, R4S has engaged 1,050 users from 64 countries, and the true-positive database has exceeded 17,000 cases, with more than 1.2 million results. The key element of this project, which is now integrated into the Newborn Screening Translational Research Network (https://www.nbstrn.org/), is freely available, on-demand access to postanalytical tools designed to interpret analyte profiles of a single case.3 Tools generate a score reflecting the likelihood of a diagnosis based on condition-specific disease ranges of all informative analytes. Site-specific customization of these tools is available to correct for differences in analyte panels and sample preparation (derivatized versus underivatized method). In a recent study, R4S tools were credited with contributing to more effective diagnostic algorithms for very long–chain acyl-coenzyme A dehydrogenase deficiency.4

Despite more than 120,000 page views, in most instances, these tools have been utilized reactively after one or more analytes exceeded a chosen cutoff value, so their potential to prevent false-positive (FP) outcomes has not been investigated adequately. The impetus for this study comes from concerns over poor specificity that have been voiced repeatedly.5,6,7,8,9 In R4S, the main barrier to a comparative assessment has been the necessity to process one case with one tool at a time. The introduction of two high-throughput portals has provided an opportunity for a comparison between the traditional interpretation model, based on analyte cutoff values, and one based on the scores of the R4S interpretive tools.

Materials and Methods

Study cohort

The population study included 176,186 first specimens of subjects born in California between 1 January and 30 June 2012. The following exclusion criteria were applied to minimize skewing the count of cases with abnormal results based on cutoff-based interpretation: (i) birth weight <1,800 g; (ii) birth weight >1,800 g and age at collection <24 hours; and (iii) birth weight >1,800 g and age at collection >7 days. This study was approved by the Committee for the Protection of Human Subjects of the California Department of Public Health (protocol 13-10-1387).
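The three exclusion criteria can be expressed as a simple filter. The following is a minimal sketch, not the program's actual code: the `Specimen` type and its field names are hypothetical, and because the text does not say how a birth weight of exactly 1,800 g is handled, the sketch assumes that such specimens are subject to the age checks.

```python
from dataclasses import dataclass

MIN_BIRTH_WEIGHT_G = 1800   # criterion (i): lighter newborns are excluded
MIN_AGE_H = 24              # criterion (ii): collected earlier than 24 hours
MAX_AGE_H = 7 * 24          # criterion (iii): collected later than 7 days


@dataclass
class Specimen:
    """Hypothetical record holding only the fields the criteria need."""
    birth_weight_g: float
    age_at_collection_h: float


def is_excluded(specimen: Specimen) -> bool:
    """Return True when a first specimen meets any exclusion criterion."""
    if specimen.birth_weight_g < MIN_BIRTH_WEIGHT_G:
        return True
    # Assumption: a weight of exactly 1,800 g is treated like ">1,800 g".
    if specimen.age_at_collection_h < MIN_AGE_H:
        return True
    return specimen.age_at_collection_h > MAX_AGE_H


def study_cohort(specimens):
    """Keep only the specimens that pass all three criteria."""
    return [s for s in specimens if not is_excluded(s)]
```

Applied to a list of first specimens, `study_cohort` returns the subset that would enter the comparison.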

Current practice in California is to assign profiles that are not completely normal to one of three categories: out of range, review, and presumptive positive. Out of range indicates a single result exceeding the analyte cutoff value but not meeting additional interpretation criteria (e.g., one or more ratios). These findings are considered of negligible clinical significance; they are not reviewed individually, but a notification report is sent to the primary-care provider with no recommendation to pursue follow-up or additional testing. A review category is triggered by multiple abnormalities according to the criteria described above. Further evaluation of the overall profile and demographic information is performed by an expert reviewer; in the vast majority of cases the resolution is the same as for the out of range category, and a similar report is generated. Otherwise, a review case is escalated to become a presumptive positive, i.e., an abnormal result, and follow-up is initiated.

MS/MS method

The California Department of Public Health screening program relies analytically on an underivatized method based on a commercial kit (StepOne Newborn Screening; PerkinElmer, Waltham, MA). The analyte panel includes 13 amino acids and 38 acylcarnitines, supplemented by 61 calculated ratios (23 for amino acids and 38 for acylcarnitines). The count of active cutoff values is 45, and 22% of them meet the R4S validation criteria.1 Currently, the California Department of Public Health does not routinely perform any of the following second-tier tests (2TTs): (i) homocysteine, methylmalonic acid, and methylcitric acid;10,11 (ii) allo-isoleucine;12 and (iii) 3-hydroxyglutaric acid, glutaric acid, and ethylmalonic acid.

R4S data and tools

R4S is a custom-designed and coded application for the processing of laboratory data based on numerical results.1,2,3 The application uses the Microsoft ASP.NET Framework version 3.5 and Microsoft SQL Server 2008 R2. As of 17 April 2014, the true-positive database includes 17,098 newborns, identified according to local protocols and/or professional guidelines,13 corresponding to 1,209,481 data points. Through an interface called tool builder, this multivariate pattern recognition software creates different types of postanalytical interpretive tools. These tools are designed to provide an evidence-based answer to one of three types of questions: (i) a yes or no situation (the one-condition tool: does a patient have or not have a specific condition?); (ii) a differential diagnosis between two conditions with similar biochemical phenotypes (the dual-scatter plot: e.g., differentiation between very long-chain acyl-CoA dehydrogenase deficiency and very long-chain acyl-CoA dehydrogenase carrier status); and (iii) a simultaneous answer to the yes or no question for all conditions for which n > 5 and disease ranges have been established (the all-conditions tool). A defining characteristic of the R4S postanalytical tools is the evolution of clinical validation from the conventional static process, usually performed early during test development, to a constantly evolving, dynamic refinement of the disease ranges that continues to improve throughout the entire test life cycle. The current count of active tools is 56 one-condition tools, 227 site-specific tools (13 of them created for California), and 24 dual-scatter plots. When a site-specific tool is released, it replaces the general tool for that particular condition when a data set is analyzed using either the tool runner or the all-conditions tool. Another available customization is to recalculate a score after switching from the cumulative reference percentiles to those uploaded by a user’s own site.

Data uploading to R4S

The tool runner simultaneously calculates a score for all available tools for each case within a batch that is uploaded to the website as a comma-separated value (.csv) file devoid of any patient health information (Figure 1). Analyte results are paired with the corresponding Logical Observation Identifiers Names and Codes (LOINC; http://www.loinc.org/) code.14 LOINC is a universal code system for identifying laboratory and clinical observations maintained by the Regenstrief Institute at Indiana University. This system provides standardized terms for all kinds of observations and measurements that enable exchange and aggregation of electronic health data from many independent systems. A LOINC code is unique to a combination of component, system (sample type), scale, and unit of measurement.
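As an illustration of a batch file that pairs each result with a code and carries no patient identifiers, the sketch below writes one row per result. The column layout (`sample_id`, `loinc`, `value`), the field names, and the `PLACEHOLDER-…` codes are all assumptions made for this example; the actual R4S file specification and the real LOINC codes are not given in the text and are not reproduced here.

```python
import csv
import io

# Hypothetical identifier fields that must never reach the upload file.
PHI_FIELDS = {"patient_name", "medical_record_number", "date_of_birth"}


def build_upload_csv(cases, loinc_by_analyte):
    """Serialize de-identified results as CSV text, one row per result.

    cases: list of dicts with a 'sample_id' key plus analyte results.
    loinc_by_analyte: maps each analyte field name to its code.
    An analyte without a mapped code raises KeyError rather than
    being written without a code.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample_id", "loinc", "value"])
    for case in cases:
        for field, value in case.items():
            if field == "sample_id" or field in PHI_FIELDS:
                continue  # identifiers are dropped, never written
            writer.writerow([case["sample_id"], loinc_by_analyte[field], value])
    return buffer.getvalue()


# Placeholder codes only; real LOINC codes must come from loinc.org.
example_codes = {
    "free_carnitine": "PLACEHOLDER-C0",
    "citrulline": "PLACEHOLDER-CIT",
}
example_cases = [
    {"sample_id": "S001", "patient_name": "REDACTED",
     "free_carnitine": 21.4, "citrulline": 9.8},
]
example_csv = build_upload_csv(example_cases, example_codes)
```

Failing on an unmapped analyte is a deliberate choice in this sketch: a result without a code could not be matched to a tool, so it is safer to stop than to upload it silently.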

Figure 1

The process to upload data batches to the R4S tool runner. Details of the design, content, and clinical utilization of the one-condition tools and dual-scatter plots have been reported previously.3 A recorded educational presentation is also available at http://www.mayomedicallaboratories.com/articles/hot-topic/2013/06-15-r4s-pt-3/index.html. R4S, Region 4 Stork.

A 96-well plate includes ~90 patient samples and 9,000 results. After uploading of a .csv file of this size, processing time is typically less than 5 seconds before generating an interactive tabular report of any informative score. The all-conditions tool generates an interactive graphic report of all scores, including zeros, for one case at a time. Both types of reports are linked to each individual tool.

Results

Current status and validation criteria of cutoff values

To establish a baseline for the comparison between performance outcomes, the size and adequacy of the cutoff panels of the California program were compared with those of all participating sites in R4S. Figure 2 shows the number of cutoff values plotted versus the proportion that meets the validation criteria described previously.1 California is slightly below the median for both parameters.

Figure 2

The number of active cutoff values (n = 8,382) used by R4S participating sites (n = 151). The y axis represents the proportion that meets the validation criteria. A given cutoff value is considered validated when it remains within the target range defined in R4S and also within the 25–75 percentile range of all cutoff values posted by all sites. High and low target ranges are defined by percentiles of the cumulative reference and disease ranges but could be replaced by override values in case of significant overlap.1 Vertical and horizontal dotted lines indicate the median values (active cutoff values = 56; proportion within the validation criteria = 26%). R4S, Region 4 Stork.
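The two-part validation criterion in the legend of Figure 2 can be sketched as follows. This is an illustration under stated assumptions rather than the R4S implementation: the function and parameter names are hypothetical, the target range is passed in directly instead of being derived from reference and disease percentiles, and the quartiles are computed with the "inclusive" method of Python's `statistics.quantiles`, which may differ from the percentile method used in R4S.

```python
from statistics import quantiles


def is_validated(cutoff, target_low, target_high, all_site_cutoffs):
    """Return True when a cutoff meets both validation conditions.

    1. It lies within the target range [target_low, target_high].
    2. It lies within the 25th-75th percentile range of the cutoff
       values posted by all sites for the same analyte.
    """
    q1, _median, q3 = quantiles(all_site_cutoffs, n=4, method="inclusive")
    within_target = target_low <= cutoff <= target_high
    within_peer_range = q1 <= cutoff <= q3
    return within_target and within_peer_range


def proportion_validated(flags):
    """Share of a site's cutoffs that are validated (y axis of Figure 2)."""
    return sum(flags) / len(flags)
```

A site's position on the plot would then be given by the number of its active cutoff values and by `proportion_validated` applied to the per-cutoff results.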

Retrospective analysis of the California cohort

The outcome of this cohort is summarized in Table 1. Case resolution by tool score is based on a simultaneous, parallel evaluation of all conditions, not on a sequential or hierarchical process. Cases with multiple informative scores are resolved by applying the differential diagnosis tool, the dual-scatter plot, to every combination of two conditions.3

Table 1 Comparison between cutoff-based system and R4S postanalytical interpretive tools in California

In the true-negative group, 98.54% showed no informative scores. An additional 642 cases (0.39%) were settled as true-negative outcomes because recalculation of the score after switching to California’s own reference percentiles rendered the score uninformative and/or produced a 0 percentile rank, i.e., a score lower than the lowest known true-positive case detected in California. The remaining 2,422 cases (1.46%) presented with at least one informative score. Because California relies on a network of 15 metabolic referral centers that would likely become aware of any case missed by newborn screening,15 the following analysis is based on the assumption that there were no additional false-negative events during the period under evaluation. Table 1 shows the relative proportions of these cases that could have been resolved conclusively as true negatives on the basis of a total parenteral nutrition score,16 a dual-scatter plot, or interpretation rules. For example, the cumulative experience in R4S with carnitine uptake defect (n = 347) and related maternal cases1,17,18 (n = 168) has shown that a borderline concentration of free carnitine could be disregarded when the sum of propionylcarnitine and palmitoylcarnitine is >2 μmol/l. Reflexing to a 2TT, particularly homocysteine measurement in a large number of cases with a concentration of methionine below the chosen cutoff value,10 would have resolved more than 800 cases. 
For the remaining cases, a more complete panel of analytes—in particular, the addition of argininosuccinic acid, glutamic acid, and glutamine, the latter two for improved interpretation of cases with a low concentration of citrulline—could have prevented the generation of 20% of the residual scores.1,19 Another avenue of potential resolution could have been verification of an abnormal finding by using a derivatized method, particularly to mitigate the frequent flagging of malonylcarnitine due to either isobaric interference in the underivatized method or the choice of internal standard.20 In summary, utilization of the tools and other interpretive options would not have reclassified any true-negative case to the presumptive positive category.
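The interpretation rule quoted above for carnitine uptake defect lends itself to a one-line check. The sketch below encodes only the stated condition (sum of propionylcarnitine and palmitoylcarnitine >2 μmol/l); the function and parameter names are hypothetical, and deciding whether a free carnitine result is "borderline" is left to the caller because the text does not define that range.

```python
C3_PLUS_C16_THRESHOLD_UMOL_L = 2.0  # threshold quoted in the text


def borderline_c0_can_be_disregarded(c3_umol_l, c16_umol_l):
    """Apply the rule for a borderline free carnitine (C0) result.

    c3_umol_l: propionylcarnitine concentration in umol/l.
    c16_umol_l: palmitoylcarnitine concentration in umol/l.
    Returns True when the sum exceeds the threshold, i.e., when the
    borderline C0 value could be disregarded according to the rule.
    """
    return (c3_umol_l + c16_umol_l) > C3_PLUS_C16_THRESHOLD_UMOL_L
```

For example, a specimen with propionylcarnitine of 1.5 μmol/l and palmitoylcarnitine of 0.8 μmol/l would satisfy the rule, whereas one with 0.6 and 0.5 μmol/l would not.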

Of the 8,276 combined out of range and review cases, 7,181 (87%) could have been prevented completely by combining lack of informative scores and total parenteral nutrition scores. 2TTs could have resolved 513 additional cases. Table 1 shows the distribution of the remaining cases according to the same criteria applied to the true-negative cases.

A total of 454 cases initially reported as presumptive positives were later reclassified as FPs. Again, 66% of them could have been prevented by combining lack of informative scores, a score indicative of total parenteral nutrition, dual-scatter plots, and interpretation rules (Figure 3). Another 14% could have been resolved by a 2TT, with an estimated utilization rate in the entire cohort of 1 in 128 samples. With inclusion of additional markers and verification by derivatization, the residual number of FPs could have been reduced below 10% of the initial count. In this group, there were some findings (e.g., citrulline >300 μmol/l; octanoylcarnitine >1 μmol/l; and glutarylcarnitine >5 μmol/l) that surprisingly normalized by the time of short-term follow-up. Low methionine concentration without the benefit of the 2TT for homocysteine caused 17 FP outcomes. More notably, 62 (11% of all FP cases) were flagged for low citrulline, with an FP rate of 0.035%. However, 28 of them could have been prevented by a noninformative score of the ornithine transcarbamylase deficiency tool. Another 30 could possibly have been prevented by a more complete analysis, especially by the inclusion of glutamic acid and glutamine, allowing the calculation of the respective ratios using citrulline as denominator.1 Although it is not possible to calculate precise performance metrics under these circumstances, the same system proposed here has performed well prospectively in Minnesota over a 6-year period (2008–2013), leading to only 11 FP cases for low citrulline concentration among 431,748 newborns (0.0025%). There were five true-positive cases and one false-negative case over the same period. With one exception, a case confirmed to be affected with carbamoyl phosphate synthetase deficiency, all true-positive cases in the population study were correctly identified.
The other false-negative case, affected with ornithine transcarbamylase deficiency, could have been detected by the site-specific tool for that condition, and it is possible that the resolution of both could have been aided by the measurement of glutamate and glutamine.

Figure 3

The potential impact of R4S tools, interpretation rules, 2TTs, and other potential interventions on the number of FP cases that were reported in California (January to June 2012). (-) Indicates subtraction of cases from the residual FP count on the basis of the criterion listed next to the symbol. 2TT, second-tier test; FP, false positive; R4S, Region 4 Stork.

Table 2 shows a comparison between actual and estimated performance metrics. Notable improvements could have been achieved in positive predictive value and FP rate. This conclusion is not isolated: since its deployment, the tool runner has been applied 22,238 times by 30 programs (20 international), corresponding to 78 million calculated scores, on average 100,000 calculated scores per day. In terms of effectiveness, our findings are consistent with those that have been observed independently by other programs: a similarly designed comparative study applied to more than 180,000 Swedish newborns reduced the FP rate from 0.07 to 0.05% (Martin Engvall, personal communication). In another comparative study of 96,000 newborns in central Italy, the FP rate declined from 2.3 to 0.6% (Claudia Carducci, personal communication). Finally, the outcome of the R4S tools as the primary mode of prospective postanalytical interpretation in Minnesota for 2013 was as follows: among 71,207 newborns tested, 38 of 55 cases that were referred for follow-up because of abnormal MS/MS results were later confirmed to be true-positive cases, with a positive predictive value of 69% and an FP rate of 0.024%. These figures are indeed very similar to those estimated by the retrospective analysis of the California data.
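The Minnesota figures quoted above can be reproduced with a few lines of arithmetic. In this sketch the function names are hypothetical, and the FP rate is taken as false positives divided by all newborns screened, an assumption that matches the reported percentages (a denominator restricted to unaffected newborns gives the same values at this precision).

```python
def positive_predictive_value(true_positives, referred):
    """Fraction of referred cases later confirmed as true positives."""
    return true_positives / referred


def false_positive_rate(false_positives, screened):
    """False positives as a fraction of all newborns screened (assumed)."""
    return false_positives / screened


# Minnesota, 2013: 55 referrals, 38 confirmed, 71,207 newborns tested.
referred, confirmed, screened = 55, 38, 71_207
ppv_pct = 100 * positive_predictive_value(confirmed, referred)
fpr_pct = 100 * false_positive_rate(referred - confirmed, screened)

# Low citrulline: California cohort versus Minnesota, 2008-2013.
california_low_cit_pct = 100 * false_positive_rate(62, 176_186)
minnesota_low_cit_pct = 100 * false_positive_rate(11, 431_748)

print(f"PPV {ppv_pct:.0f}%, FP rate {fpr_pct:.3f}%")
print(f"Low citrulline FP rate: {california_low_cit_pct:.3f}% "
      f"vs {minnesota_low_cit_pct:.4f}%")
```

The same two functions can be applied to the counts in Table 2 to compare the cutoff-based and tool-based outcomes on equal terms.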

Table 2 Comparison of performance metrics between cutoff-based systems (actual) and R4S postanalytical interpretive tools integrated with 2TTs, additional markers, and derivatization in Minnesota (actual) and California (estimated)

Discussion

This report focuses on the second generation of R4S postanalytical interpretive tools and on how to use them effectively in a daily laboratory practice in which rapid, large-scale uploading and data processing are desirable. In a previous publication,3 we described how the postanalytical interpretive tools are produced using the tool builder functionality, but there was no mention of their use in a laboratory setting to underscore their clinical utility to a “user” rather than to a “producer” of laboratory results. Indeed, the one-condition tools and the dual-scatter plots can be described as clinician tools, applicable to aiding decisions about case resolution, test utilization, and therapeutic interventions. This was done deliberately to raise awareness of situations in which clinicians would have to reconcile the referral for follow-up of a case that in R4S would be considered not informative. Such a situation is less than ideal, suggesting that the application of the interpretive tools should occur at the earliest possible time, before a decision is made to proceed with further evaluation, including the request to submit a repeat specimen. This is the first report of the functionality that allows the simultaneous evaluation of all conditions with an active tool. Once an informative score has been generated, the all-conditions tool serves as an effective gateway to an unbiased set of differential diagnoses. This design allows a user to simultaneously and rapidly explore multiple one-condition tools, even those with a score of zero that was triggered by differentiator and outlier rules.3

Because additional conditions are likely to be added soon to the recommended uniform screening panel, improved performance is necessary to assure proper allocation of existing resources and avoidance of unnecessary distress to the patient’s family.21 Most true-positive cases detected by MS/MS analysis of amino acids and acylcarnitines present with almost self-evident profiles that should not constitute a diagnostic challenge for the traditional cutoff-based system. The critical issue to be addressed is how to prevent the occurrence of FP cases driven by a rigid application of statistically defined cutoff values, especially when all potentially informative markers and ratios are not considered.

Over the past decade, the introduction and routine application of 2TTs to verify MS/MS results have encountered mixed reactions. To date, 44% of participants who have provided this information as part of their profile in R4S (n = 84) perform one or more 2TTs on site, 20% outsource them, and 36% do not use them. Although there is a steady trend toward greater utilization, persistent barriers include the need to set up additional MS/MS methods that also require chromatographic separation, lack of instrumentation, the perception of delayed reporting and added cost to the laboratory, and last, but not least, the risk of overutilization. However, a workload of less than 1% should be manageable by most programs, especially when regionalization of services is considered to optimize resource utilization and quality of testing. Moreover, dedicated applications in R4S for each 2TT are or will soon be available where objective, large-scale comparison between true-positive and FP cases will further reduce the number of cases that still remain unresolved after the performance of a 2TT.

This study has provided new insight into the feasibility of newborn screening for a number of conditions with a biochemical phenotype based on low concentrations of amino acids. On the one hand, the verification of a low methionine level by a 2TT for homocysteine is extremely effective (no FPs and three confirmed cases in Minnesota since 2009), and a case could be made for the inclusion of remethylation disorders in the recommended uniform screening panel,22 considering the benefits of early identification and intervention.23,24 On the other hand, the California experience has shown that without 2TTs, targeting low methionine does cause many FP events, often not preventable by the R4S tool. The identification of proximal urea cycle disorders by low citrulline concentration has been regarded as a challenge in terms of sensitivity and specificity, also considering the growing spectrum of possible secondary targets.25,26 However, the observed FP rate in the Minnesota population has matched the average performance of all primary analytes that have caused FP outcomes over a 6-year period (0.0025%; range: 0.0002–0.0218%; n = 24) for the conditions detected by MS/MS. Therefore, once the availability and effectiveness of early intervention27,28 has been recognized as evidence of net benefit, the nomination of ornithine transcarbamylase deficiency and carbamoyl phosphate synthetase deficiency for inclusion in the recommended uniform screening panel should also be considered.

Disclosure

G.M., R.J.C., D.M.S.M., and P.R. are coinventors of a provisional patent application (Computer-based dynamic data analysis, US 61/722,761). The other authors declare no conflict of interest.