CAP/ACMG proficiency testing for biochemical genetics laboratories: a summary of performance

Purpose: Testing for inborn errors of metabolism is performed by clinical laboratories worldwide, each utilizing laboratory-developed procedures. We sought to summarize performance in the College of American Pathologists’ (CAP) proficiency testing (PT) program and identify opportunities for improving laboratory quality. When evaluating PT data, we focused on a subset of laboratories that have participated in at least one survey since 2010.
Methods: An analysis of laboratory performance (2004 to 2014) on the Biochemical Genetics PT Surveys, a program administered by CAP and the American College of Medical Genetics and Genomics. Analytical and interpretive performance was evaluated for four tests: amino acids, organic acids, acylcarnitines, and mucopolysaccharides.
Results: Since 2010, 150 laboratories have participated in at least one of four PT surveys. Analytic sensitivities ranged from 88.2 to 93.4%, while clinical sensitivities ranged from 82.4 to 91.0%. Performance was higher for US participants and for more recent challenges. Performance was lower for challenges with subtle findings or complex analytical patterns.
Conclusion: US clinical biochemical genetics laboratory proficiency is satisfactory, with a minority of laboratories accounting for the majority of errors. Our findings underscore the complex nature of clinical biochemical genetics testing and highlight the necessity of continuous quality management.


INTRODUCTION
Clinical biochemical genetics is a medical specialty devoted to diagnosing and treating inborn errors of metabolism (IEMs). IEMs encompass a heterogeneous group of genetic conditions including disorders of amino acid (AA; e.g., phenylketonuria), organic acid (OA; e.g., methylmalonic aciduria, MMA), mucopolysaccharide (MPS; e.g., Hurler syndrome), and fatty acid metabolism (e.g., medium-chain acyl-CoA dehydrogenase, MCAD, deficiency). Although individually rare, their estimated combined incidence is as high as 1:800-1:2,500 births. 1,2 Symptoms often appear early in life, and clinical outcomes are optimized by early diagnosis and initiation of therapy before irreversible damage has occurred. 3 An IEM diagnosis is established through high-complexity, laboratory-developed tests performed by a relatively small number of laboratories worldwide, called clinical biochemical genetics laboratories (BGLs). Methods vary between laboratories, but they typically include high-pressure liquid chromatography, gas chromatography/mass spectrometry and/or liquid chromatography-tandem mass spectrometry (LC-MS/MS). There are no US Food and Drug Administration-cleared tests for biochemical genetics testing, a commonality shared with molecular genetics laboratories and DNA-based testing for rare heritable conditions. Despite the nonstandardized nature of laboratory testing, molecular genetic testing quality is very high. [4][5][6][7][8][9][10][11] Nevertheless, biochemical genetic testing poses unique challenges to the clinical laboratory, including the innate task of assessing multiple analytes in a single test, and the inter- and intrapatient metabolic variability due to genotype, clinical status, or other intrinsic and extrinsic factors. Thus, it is vital that laboratories perform extensive validation and ongoing test monitoring to ensure the highest quality for patient care.
Participation in proficiency testing (PT) is associated with improved laboratory quality, particularly in the laboratory-developed test setting. [12][13][14] External PT schemes for BGL testing were first introduced in the United States in the 1980s, with a review of outcomes highlighting the importance of ongoing training, education, and quality oversight in the area of IEM diagnosis. 15,16 Comprehensive PT is also offered by the European Research Network for Evaluation and Improvement of Screening, Diagnosis and Treatment of Inherited Disorders of Metabolism (ERNDIM), which reported a positive correlation between participation and laboratory quality. 17 In 1993, the College of American Pathologists (CAP), in collaboration with the American College of Medical Genetics and Genomics (ACMG), introduced a BGL PT program that included the analysis of AAs, OAs, and MPS. The program was expanded in 2004 to include plasma acylcarnitine (AC) performance.
Here, we present an analysis of the CAP/ACMG Biochemical Genetics Resource Committee PT program from more than 10 years of surveys (2004 through 2014). The analysis was restricted to laboratories that had participated in at least one PT survey since 2010. This was done to allow for estimates of recent laboratory performance, but it also permitted the examination of earlier PT performance in a subset of active laboratories.

BGL proficiency testing survey
The PT survey consists of five specimens distributed twice yearly (an A and B distribution) to laboratories in the United States and abroad. Each mailing contains one specimen for every analysis, including plasma or urine AA, urine OA, plasma AC, and urine MPS. A fifth specimen, excluded from this analysis, is an ungraded educational challenge focusing on less frequently encountered areas of AA, OA, or AC testing. This report includes results from 21 BGL surveys from the 2004-B through 2014-B mailings.
Most specimens were authentic plasma or urine samples from patients with clinically confirmed diagnoses. Occasionally, samples were from unaffected individuals, pooled samples, or spiked with one or more compounds to mimic a metabolic disorder. All plasma samples used had negative test results for hepatitis B surface antigen, hepatitis C antibody, and HIV antigen/antibody. Challenges were pretested by multiple reference laboratories to verify suitability. Specimens were stored at −20°C or lower, and aliquots were shipped on dry ice. A clinical history accompanied each challenge. Participants reported all diagnostically relevant analyte(s) and the most likely diagnosis (clinical interpretation) based on analytic results and clinical history by selecting answers from a predefined list. Information about testing methodology and results of analyte quantitation were requested but not required or graded. Laboratories returned results within 30 days via an online result submission form or facsimile.

Data collection and analysis
Data were extracted from the CAP data management system and de-identified. These data encompassed participant results, including those returned after the required submission date and excluded from original Participant Summary Reports. Each challenge's analytic and interpretative results were graded as "Acceptable" or "Unacceptable" based on participant consensus (>80%) or, in a few instances, on the consensus of reference laboratory results. Missing analytic results or missing clinical interpretations were not graded and were excluded from analysis. Members of the CAP/ACMG Biochemical and Molecular Genetics Resource Committee reviewed results and approved grading criteria. Acceptable results were based on either the identification of a well-recognized analytical profile or a pathognomonic analyte level, depending on the circulated specimen, and correlation with the provided clinical scenario. A two-dimensional matrix of the graded responses stratified by time (columns) and participants (rows) was created. These "heat maps" were used to compute analytic sensitivity (the proportion of abnormal analytes correctly identified) and clinical sensitivity (the proportion of disorders correctly diagnosed) for all challenges.
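The matrix-based calculation described above can be illustrated with a short Python sketch. It is not the authors' code: the data layout (rows as participants, columns as challenges), the use of `None` for ungraded responses, and every name and value below are assumptions made for illustration only.

```python
from typing import List, Optional

# Hypothetical graded-response matrix: one row per participant, one column
# per challenge. True = "Acceptable", False = "Unacceptable",
# None = missing result (not graded and excluded from analysis).
Grade = Optional[bool]
Matrix = List[List[Grade]]


def overall_sensitivity(matrix: Matrix) -> float:
    """Proportion of graded responses that were acceptable."""
    graded = [g for row in matrix for g in row if g is not None]
    if not graded:
        raise ValueError("no graded responses")
    return sum(graded) / len(graded)


def errors_per_participant(matrix: Matrix) -> List[int]:
    """Number of unacceptable responses in each row (participant)."""
    return [sum(1 for g in row if g is False) for row in matrix]


def sensitivity_per_challenge(matrix: Matrix) -> List[Optional[float]]:
    """Sensitivity for each column (challenge); None if nothing was graded."""
    result: List[Optional[float]] = []
    for column in zip(*matrix):
        graded = [g for g in column if g is not None]
        result.append(sum(graded) / len(graded) if graded else None)
    return result


# Made-up example: three participants, four challenges.
example: Matrix = [
    [True, True, None, True],
    [True, False, True, True],
    [None, False, True, False],
]

print(overall_sensitivity(example))
print(errors_per_participant(example))
print(sensitivity_per_challenge(example))
```

Excluding `None` entries from both numerator and denominator mirrors the statement that missing results were not graded; any other handling of missing data would change the estimates.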
Calculations of clinical sensitivity included separate analyses for two types of false-negative results. One type occurs when participants recognize a specimen as having an abnormal clinical interpretation but report the incorrect disorder (i.e., not the intended response). The second type occurs when participants fail to recognize any clinical disorder and interpret the result as "Normal, unaffected." This latter type is the classic false negative. The MPS survey had multiple challenges from unaffected individuals (six in total) and these were used to calculate analytic specificity (the proportion of normal specimens with no abnormal analytes reported) and clinical specificity (the proportion of normal specimens correctly diagnosed as normal). Rates were compared using the chi-squared test with a two-tailed significance level of 0.05. The 95% confidence intervals for proportions were computed using the adjusted Wald asymptotic method.
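For readers who wish to reproduce these statistics, the following Python sketch shows one common form of each method. It assumes that "adjusted Wald" refers to the Agresti-Coull interval and that the chi-squared test was applied to a 2 × 2 table without continuity correction; the source specifies neither detail, and the function names are hypothetical. The counts in the demonstration are taken from the AA results reported in this paper.

```python
import math
from typing import Tuple

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def adjusted_wald_ci(successes: int, n: int, z: float = Z_95) -> Tuple[float, float]:
    """Adjusted Wald (Agresti-Coull) confidence interval for a proportion.

    Adds z^2/2 pseudo-successes and z^2 pseudo-trials, then applies the
    ordinary Wald formula to the adjusted proportion.
    """
    n_adj = n + z * z
    p_adj = (successes + z * z / 2) / n_adj
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


def chi_squared_2x2(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Pearson chi-squared test (1 df, no continuity correction, an
    assumption) comparing two proportions x1/n1 and x2/n2.

    Returns the test statistic and the two-tailed P value.
    """
    a, b = x1, n1 - x1
    c, d = x2, n2 - x2
    denominator = n1 * n2 * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column total is zero")
    statistic = (n1 + n2) * (a * d - b * c) ** 2 / denominator
    # For 1 degree of freedom, P(X > s) equals erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Overall AA analytic sensitivity: 1,552 acceptable of 1,661 responses.
low, high = adjusted_wald_ci(1552, 1661)
print(f"95% CI: {low:.3f} to {high:.3f}")

# AA "normal" false negatives: 29/1,188 (US) vs 15/475 (international).
stat, p = chi_squared_2x2(29, 1188, 15, 475)
print(f"chi-squared = {stat:.2f}, P = {p:.2f}")
```

The standard library suffices here because the chi-squared distribution with one degree of freedom reduces to the complementary error function; a statistics package would be the more usual choice for larger tables.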

Participants and tests challenged
A total of 150 laboratories participated in the program for at least one of the AA, OA, AC, and MPS challenges since 2010, including 97 US and 53 international laboratories (North America, South America, Asia, and Europe). Of these, 35 (23%) reported results for all four schemes (AA, OA, AC, and MPS), while 49 (33%) participants reported results for only three schemes (25 for AA, OA, and AC and 23 for AA, OA, and MPS). Among the 38 (25%) participants reporting results for two schemes, the most common pair was AA and OA (26 participants) and among the 28 (19%) reporting results for only one scheme, the most common was AA (22).

AA proficiency testing
The AA challenges included specimens from individuals with phenylketonuria, maple syrup urine disease, various urea cycle defects, cystinuria, and other AA transport disorders (Supplementary Table S1 online). Results from 2013-A were excluded due to sample degradation likely related to storage. [...] Figure S1). The overall analytic sensitivity for AA testing was 93.4% (1,552/1,661 responses), with the highest sensitivities (100%) observed for conditions related to phenylalanine, citrulline, and arginine metabolism, and the lowest for abnormalities in β-alanine (87.3%) and α-aminoadipic acid (86.2%). Stratified by geographic region, the analytic sensitivity was higher (P < 0.001) among US (95.1%, 1,122/1,180) than among international participants (89.4%, 430/481). Among US participants, 53 (58%) had no incorrect responses for analytic sensitivity. Four of these 91 laboratories (4%) accounted for 14 errors (between 3 and 6 each), representing 24% of all analytical errors. Among international participants, 23 (46%) had no incorrect responses for analytic sensitivity; six of these 50 laboratories (12%) accounted for 22 errors (between 3 and 5 each), representing 43% of all analytic errors.
The overall clinical sensitivity for all participants was 91.0% (1,514/1,663), but the clinical sensitivity was significantly higher among US participants (92.9%; 1,103/1,188) than international participants (86.5%; 411/475) (P < 0.001). Clinical sensitivity was highest for more easily recognizable conditions (phenylketonuria, 100%) and lowest for combined homocystinuria/methylmalonic aciduria (2008A, 76.3%). Despite compelling clinical scenarios, clinical sensitivity was also lower for several challenges where the differential diagnosis was broad, including 83.6% for homocystinuria (2006A, cystathionine-β-synthase deficiency) and 84.0% for primary lactic acidemia (2008B). Incorrect clinical interpretations were stratified by incorrect diagnosis (e.g., a sample with elevated citrulline mistakenly diagnosed as ornithine transcarbamylase deficiency rather than citrullinemia) versus "normal" clinical interpretation. This latter group of false negatives represented 2.6% (44/1,663) of all interpretations over the 10-year period. Among US laboratories, the rate of 2.4% (29/1,188) was not different from the 3.2% (15/475) found among international laboratories (P = 0.41). Since 2010, however, laboratories have improved, with only one false-negative result occurring in more than 850 distinct clinical interpretations. A summary of AA performance results along with 95% confidence intervals is provided in Table 1.
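The stratification of incorrect interpretations into the two false-negative types can be sketched as follows. This is an illustrative sketch under stated assumptions, not the grading logic used by the program: the function names, the category labels, and the representation of acceptable answers as a set are all hypothetical, and the sketch applies only to challenges from affected individuals.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

NORMAL = "Normal, unaffected"


def classify(reported: str, acceptable: Set[str]) -> str:
    """Classify one interpretation of a challenge from an affected patient.

    'acceptable' holds the diagnoses graded as acceptable for the challenge.
    """
    if reported in acceptable:
        return "correct"
    if reported == NORMAL:
        # No disorder recognized: the classic false negative.
        return "false_negative_normal"
    # Abnormality recognized, but the wrong disorder was reported.
    return "incorrect_diagnosis"


def summarize(responses: Iterable[Tuple[str, Set[str]]]) -> Counter:
    """Count interpretations in each category."""
    return Counter(classify(reported, acceptable) for reported, acceptable in responses)


# Made-up responses to a single citrullinemia challenge.
acceptable = {"Citrullinemia"}
responses = [
    ("Citrullinemia", acceptable),
    ("Citrullinemia", acceptable),
    ("Ornithine transcarbamylase deficiency", acceptable),
    (NORMAL, acceptable),
]

counts = summarize(responses)
clinical_sensitivity = counts["correct"] / sum(counts.values())
print(counts, clinical_sensitivity)
```

Both error categories lower clinical sensitivity equally; separating them only distinguishes a missed abnormality from a misattributed one.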

The overall clinical sensitivity for OA testing was 90.5% (1,301/1,437), and was significantly higher among US (92.9%; 947/1,019) than international (84.7%; 354/418) participants (P < 0.001). Interpretive performance was excellent for challenges that included specimens with prominent metabolite elevations and poor for challenges containing specimens with complex or subtle findings such as glutaric acidemia type II (2008B, 85.3%) or isobutyryl-CoA dehydrogenase deficiency (2006B, 75.0%). Among 73 US participants, 36 (49%) had no interpretive errors; 29 (40%) had one or two errors, and one laboratory accounted for nine errors. From 72 total errors, 43 (60%) reflected an incorrect diagnosis and 29 (40%) were false negatives (normal diagnosis). Among 40 international participants, 12 (30%) had no incorrect clinical interpretations, but two had four errors and one had five errors. In this group, there were 64 errors, of which 35 (55%) were incorrect diagnoses but 29 (45%) were identified as normal.
A single OA specimen of fumarase deficiency was distributed in 2005 and 2011, with quantitative data reported by 25 and 30 participants, respectively (Figure 2b). Median fumaric acid concentrations were 451 and 457 mmol/mol creatinine for the 2005 and 2011 mailings, respectively. The [...]
Table 1 Estimates of analytic sensitivity for the four schemes included in the biochemical genetic laboratories proficiency testing program stratified by laboratory location
[...] Figure S3). The overall analytic sensitivity was 91.8% (663/722), with 93.1% (471/506) in US laboratories and 88.9% (192/216) in international laboratories (P = 0.083). Analytic sensitivity was 100% for several challenges including MCAD, MMA/propionic acidemia, and a recent very long-chain acyl-CoA dehydrogenase deficiency challenge (2014A). The lowest analytic sensitivity (64.5%) was seen for the 2009A challenge of long-chain hydroxyacyl-CoA dehydrogenase deficiency. Among all 65 participants, 29 (45%) had no analytical errors. One or two errors were made by 19 (29%) and 12 (18%) participants, respectively. Five laboratories (three US and two international, or 8%) reported three or more errors, and these participants accounted for 16 of the 59 total analytic errors (27%).
The overall clinical sensitivity for AC PT was 88.0% (650/739), but this rate was higher among US laboratories (90.1%; 462/513) than among international laboratories (83.2%; 188/226) (P = 0.010). For three challenges (2005B, 2006A, and 2014A), all laboratories identified the correct abnormal analyte(s) (100% analytic sensitivity) and also provided the correct clinical interpretation (100% clinical sensitivity). Challenges with lower clinical sensitivities either had nonspecific analytic findings compatible with more than one disorder (e.g., elevated C5OH in β-ketothiolase deficiency (77.5%); low free carnitine in carnitine uptake defect (75.7%)), or had only modest analyte elevations, including both challenges containing specimens from known patients with long-chain hydroxyacyl-CoA dehydrogenase deficiency (75.0% and 74.4%). Among the 65 participants, 23 (35%) had no interpretive errors during the 10-year period and 31 (48%) had one or two errors. Eleven laboratories (17%) had three or more errors, including one with five and two with six. Among these 89 errors, a subset were false-negative results in which no disorder was identified. The rates for this type of error differed significantly by location (P = 0.001), at 2.7% (14/513) and 8.0% (18/226) among US and international participants, respectively (Table 2).
Among challenges constituting a specimen submitted for PT more than once, and for which results included quantitative AC values, concentrations for significant AC results are shown in Figure 2c [...]
[...] Table S1). Results were reported for a screening assay for total MPS and/or a fractionation assay (elevations of keratan sulfate, dermatan sulfate, heparan sulfate, and chondroitin sulfate). Overall, 72 laboratories (42 US and 30 international) participated in the PT. Dimethylmethylene blue binding assay (50%) was the most common screening method, followed by toluidine blue (16%), Berry spot test (11%), and Alcian blue spot test (5%). Thin-layer chromatography and electrophoresis were reported among fractionation methods in similar frequencies, but LC-MS/MS increased to 11% by 2014.
MPS performance is summarized in Tables 1 and 2 and in a heat map (Supplementary Figure S1). Two MPS-VII samples (2008A, 2011A) were excluded from analysis due to the lack of participant consensus (<80%). The combined analytic sensitivity for all screening methods was 91.2% (497/545) with similar (P = 0.060) performance by US [...]
Table 2 Estimates of clinical sensitivity for the four schemes included in the biochemical genetic laboratories proficiency testing program stratified by laboratory location, along with the rate of the two types of false-negative errors
For screening assays, 86% of laboratories (62/72) made one or fewer analytical errors (56% had no errors, 31% had one error). In addition, 10 laboratories (14%) made two or more errors apiece and were responsible for 26 of the 48 errors (54%). For the 49 laboratories utilizing fractionation assays, 33% (16) had no errors and 51% (25) had one error. The remaining eight laboratories (16%) made two or more errors apiece and were responsible for 20 of 45 errors (44%). Of the 56 laboratories reporting interpretations, 70% (39/56) had none or one interpretative error. The remaining 17 (30%) laboratories with two or more errors were responsible for 70% (50/71) of the errors.
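The concentration of errors among a few laboratories can be quantified with a simple calculation, sketched below in Python. The function name is hypothetical, and the per-laboratory counts are partly assumed: 40 laboratories with no errors and 22 with one error are consistent with the reported screening-assay figures (62 of 72 with one or fewer), but the split of the remaining 26 errors across 10 laboratories is invented for illustration.

```python
from typing import List, Tuple


def error_concentration(errors_per_lab: List[int], threshold: int) -> Tuple[int, int, float]:
    """Summarize errors made by laboratories at or above an error threshold.

    Returns (number of such laboratories, their combined errors,
    their share of all errors).
    """
    total_errors = sum(errors_per_lab)
    if total_errors == 0:
        raise ValueError("no errors recorded")
    high = [e for e in errors_per_lab if e >= threshold]
    return len(high), sum(high), sum(high) / total_errors


# 72 screening-assay participants; the distribution among the ten
# laboratories with two or more errors is an assumption.
screening_errors = [0] * 40 + [1] * 22 + [2] * 6 + [3] * 2 + [4] * 2

n_labs, n_errors, share = error_concentration(screening_errors, threshold=2)
print(n_labs, n_errors, f"{share:.0%}")
```

Under these assumed counts, the ten laboratories with two or more errors account for 26 of 48 errors, matching the proportion reported for screening assays.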

DISCUSSION
The vast majority of PT participants accurately identified diagnostic abnormalities and correctly interpreted corresponding genetic conditions. Performance was excellent for challenges with prominent and characteristic biochemical abnormalities, including samples from patients with classic genetic conditions such as phenylketonuria, MMA, and very long-chain acyl-CoA dehydrogenase deficiency. In contrast, performance waned when challenges involved more subtle analytical findings (e.g., minimally elevated C16-OH AC concentrations seen in long-chain hydroxyacyl-CoA dehydrogenase deficiency), diagnoses relying on the recognition of a complex, multianalyte pattern (e.g., multiple acyl-CoA dehydrogenase deficiency) and ultrarare conditions (e.g., mevalonic aciduria). Importantly, the proportion of participants completely missing a definitive diagnosis was low, and the majority of conditions were recognized as having abnormal biochemical profiles, even if the precise diagnosis provided was deemed unacceptable in the PT scheme.
Among the selected laboratories, performance over this study period was consistently higher among US laboratories than international laboratories. The reason for this tendency is unknown, but several differences between laboratories may account for it. In particular, there exist specialized US laboratories (Clinical Biochemical Genetic Laboratories) devoted to the diagnosis and management of IEMs, and the United States has accredited training for clinical biochemical genetic laboratory directors, overseen by the American Board of Medical Genetics and Genomics, that aims to ensure excellence among laboratory directors through continuing education, among other activities. International laboratories may not have access to accredited training programs. Additionally, PT specimens may face sporadic delays during shipment to distant destinations, thereby affecting sample integrity and impacting interpretation. Indeed, participating laboratories are requested to ensure that specimens arrive frozen and within their storage requirements. Nevertheless, adherence to this request relies on self-reported observations that may be hindered by storage, unpacking, and repackaging at customs office locations. Finally, because the incidence of rare diseases differs among populations worldwide, a genetic condition that is common in one part of the world may seldom be observed, if at all, in another region. Hence, laboratories can have divergent experiences with the full spectrum of IEMs. Thus, the difference in performance may be due to a combination of these or other factors.
Because multiple factors probably contributed to performance variance for both US and international laboratories, a central function of PT is to prompt assay troubleshooting, process refinement, and quality improvement when failures do occur. This includes prompting laboratories to review their processes for analyte extraction and derivatization, method calibration, and instrument optimization. Analytic errors may also reflect limitations in available laboratory methods, such as for MPS screening and fractionation where assay shortcomings are well recognized 18 and addressed by emerging new LC-MS/MS methods. 19,20 Some errors may result from constraints inherent to the PT process itself, where lower estimates of clinical sensitivity may be due to the lack of supporting clinical information or results of other laboratory tests. This underscores the importance of clinical history when interpreting analytic findings, as pointed out by other groups. 21 Finally, because laboratories submit PT results outside of their normal data transmission and reporting protocols, inadvertent clerical errors may also occur.
Incorrect interpretations also may have resulted from the strict requirement to report a specific diagnosis, even in cases for which additional testing may be warranted in actual practice. For example, a suspected case of MCAD deficiency identified by OA analysis typically requires additional confirmation by AC analysis, and positive MPS screening and fractionation results require confirmation by enzyme and/or DNA testing. 21 Responses representing an incorrect diagnosis (as opposed to a "normal" response in the face of a true abnormality) were classified as false-negative results for this report. Although considered a PT failure, in practice an incorrect diagnosis still is likely to be followed by repeat testing, additional studies, or gathering of additional clinical or family history. Indeed, some diagnoses included in PT schemes, such as carnitine uptake defect (an AC challenge) or glycine encephalopathy (a plasma AA challenge), may have been particularly challenging since these conditions are more appropriately diagnosed in conjunction with other tests (free and total carnitine and cerebrospinal fluid AA analysis, respectively). Nonetheless, despite the inherent constraints of PT, a correct diagnosis could most often unequivocally be assigned based on results of a single test.
Among other PT programs available for BGLs, the European consortium ERNDIM provides challenges for qualitative OA and urine MPS schemes that are similar to the CAP/ACMG program. As a comparison, reported ERNDIM performance (see http://www.erndim.org/home) for 2015 showed that accuracy ranges from 60 to 89% for MPS and 85 to 100% for OA. Hence, the PT performance of laboratories observed in the CAP/ACMG scheme is similar to that found in other programs.
From this study, current clinical biochemical genetic laboratories demonstrated good overall performance on PT. Remaining challenges for laboratories include the need for standardizing laboratory methods and reagents, and ensuring that ongoing education enhances the awareness of genetic disorders, and their diagnostic profiles, thereby benefiting patient care.

SUPPLEMENTARY MATERIAL
Supplementary material is linked to the online version of the paper at http://www.nature.com/gim