Introduction

The Noonan (NS, MIM 163950), cardio-facio-cutaneous (CFCS, MIM 115150), Costello (CS, MIM 218040) and LEOPARD (LS, MIM 151100) syndromes are members of the neuro-cardio-facio-cutaneous syndromes group (NCFCS).1 Although the prevalence of these clinically related developmental disorders has not been determined accurately to date, NS is thought to be relatively common.2 The overlapping clinical features among these disorders and the wide spectrum of phenotypes within each trait, associated with the absence of consensus on routinely used diagnostic criteria, make diagnosis of NCFCS quite challenging.1, 3 The diagnosis of such disorders is particularly difficult to perform in utero and in the first years of life, as most features manifest later during childhood.4, 5 Therefore, molecular analysis has become an important key for the verification of clinical diagnosis and represents a highly informative prognostic tool, with direct impact on the establishment of individual follow-up plans.6, 7

Heterozygous mutations in 11 genes associated with the RAS/MAPK signaling pathway have been causally linked to these disorders: PTPN11 (MIM 176876), SOS1 (MIM 182530), RAF1 (MIM 164760), BRAF (MIM 164757), MAP2K1 (MIM 176872), MAP2K2 (MIM601263), KRAS (MIM 190070), HRAS (MIM 190020), NRAS (MIM 164790), SHOC2 (MIM 602775) and CBL (MIM 165360).8, 9, 10, 11 These genes have been documented to account for 70–75% of the NCFCS cases.1 Recently, two additional genes were also associated with these syndromes: RIT1 (MIM 609591) that encodes a member of the RAS subfamily,12 and A2ML1 (MIM 610627) that encodes a complement factor protease inhibitor capable of activating the MAPK cascade.13 Independent A2ML1 and RIT1 variants were identified in individuals with NS without detectable mutations in known Noonan-related genes. In vivo studies of the variants in both genes resulted in developmental defects comparable to defects induced by disease-causing variations in other known NS genes.12, 13

Sanger sequencing is the current gold standard method for the molecular characterization of patients with NCFCS. However, because of the low throughput of this sequencing strategy, the screening of the 13 genes is costly and time consuming. Current molecular testing strategies focus on multigene panels composed of small sets of genes/exons that are selected based on relative frequencies of mutated genes.14 Moreover, the phenotypic overlap of the syndromes, the high number of genes and lack of mutation hot spots makes it difficult to define straightforward strategies of gene selection for sequential molecular testing.10, 15 In this study, the benchtop Ion PGM Sequencer (Life Technologies, Carlsbad, CA, USA) was used to develop and validate a massive parallel sequencing (MPS)-based approach, by a targeted resequencing strategy, for the molecular diagnosis of patients with NCFCS. The development and validation of the MPS-based approach was performed according to the checklist published by the College of American Pathologists,16 and the guidelines developed by the Next-Generation Sequencing: Standardization of Clinical Testing (Nex-StoCT) workgroup17 and by the American College of Medical Genetics and Genomics Working Group on Next-Generation Sequencing.18

Materials and methods

DNA samples selection

Constitutional genomic DNA samples from a total of 35 patients with diagnosis of NCFCS were collected from our archives and used in this study. All samples had been previously screened by Sanger Sequencing for a selected set of exons associated with the clinically suspected syndrome (Supplementary Figure 1). The samples were divided into two groups: (1) training set, composed of 15 selected samples (12 cases of NS, 1 cases of CFCS, 1 case of CS and 1 case of LS), of which 10 harbored known disease-causing variants, used to optimize the strategy; and (2) validation set, composed of 20 random samples (19 cases of NS, including 3 cases of prenatal diagnosis, and 1 case of CFCS) studied blindly and used to validate and evaluate the power of the new methodology. In addition, 100 DNA samples from unaffected control subjects were studied to assess the occurrence and frequency of novel and rare variants (minor allele frequency <2%) of unknown clinical significance. The study was performed in accordance with Article 19 (‘DNA Banks and other biological products’) of Portuguese Law No. 12/2005 of 26 January (‘Personal genetic information and health information’).

Targeted enrichment of genomic DNA

For the gene targeted enrichment, a multiplex PCR-based strategy was designed. The sequences of the 13 genes associated with the NCFCS were retrieved from the NCBI Reference Sequence Database (www.ncbi.nlm.nih.gov/refseq; Table 1). Primer pairs were designed to amplify all coding exons, including 50 bp of noncoding regions flanking the 5′ and 3′ ends of each exon, using Primer3 web-based tool.19 Multiplex PCR reactions were performed following the QIAGEN Multiplex PCR Handbook 10/2010 (QIAGEN, Hilden, Germany) specifications. Each reaction consisted of 2 μl of 10 × Primer mix, 12.5 μl of multiplex master mix, 2 μl of Q-Solution and 50–100 ng of genomic DNA in a total reaction volume of 25 μl. The concentration of each primer in the 10 × Primer mix was adjusted in order to make uniform the representation of the different amplicons in each multiplex assay (primers concentration varied between 1 and 3 μ M). Oligonucleotide primer sequences and PCR amplification conditions are detailed in Supplementary Table 1.

Table 1 NCFCS-associated genes studied

Ion PGM system library preparation and sequencing

MPS was carried out on the Ion PGM Sequencer according to the manufacturer’s instructions. Library preparation was performed according to the protocol Ion Xpress Plus gDNA and Amplicon Library Preparation PN 4471989 Rev. D (Life Technologies). Briefly, for each sample, amplicons generated by the developed PCR multiplex-based strategy were pooled and purified using the AMPure XP reagent (Beckman Coulter, Carnaxide, Portugal). Then, 100 ng of pooled amplicons were sheared by enzymatic digestion for 7 min at 37 °C with the Ion Xpress Plus Fragment Library Kit PN4471269 (Life Technologies). The digested products were purified and adapters or barcode adapters ligated. E-Gel SizeSelect 2% Agarose system (Life Technologies) was used to carry out fragment size selection. The size-selected fragments were nick-translated and PCR-amplified for a total of eight cycles. Following purification, library molarity was determined, and barcoded libraries pooled in equimolar concentrations and used for clonal amplification by emulsion PCR. Template preparation was performed according to the protocol Ion OneTouch System PN4472430 Rev. E (Life Technologies). Quality control of the ISPs was done with the Ion Sphere Quality Control Kit (Life Technologies). The complete ISP sample (for 316 Chip) or half the ISP sample (for 314 Chip) was prepared for sequencing using the protocol Ion PGM 200 Sequencing Kit PN4469714 Rev. C (Life Technologies), loaded on the chip and sequenced on the Ion PGM Sequencer for 130 cycles, using 520-flow runs.

Bioinformatic analysis

Data from Ion PGM System sequencing runs were automatically transferred to the dedicated Torrent Server and processed using platform-specific pipeline software Torrent Suite v3 (Life Technologies) to generate sequencing reads and base-call quality scores, filter low-quality reads, trim adapter sequences and split the reads according to the barcode. Processed reads were stored in the FASTQ file format. The sequence files were aligned to reference sequences previously annotated for the covered amplicons, using SeqMan NGen v4 (DNAStar, Madison, WI, USA). Coverage analysis, variant discovery and base-calling statistics were performed using SeqMan Pro v10 (DNAStar).

A dedicated variant prioritization pipeline (VPP) was developed for data analysis in order to detect all the true positive calls and minimize the number of false positive results.20 The VPP was fine-tuned using all known variants present in the training set as surrogate markers. Genetic alterations observed in exons that were not included in our original strategy (Supplementary Figure 1) were validated by Sanger sequencing. Different filter thresholds related to sequencing calling required for variant identification were weighted: percentage of the single most prevalent nonreference base (SNP% ≥25), phred-like quality score of the called genotype (Qcall ≥40) and depth of coverage (Depth ≥30). An additional interpretative exclusion setting related to the Ion PGM System sequencing error profile was used to minimize the number of false positive calls. Ion PGM System is prone to falsely call single-base indels mostly because of nonlinear pH variation relative to homopolymer length during the DNA polymerization.21 Such sequencing miscalls are nonrandomly distributed, systematic (appear in more than one case and in different runs) and have a strand-specific bias pattern (occur preferentially on one DNA strand compared with the other).20 Calls that fulfill these characteristics were excluded from the analysis. The established VPP was then applied to the validation set. Sanger sequencing was performed in genes/exons that were not included in the previously used multigene panel to confirm all variants in replicate PCR amplicons and to fill-in regions with insufficient coverage. The framework for the validation of clinical molecular genetic tests set by the EuroGentest Validation Group was used to calculate the performance of the new approach.22 For these calculations, the regions sequenced by both the gold standard Sanger and Ion PGM Sequencer were used. These regions correspond to the exons detailed in Supplementary Figure 1 and the amplicons covering variants confirmed by Sanger sequencing (all true positive and false positive calls Ion PGM System calls).

Sanger sequencing

The PCR reactions were enzimatically purified using FastAP Thermosensitive Alkaline Phosphatase and Exonuclease I (Fermentas, Pittsburgh, PA, USA), sequenced with BigDye Terminator v3.1 Cycle Sequencing Kit (Applied Biosystems, Carlsbad, CA, USA) and run on the ABI PRISM 3130xl Genetic Analyzer (Applied Biosystems) according to the instructions of the manufacturers. Variant analysis was carried out with the software Mutation Surveyor 3.24 (Softgenetics, State College, PA, USA).

Sequence variant interpretation

Validated variants were compared with mutations reported in literature as causative of NCFCS and with the data from National Center for Biotechnology Information (www.ncbi.nlm.nih.gov), 1000 Genomes project (www.1000genomes.org), Human Gene Mutation Database (www.hgmd.cf.ac.uk) and NHLBL Exome Sequencing Project (evs.gs.washington.edu/EVS/). In addition, for novel and rare variants (minor allele frequency <2%) of unknown clinical significance, a set of 200 control alleles (100 unaffected subjects) were sequenced by MPS on Ion PGM Sequencer according to the method described above. The functional effects of novel and rare variants that were absent from the control population were predicted using different in silico-based methods: PolyPhen-2,23 MutationTaster,24 SNPs&Go25 and Human Splicing Finder.26 These variants were submitted to a public database (www.ncbi.nlm.nih.gov/SNP/).

Sequence variant nomenclature

The sequence variant nomenclature used in this work is in accordance with the Human Genome Variation Society’s guidelines.27 DNA alteration numbering is based on the cDNA sequence with position +1 corresponding to the A of the ATG translation initiation codon in the reference sequence. Sequence variant descriptions on the protein level consider the initiator methionine as codon 1 and have been checked using the Mutalyzer program.28

Results

Multiplex PCR strategy and sequencing

For the enrichment of all coding regions of the 13 genes a multiplex PCR-based strategy was designed (Table 1). A total of 65.7 kb of genomic DNA was amplified using 158 primer pairs distributed in 12 multiplex and 8 simplex PCR reactions. Amplicon sizes ranged from 217 to 943 bp, with an average length of 413 bp (Supplementary Table 1). Taking into consideration the 65.72 kb of genomic sequence for each sample and the minimum desired coverage per base (30 reads), it was estimated a maximum of 1 sample per 314 chip and up to 6 samples to be barcoded and pooled per 316 chip. The Ion PGM System sequencing output was dependent on the chip-loading efficiencies. For the 314 chip, an average loading density of 61% was obtained that resulted in >350 000 reads. For the 316 chip, an average loading density of 70% was obtained that resulted in >2 000 000 reads. The average read length was 184 bases.

Training set

The training set was primarily used to develop and optimize the analytical wet bench process and the subsequent data analysis pipeline. This set was composed of 15 samples from our archives with known synonymous polymorphisms (n=14) and different disease-causing alterations (n=10) that were used as positive controls. The coverage of all amplicons was ascertained before the development of the bioinformatic pipeline. To avoid the introduction of bias in calculations, the noncovered amplicons were annotated and excluded from the analysis. The post-assembly Ion PGM System data analysis was performed without the application of variant calling filters. Within the list of 2133 calls produced by the Ion PGM System, all the known 24 variants (14 polymorphisms+10 disease-causing variants) were correctly identified. Furthermore, a total of 164 genetic alterations in the genes/exons that were not included in the previously used multigene panel were also identified. To validate these new variants, Sanger sequencing was carried out in replicate PCR amplicons. The validation of these genetic alterations resulted in 188 true variants (24 previously known variants+164 new variants) that were used as surrogate markers to develop the custom bioinformatic parameters for variant prioritization (Supplementary Table 2). Different parameters related to sequencing calling were set in order to determine the thresholds required to filter-in the real variants. To exclude the possibility of false negative results and to minimize the number of false positives, the filter thresholds were set at SNP% ≥25, Q call ≥40 and Depth ≥30. To further attenuate the false positive rate, known limitations of the Ion PGM System sequencing chemistry were used as an additional interpretative exclusion criterion. The application of this VPP to our training set resulted in a list of 211 variants, containing all 188 known variants and 23 additional false positives.

Validation set

The validation set, composed of 20 random samples, was used to validate and independently evaluate the performance of the workflow. Upon Ion PGM System sequencing, the coverage of all amplicons was confirmed in order to exclude the possibility of false negative results. The rare amplicons that did not achieve the established 30 reads of coverage were annotated and re-sequenced by Sanger method (average of 3 amplicons per sample; Figure 1). The Ion PGM System sequencing data were then analyzed in a blind manner through the application of the developed VPP to the 2855 generated calls. Briefly, the sequencing calling analysis step reduced the number of putative variants to 1106. All calls that were known polymorphisms (235) were identified and confirmed by Sanger sequencing. The exclusion of these polymorphisms, followed by the application of the interpretative exclusion setting, resulted in the identification of 38 variants. Validation of these variants through Sanger sequencing resulted in the detection of 24 single-base substitutions. Validation of one of these variants is comprehensively illustrated in Figure 1. The remaining 14 discordant variants were falsely called indels identified by the Ion PGM System, 9 of them in homopolymer regions longer than 4 nucleotides (Supplementary Table 3). To calculate the sensitivity and specificity of the assay, the regions sequenced by both technologies were considered. This resulted in a total of 1676 Ion PGM System calls. The application of the developed VPP to these data correctly filtered out 1403 calls (true negatives), reducing the number of putative variants from 1676 to 273. No true variant went undetected and, of the 273 identified variants, 259 were confirmed by Sanger sequencing and 14 were false positive calls. According to these results, the sensitivity and specificity of this assay were 100% and 99.0%, respectively. The experimental outcome of the validation set is comprehensively illustrated in Figure 2.

Figure 1
figure 1

Illustrative example of the detection and validation of a variant within the SHOC2 gene. The coverage view (top panel) depicts the depth of coverage profile for the coding region of the SHOC2 gene. The orange arrow points to an amplicon (exon 9) with insufficient coverage (<30 reads) that was studied through Sanger Sequencing. The beginning of exon 2 is expanded to show the underlying DNA sequence. The Ion PGM System sequencing (middle panel) shows the sequence alignment of Ion PGM System reads against the reference sequence of exon 2 of the SHOC2 gene. The green and red arrows indicate the forward and reverse reads, respectively (for readability, not all aligned reads are shown). The red box highlights the detected heterozygous missense variant c.4A>G. Sanger sequencing (bottom panel) confirms the genetic alteration. The red arrow indicates the variant position in the electropherogram. The full colour version of this figure is available at European Journal of Human Genetics online.

Figure 2
figure 2

Variant prioritization pipeline applied to the validation set and experimental outcome of the assay. The analysis of the genomic regions covered by both Sanger and Ion PGM System sequencing resulted in a total of 1676 Ion PGM System calls. Sequencing quality filters (SNP% ≥25, Q call ≥40 and Depth ≥30) filtered out 984 calls, reducing the number of putative variants from 1676 to 692. Of these 692 calls, 235 were known polymorphisms that were confirmed by Sanger sequencing. Filtering out the 235 validated calls reduced the number of putative variants to 457. Interpretative exclusion criteria filtered out 419 calls. A total of 38 putative variants were directly obtained from the VPP output. Of these 38 variants, 24 were validated through Sanger sequencing as real and 14 were falsely called indels. Experimental outcome of this validation assay resulted in 259 Sanger-confirmed calls or true positives (TP); 14 miscalls or false positives (FP); 1403 correctly filtered out calls or true negatives (TN); 0 incorrectly filtered out calls or false negatives (FN).

Genetic variants identified in the validation set

Concerning the 259 Sanger-confirmed variants, 236 were in coding regions whereas 23 were in intronic regions (Supplementary Table 4). Only the intronic change c.3347-1G>A in SOS1 had not been reported previously. Of the 236 variants detected in coding regions, 126 were synonymous (63 heterozygous and 63 homozygous), 106 nonsynonymous (46 heterozygous and 60 homozygous) and 4 were heterozygous deletion c.3676_3677delGC (p.(Ala1226Glnfs*34)) within the coding region of A2ML1. Of the 106 missense variants, 83 were common polymorphisms located in A2ML1 gene. Within the remaining 23 missense alterations, 13 were variants that had already been established as causative of NCFCS. The remaining 10 missense variants were novel and rare variants of unknown clinical significance located in the A2ML1, CBL and SOS1 genes.

A known disease-causing variant for NCFCS was detected in 13 patients (12 NS and 1 CFCS). In two of these cases (patients 12 and 13), one rare variant of unknown significance in A2ML1 gene was found to co-occur (Table 2). For the remaining seven patients with no described NCFCS genetic defects, five had new or rare variants of uncertain disease causality. In two of these patients (patients 14 and 18), genetic alterations of uncertain pathogenicity were found to co-occur (Table 2). All novel and rare variants of unknown clinical significance were absent from the 100 control samples. The experimental outcome of in silico prediction tests for these variants is presented in Supplementary Table 5.

Table 2 Genetic profile of the 20 validation set samples

Discussion

The aberrant activation of the RAS/MAPK cascade underlies a number of clinically and genetically related developmental disorders that are grouped within the NCFCS family. Four of these disorders – NS, CFCS, CS and LS – share numerous similarities of phenotypic features, making the differential diagnosis challenging, particularly prenatally and in newborns.1, 5 Until now, heterozygous variants in 13 genes of this pathway (PTPN11, SOS1, RAF1, BRAF, KRAS, NRAS, HRAS, MAP2K1, MAP2K2, SHOC2, CBL, A2ML1 and RIT1) have been documented to be causally associated with NCFCS.3, 11, 12, 13 Currently, because of genetic heterogeneity, only the most prevalent disease genes are assessed in routine molecular diagnosis of the NCFCS. Moreover, the low throughput of current gene-diagnostic tools makes it both challenging and time consuming. Although MPS has the potential to offer a solution for this problem, its implementation in a diagnostic setting is complex as it requires several steps of validation.20

This study presents the validation of a MPS-based approach for the molecular diagnosis of patients with NCFCS and a strategy for its implementation in a clinical laboratory environment. For that, a multiplex PCR-based strategy for the enrichment of the entire coding region of the 13 NCFCS disease genes was designed. A PCR-based strategy was chosen to directly adapt some of the primers and conditions previously used in conventional Sanger sequencing. Although multiplex PCR has been shown to be a robust enrichment technique, it becomes less practical as the target size increases. Capture arrays and the recently available commercial alternatives will facilitate scaling up to larger targets.

The training set was used to optimize the PCR conditions in order to ensure that all exons were sequenced and adjust the representation of the different amplicons in the library pool. Despite efforts for equimolar amplicon pooling, the read-coverage patterns were not uniform along the targeted region. The issue of coverage variability seems to be linked to unequal enrichment efficiencies, library preparation aspects, particularities of local sequence context, alignment issues29 and, in the specific case of Ion PGM System, chip-loading efficiencies. Variability in coverage requires samples to be oversequenced to achieve the minimum desired coverage per base required for variant identification (30 reads).20 The adjustments made in the PCR and sample pooling conditions largely improve amplification uniformity and reduce the number of amplicons below the coverage threshold. Nevertheless, because of unpredictability of the missing amplicons, we observed that the initial step of amplicon coverage analysis was of utmost importance to avoid false negative results. In the rare cases where amplicons could not be properly analyzed, mostly because of PCR skipping or high GC content, they were Sanger sequenced in parallel.

Although MPS provides several advantages over the Sanger method, the amount of data generated by these sequencing techniques poses informatic challenges for data analysis.30 Currently, there is no standard data analysis pipeline and different criteria for variant identification have been used in MPS studies.29, 31 In this study, known true positive variants of the training set were used as surrogate markers to customize the data analysis process and fine-tune the VPP. The basic principle of the VPP was to filter out random sequence errors without the introduction of false negatives. The validation set was then used to assess the power of the new methodology and report its diagnostic accuracy.22 Thus, at a 99% confidence, the maximum analytical sensitivity of the study is ≥98.2% and this means that the probability of a future false negative result is 1.8%. The level of specificity obtained (99.0%) is higher than the levels reported in previous studies concerning the Ion Torrent technology.21, 31 As expected, the false positive calls were mostly single-base indels located in short homopolymer regions with a strand-specific bias. Furthermore, a high number of these errors were observed within motifs of two nucleotide homopolymers, principally CC and GG. As the sequencing chemistry and bioinformatic algorithms continually improve, a reduction in the Ion PGM System error rate in the near future is expected.

In this study, the molecular characterization of the validation set samples revealed 13 known disease-causing alterations and 11 novel and rare variants of unknown significance (10 missense and one in a canonical splice site). Six of these variants were found to co-occur with alterations described as causative of NCFCS or with other uncommon variants. Co-occurring variants in different RAS/MAPK pathway-related genes have already been described in NCFCS patients.32, 33 However, it is not known whether these co-occurring variants lead to more severe or atypical phenotypes. In fact, to predict the disease causality of the novel variants, either alone or in co-occurrence, remains challenging. For a deeper understanding of the effects of novel sequence variants, it would be necessary to assess segregation patterns of the variants, analyze mRNA transcripts and perform functional assays in genetically manipulated cell cultures and animal models. Bioinformatic prediction tools can also be used, but should be interpreted with caution.34

The presented workflow provides a comprehensive genetic screening tool for patients suspected of NCFCS, allowing the detection of sequence variants in previously reported genes regardless of clinical diagnosis. Importantly, the analysis of the 13 genes allowed the identification of 13 NCFCS patients with a previously described disease-causing variant out of the 20 cases of the validation set. If the molecular characterization of these cases had been restricted to the previously used multigene panel (Supplementary Figure 1), only 11 of the cases would have been reported as positive, as the panel did not encompass the genetic screening of RIT1 and SHOC2 genes. As a result, by screening less frequently altered genes, the new methodology increased in 10% the diagnostic yield of the strategy currently used for the molecular diagnosis of NCFCS. Additionally, with the new methodology the costs and turn-around-time were reduced. The time reduction was of particular importance for prenatal genetic diagnosis.4 For instance, the results of the screening of the 20 validation set cases were obtained within 5 days: 2 days for target enrichment and library preparation, 1 day for template preparation and sequencing and 2 days for data analysis and interpretation. In contrast, to screen the same 20 cases using Sanger sequencing, 90 days would have been needed: 2 days for target enrichment and enzymatic purification, 60 days to perform 6320 sequencing reactions and electrophoretic runs (158 amplicons per case in forward and reverse sequencing directions) and 30 days for data analysis and interpretation.

In the current validation study, we designed a combined targeted MPS–Sanger sequencing-based strategy for the molecular diagnosis of NCFCS using the Ion PGM System. This methodology is faster, more comprehensive and more cost effective than the routinely used Sanger sequencing multigene panel-based approach. The study provides practical evidences that a combined targeted MPS–Sanger sequencing-based strategy is suitable for daily clinical practice of genetic testing and can be used as an effective molecular diagnostic tool for multigenic diseases and for prenatal genetic testing.