Introduction

Next-generation sequencing (NGS) is now routinely used in many diagnostic genetic laboratories, as it allows simultaneous sequencing of multiple genes for multiple patients, which is a more cost-effective and more rapid approach than Sanger sequencing [1].

The main steps of the NGS technique are: enrichment of DNA regions of interest via DNA hybridization capture or PCR amplification, identification of each patient’s DNA with a barcode, pooling of patient DNA, sequencing on an NGS sequencer, and bioinformatics analysis of the raw data.

During the NGS preparation step, patient DNA samples are processed in parallel until the barcoding step. This step can be performed at different times, according to the type of NGS preparation technique, but always before pooling of samples to create the library. After sequencing of the library, each amplicon sequenced is assigned to a patient, according to its barcode.

Sequenced DNA is compared to the reference genome and the patient’s variants are listed in order to identify relevant variants; this is the variant calling step. The majority of the variants listed are usually single nucleotide polymorphisms (SNPs). In constitutional genetic analysis, the variants detected are characterized by their allelic ratio (AR): for a given variant, the AR is defined as (number of reads supporting the variant) / (number of reads at this position). A variant is expected to have an AR around 0.5 when the individual is heterozygous for this variant. It is noteworthy that, in the case of mosaicism, a variant is present in only a portion of the individual’s cells and the expected AR is then <0.5. If the individual is homozygous for the reference allele or for the alternative allele, the expected AR is 0 or 1, respectively.
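As a minimal illustration of this definition, the AR at one position can be computed from two read counts. The function name and the read counts below are hypothetical and are not taken from ART-DeCo.

```python
def allelic_ratio(variant_reads: int, total_reads: int) -> float:
    """AR = reads supporting the variant / reads covering the position."""
    if total_reads <= 0:
        raise ValueError("position has no coverage")
    if not 0 <= variant_reads <= total_reads:
        raise ValueError("variant_reads must lie between 0 and total_reads")
    return variant_reads / total_reads


# Made-up counts at 400x depth, one per situation described in the text.
hom_ref = allelic_ratio(0, 400)         # homozygous reference: AR = 0
heterozygous = allelic_ratio(200, 400)  # heterozygous: AR = 0.5
hom_alt = allelic_ratio(400, 400)       # homozygous alternative: AR = 1
mosaic = allelic_ratio(60, 400)         # mosaic variant: AR below 0.5
```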

Sample contamination is a major risk in NGS diagnosis, and needs to be controlled, as series of samples are processed in parallel. Sample contamination can lead to failure to identify a variant affecting function in the patient, which can be masked by a large quantity of contaminant. Another major risk is to wrongly conclude that a variant is present when it actually belongs to the contaminant. This risk is especially relevant for diseases associated with de novo variants, in which mosaicism can occur.

To address this issue, we developed a tool (ART-DeCo: Allelic Ratio-based Tool for Detection of Contamination) designed to detect contamination in constitutional NGS analysis. The strategy of this tool is based on the detection of SNPs presenting ARs not usually expected in constitutional analyses. ART-DeCo can be easily implemented in any NGS workflow to control for sample contamination.

Materials and methods

Library preparation and sequencing

Library preparation was performed manually with the SureSelect QXT kit (Agilent Technologies) on a home-made 384 kb gene panel. The first step of this preparation consisted of dilution of the patients’ gDNA in a 4 × 8 plate (4 columns of a standard 96-well plate). gDNA was then fragmented and adaptor-tagged. The library was purified using Agencourt AMPure XP beads (Beckman Coulter), amplified and re-purified. Samples were hybridized to the capture probes, and then captured using streptavidin-coated beads (Dynabeads MyOne Streptavidin T1, Life Technologies). Libraries were amplified to add barcodes and purified using beads. Libraries were then pooled and sequenced with a NextSeq 500® sequencer (Illumina®).

Nomenclature

For each SNP, a wild-type homozygous sample is indicated as Ref/Ref, whereas a homozygous sample for the alternative allele is indicated as Alt/Alt. Heterozygous samples are indicated as Ref/Alt.

Rationale

This strategy is based on the detection of SNPs presenting ARs that are unexpected in constitutional analyses, i.e. that deviate from 0 (homozygous wild-type Ref/Ref), 1 (homozygous alternate Alt/Alt) and 0.5 (heterozygous Ref/Alt) (Fig. 1). For each SNP, the AR of heterozygous samples should be 0.5, but it actually fluctuates around this value, e.g. due to differences in mapping quality between reads with or without mismatches. The heterozygous range was defined as [0.25–0.75]. This range could be narrowed and adapted to the heterozygous distribution if this distribution is known, but this is not necessary. Indeed, homozygous SNPs at the maximum contamination level of 50% result in AR values fluctuating around 0.25 and around 0.75, whereas lower contamination levels result in AR values <0.25 or >0.75 (Supplementary Table 1, Supplementary Fig. 1). Consequently, the [0.25–0.75] range allows accurate discrimination between poorly called heterozygous SNPs and sample contamination.
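The arithmetic behind these boundaries can be sketched with a simple mixing model. It assumes that both DNAs contribute reads in proportion to their share of the mixture and it ignores background noise. The function and dictionary names are illustrative and are not part of ART-DeCo.

```python
# Fraction of alternative alleles carried by each genotype.
ALT_FRACTION = {"Ref/Ref": 0.0, "Ref/Alt": 0.5, "Alt/Alt": 1.0}


def expected_ar(sample_gt: str, contaminant_gt: str, contamination: float) -> float:
    """Expected AR when a fraction `contamination` (0 to 1) of the DNA
    comes from the contaminant and the rest from the sample."""
    if not 0.0 <= contamination <= 1.0:
        raise ValueError("contamination must be between 0 and 1")
    return ((1.0 - contamination) * ALT_FRACTION[sample_gt]
            + contamination * ALT_FRACTION[contaminant_gt])


# Homozygous sample, heterozygous contaminant, 50% contamination:
# the AR sits on the edge of the heterozygous range.
print(expected_ar("Ref/Ref", "Ref/Alt", 0.50))  # 0.25
print(expected_ar("Alt/Alt", "Ref/Alt", 0.50))  # 0.75

# A lower contamination level keeps the AR outside [0.25, 0.75].
low = expected_ar("Ref/Ref", "Ref/Alt", 0.10)
print(low < 0.25)  # True
```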

Fig. 1

Trimodal distribution of allelic ratios (ARs) of SNPs. AR values for 14 SNPs in 1650 samples were used. For homozygous SNPs (Ref/Ref or Alt/Alt), the observed ARs are in [0; 0.005[ or ]0.995; 1], respectively. For heterozygous SNPs (Ref/Alt), the observed ARs are in [0.25; 0.75]. SNPs with ARs in [0.005; 0.25[ or ]0.75; 0.995] are unexpected in uncontaminated DNA samples.

For each SNP, the AR of homozygous samples is theoretically 0 or 1, but in practice is slightly different because of the background noise generated by polymerase, sequencing and alignment errors, index hopping or incomplete trimming of the adaptors (see “AR extraction” below) [2,3,4]. Background noise is generally low in Illumina® sequencing and expected values are usually observed. As the minimum depth for the SNPs under study was set at 200× (see “SNP selection” below), the background noise was set at 0.5%, in order to tolerate at least one read as default background. The expected AR intervals of homozygous SNPs used were therefore [0–0.005[ and ]0.995–1] for Ref/Ref and Alt/Alt genotypes, respectively. Consequently, unexpected ratios were situated in the [0.005–0.25[ and ]0.75–0.995] ranges (Fig. 1).
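The intervals above translate directly into a small classifier. The sketch below uses the cutoffs quoted in the text (0.005, 0.25, 0.75, 0.995); the function name and the returned labels are illustrative choices.

```python
MIN_DEPTH = 200              # minimum depth required for a SNP
NOISE = 1 / MIN_DEPTH        # one background read at 200x, i.e. 0.005
HET_LOW, HET_HIGH = 0.25, 0.75


def classify_ar(ar: float) -> str:
    """Assign an AR to one of the expected intervals, or flag it as unexpected."""
    if not 0.0 <= ar <= 1.0:
        raise ValueError("an allelic ratio must lie between 0 and 1")
    if ar < NOISE:                    # [0, 0.005[
        return "Ref/Ref"
    if ar > 0.995:                    # ]0.995, 1]
        return "Alt/Alt"
    if HET_LOW <= ar <= HET_HIGH:     # [0.25, 0.75]
        return "Ref/Alt"
    return "unexpected"               # [0.005, 0.25[ or ]0.75, 0.995]


for value in (0.001, 0.03, 0.48, 0.90, 0.999):
    print(value, classify_ar(value))
```

Only the ARs labelled "unexpected" point towards possible contamination; the three other labels correspond to the three modes of the distribution in Fig. 1.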

AR extraction

After trimming adaptors with Cutadapt using default parameters [5], reads were aligned with Bowtie2, allowing up to one mismatch in the 22 bp-long seed and reporting only unique alignments [6]. Reads with a mapping quality below 20 were filtered out. Variant calling software was not used, as we wanted to report any frequency within a focused list of SNPs. The DepthOfCoverage function from the Genome Analysis Toolkit (GATK) was used [7], together with the additional statistical analysis detailed below, to report the ARs of SNPs. To ensure analysis of high-quality data, only bases with a quality ≥20 were considered when determining the depth of coverage of the selected polymorphisms.

SNP selection

For each sample, the AR distribution of an SNP selection was computed by an algorithm in order to detect, identify and quantify the contaminant (Supplementary Fig. 2).

SNP selection only retained informative polymorphisms with a typical trimodal AR distribution. Among the 628 SNPs present in our 60-gene panel with a European population frequency in the range 0.1–99.9% in the 1000 Genomes database, those with recurrent high background noise (e.g. close to homopolymer stretches) were excluded (see the “optimization step” section below). Similarly, polymorphisms within paralogous genes were excluded to avoid misinterpretation of ARs distorted by expected misalignment. A total of 547 polymorphisms could then be analyzed. Only SNPs with at least 200× coverage were taken into account to allow detection of low contamination. SNPs that were homozygous for the same allele in all samples were non-informative and could not be used for analysis.
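The last two criteria (minimum depth and informativeness) can be sketched as a filter. The data layout, the function names and the SNP and sample identifiers are hypothetical. Applying the depth cutoff per sample and per SNP is an assumption, and the noise and paralogy exclusions are not modelled here.

```python
MIN_DEPTH = 200  # minimum coverage stated in the text


def rough_call(ar: float) -> str:
    """Coarse genotype call from an AR, using the 0.25 and 0.75 limits."""
    if ar < 0.25:
        return "Ref/Ref"
    if ar > 0.75:
        return "Alt/Alt"
    return "Ref/Alt"


def informative_snps(observations: dict) -> dict:
    """Keep well-covered SNPs whose genotype varies across the samples.

    `observations` maps snp_id -> {sample_id: (ar, depth)}.
    """
    kept = {}
    for snp_id, per_sample in observations.items():
        # Drop observations below the depth cutoff.
        covered = {s: ar for s, (ar, depth) in per_sample.items()
                   if depth >= MIN_DEPTH}
        calls = {rough_call(ar) for ar in covered.values()}
        # A SNP homozygous for the same allele in every sample is non-informative.
        if calls and calls not in ({"Ref/Ref"}, {"Alt/Alt"}):
            kept[snp_id] = covered
    return kept


run = {
    "snp_1": {"S1": (0.001, 450), "S2": (0.002, 380)},  # all Ref/Ref
    "snp_2": {"S1": (0.000, 500), "S2": (0.510, 420)},  # genotypes differ
    "snp_3": {"S1": (0.998, 120), "S2": (0.490, 90)},   # below 200x
}
print(sorted(informative_snps(run)))  # ['snp_2']
```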

Detection of contamination: “worst-case scenario” screening test

The first step of identification of contamination consisted of a screening test for each sample of the run, based on estimation of the “worst-case scenario” (WCS) percentage of contamination. This screening test is independent of background noise and identifies samples possibly contaminated above a certain cutoff, defined as 1% in the present study.

Following the optimization step (see below), the WCS calculation was defined as:

WCS = max(r × 2; (1 – a) × 2); with r = median of the highest 2% of ARs of Ref/Ref SNPs and a = median of the lowest 2% of ARs of Alt/Alt SNPs.
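The formula can be sketched as follows. Three points are assumptions made for this illustration rather than details confirmed by the text: Ref/Ref and Alt/Alt SNPs are taken as those with AR <0.25 and >0.75, the "2%" tail is rounded up to at least one SNP, and the result is returned as a fraction rather than a percentage.

```python
import math
from statistics import median


def worst_case_scenario(ars, tail_percent=2):
    """WCS = max(2 * r, 2 * (1 - a)) for the ARs of one sample.

    r: median of the highest `tail_percent` % of Ref/Ref ARs
    a: median of the lowest `tail_percent` % of Alt/Alt ARs
    """
    ref = sorted(ar for ar in ars if ar < 0.25)
    alt = sorted(ar for ar in ars if ar > 0.75)

    r = 0.0  # no Ref/Ref SNP: no evidence on that side
    if ref:
        k = max(1, math.ceil(len(ref) * tail_percent / 100))
        r = median(ref[-k:])  # highest ARs among Ref/Ref SNPs

    a = 1.0  # no Alt/Alt SNP: no evidence on that side
    if alt:
        k = max(1, math.ceil(len(alt) * tail_percent / 100))
        a = median(alt[:k])   # lowest ARs among Alt/Alt SNPs

    return max(2 * r, 2 * (1 - a))


# Made-up sample: 100 Ref/Ref SNPs (two of them shifted) and 50 Alt/Alt SNPs.
sample_ars = [0.001] * 98 + [0.02, 0.03] + [0.999] * 49 + [0.99]
wcs = worst_case_scenario(sample_ars)
print(f"WCS = {wcs:.1%}, screening positive: {wcs >= 0.01}")
```

The factor 2 corresponds to the worst case in which the contaminant is heterozygous at the SNP, so that only half of its reads shift the AR.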

The main advantage of the WCS test is to rapidly rule out any contamination when it is negative. However, it has a low specificity and a positive WCS test must be confirmed by identification of the contaminant, as the worst scenario is never certain.

Identification of the contaminant

Contamination was suspected when the WCS percentage of contamination was ≥1%. The second step consisted of identification of the contaminant in order to confirm the contamination. This identification was based on the SNPs of the contaminated sample (i.e. its genotype) compared to the genotypes of the other samples of the run.

Only homozygous SNPs of the contaminated sample were used, as the AR values of heterozygous SNPs were too variable to allow reliable identification of the small variations corresponding to low-level contamination. Only SNPs with AR <0.25 or >0.75 were used, corresponding to homozygous SNPs (Ref/Ref or Alt/Alt), including contaminated (<0.25 or >0.75) and non-contaminated SNPs (<0.005 or >0.995, i.e. background noise).

To identify a putative contaminant, the percentage of SNPs compatible with contamination of one sample (A) by another sample (B) was calculated according to the number of homozygous SNPs satisfying the compatibility conditions listed in Table 1. In other words, for homozygous SNPs with expected AR values, the contaminant had to have the same genotype, while for SNPs with unexpected AR values, the contaminant had to have a different genotype. The suspected contaminant was therefore identified by its genotype. For each sample, the other samples from the same run were tested, scored and ranked as putative contaminants and the sample with the highest percentage was considered to be a putative contaminant.

Table 1 Genotype criteria for a contaminant (column 2) according to the genotype of the contaminated sample (column 1)

To determine whether the putative contaminant actually contaminated the sample under study, two conditions were then required. Firstly, the percentage of SNPs compatible with contamination of sample A by sample B had to be higher than the percentage of SNPs of sample A compatible with absence of contamination; otherwise sample B could simply be genetically similar to sample A. The percentage of SNPs compatible with absence of contamination was the percentage of SNPs with a normal AR (<0.005 or >0.995) among the total number of homozygous SNPs (i.e. with AR <0.25 or AR >0.75) (Supplementary Table 2). Secondly, the putative contaminant had to be significantly more compatible with the contaminated sample than the other samples of the run (Fisher’s exact test with Bonferroni correction for multiple testing, significance threshold 0.05).
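The compatibility score and the first condition can be sketched as below. Because the genotype criteria of Table 1 are only summarized in the text, the rule coded here is an assumption: same genotype required for expected ARs, any different genotype accepted for unexpected ARs. All function and sample names are hypothetical, and the second condition (Fisher’s exact test) is not implemented.

```python
def rough_call(ar: float) -> str:
    """Coarse genotype call from an AR, using the 0.25 and 0.75 limits."""
    if ar < 0.25:
        return "Ref/Ref"
    if ar > 0.75:
        return "Alt/Alt"
    return "Ref/Alt"


def is_expected(ar: float) -> bool:
    """True when a homozygous AR lies within background noise."""
    return ar < 0.005 or ar > 0.995


def compatibility(sample_ars: dict, candidate_ars: dict) -> float:
    """Fraction of the sample's homozygous SNPs compatible with
    contamination by the candidate (both dicts map snp_id -> AR)."""
    compatible = total = 0
    for snp, ar in sample_ars.items():
        if snp not in candidate_ars or 0.25 <= ar <= 0.75:
            continue  # heterozygous SNPs of the sample are not used
        total += 1
        same_genotype = rough_call(ar) == rough_call(candidate_ars[snp])
        # Expected AR: same genotype required; unexpected AR: different genotype.
        if same_genotype == is_expected(ar):
            compatible += 1
    return compatible / total if total else 0.0


def uncontaminated_fraction(sample_ars: dict) -> float:
    """Fraction of homozygous SNPs whose AR lies within background noise."""
    hom = [ar for ar in sample_ars.values() if ar < 0.25 or ar > 0.75]
    return sum(is_expected(ar) for ar in hom) / len(hom) if hom else 0.0


sample_a = {"s1": 0.001, "s2": 0.03, "s3": 0.97, "s4": 0.999, "s5": 0.5}
others = {
    "B": {"s1": 0.0, "s2": 0.5, "s3": 0.0, "s4": 1.0, "s5": 0.5},
    "C": {"s1": 0.5, "s2": 0.0, "s3": 1.0, "s4": 1.0, "s5": 0.0},
}
scores = {name: compatibility(sample_a, ars) for name, ars in others.items()}
best = max(scores, key=scores.get)
first_condition = scores[best] > uncontaminated_fraction(sample_a)
print(best, scores, first_condition)
```

In this made-up run, sample B explains every homozygous SNP of sample A, whereas half of these SNPs are compatible with absence of contamination, so B passes the first condition.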

Quantification of contamination

As the WCS contamination is only a rough, overestimated value, a refined percentage is calculated following identification of the contaminant and according to its genotype. The contamination percentage of a sample by its contaminant is expressed as the median of the values obtained for calculation of contamination rate for each SNP (Table 2).

Table 2 Calculation of the level of contamination by the contaminant
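Since the formulas of Table 2 are not reproduced here, the sketch below derives a per-SNP contamination rate from a simple mixing model and then takes the median, as described in the text. The per-SNP formulas are therefore an assumption and not necessarily those of Table 2. All names are illustrative.

```python
from statistics import median

# Fraction of alternative alleles carried by each genotype.
ALT_FRACTION = {"Ref/Ref": 0.0, "Ref/Alt": 0.5, "Alt/Alt": 1.0}


def rough_call(ar: float) -> str:
    """Coarse genotype call from an AR, using the 0.25 and 0.75 limits."""
    if ar < 0.25:
        return "Ref/Ref"
    if ar > 0.75:
        return "Alt/Alt"
    return "Ref/Alt"


def per_snp_rate(sample_ar: float, contaminant_gt: str):
    """Contamination rate c solving AR = (1 - c) * sample + c * contaminant.

    Returns None when both genotypes are identical (uninformative SNP).
    """
    sample_fraction = ALT_FRACTION[rough_call(sample_ar)]
    delta = ALT_FRACTION[contaminant_gt] - sample_fraction
    if delta == 0:
        return None
    return (sample_ar - sample_fraction) / delta


def contamination_level(sample_ars: dict, contaminant_ars: dict):
    """Median per-SNP rate over the homozygous SNPs of the sample."""
    rates = []
    for snp, ar in sample_ars.items():
        if snp not in contaminant_ars or 0.25 <= ar <= 0.75:
            continue
        rate = per_snp_rate(ar, rough_call(contaminant_ars[snp]))
        if rate is not None:
            rates.append(rate)
    return median(rates) if rates else None


sample = {"s1": 0.05, "s2": 0.10, "s3": 0.95, "s4": 0.94, "s5": 0.001}
contaminant = {"s1": 0.5, "s2": 1.0, "s3": 0.0, "s4": 0.48, "s5": 0.0}
level = contamination_level(sample, contaminant)
print(f"estimated contamination: {level:.1%}")
```

Under this model a Ref/Ref SNP contaminated by a heterozygous sample gives c = 2 × AR, and by an Alt/Alt sample gives c = AR; the Alt/Alt cases are symmetrical with 1 − AR.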

All samples were collected for diagnostic and genetic counselling purposes. Appropriate individual written consent for genetic analysis was obtained from all the participating patients or their legal guardians.

Availability

This tool is available at: https://sourceforge.net/projects/ngs-art-deco/

Results

Optimization step: dilution ranges

In order to test our strategy and optimize test parameters, two dilution ranges were prepared (dilution A and dilution B) of two DNA samples (A and B) with two other DNA samples (Contaminant_1 and Contaminant_2, respectively). These dilutions created 10 samples with contamination levels of 50%, 25%, 12.5%, 6.3% and 3.1%, respectively (Fig. 2). These contaminated samples were processed with the contaminant samples and 20 other samples to create a 32-well test sample plate (Supplementary Figure 3). This set was sequenced with another 32-well diagnostic sample plate processed separately to mimic routine diagnosis.

Fig. 2

Distribution of allelic ratio (AR) values of SNPs in contaminated DNA samples. AR values of the SNPs of the 10 contaminated DNA samples from the two dilution ranges according to the level of contamination (50%, 25%, 12.5%, 6.3%, 3.1%)

The results were used to define the SNPs to be included for contamination analysis, i.e. satisfying the quality criteria defined above and the WCS calculation.

The 10 contaminated samples were detected by the WCS contamination screening test with values higher than 1% (4.92–49.37%), whereas WCS contamination of the other 22 samples of the test plate was <1% (0.35–0.94%). The contaminants were always correctly identified and accurately quantified, as the expected contamination percentages were obtained (Table 3).

Table 3 Detection, identification and quantification of contamination on the optimization set

Routine analysis

This strategy was used for 2227 consecutive constitutional DNA samples sequenced in 36 runs in the context of routine diagnostic genetic testing. Five samples were excluded due to lack of coverage.

Among the 2222 diagnostic constitutional DNA samples tested, 2031 (91.4%) had a WCS less than 1%, excluding contamination.

Thirty-three of the 191 samples with a positive screening test had a contaminant identified in the run. These contaminations had WCS contamination estimates of 1.8–42.8% and actual contamination levels, quantified from their contaminants, of 1.3–29.2% (Table 4). For 29 of these 33 contaminated samples (87.9%), the contaminant was located in an adjacent lateral well, which was significantly higher than expected (1.7 cases expected (5.2%)) if the identified contaminant was randomly assigned among the other samples of the run (Fisher’s exact test; p < 10⁻¹¹) (Table 4, Fig. 2). The other four contaminated samples involved two DNA samples from the same run, but not in adjacent wells.

Table 4 Summary of contaminated diagnostic samples

Among the 158 other samples with a positive screening test, 154 samples had a low WCS contamination below 4%, and four had a very high WCS contamination (38.1%, 43.6%, 49.6% and 23.2%). Contamination of the 154 samples with low WCS could not be confirmed, most probably because this low level prevented unambiguous contaminant identification, although high background noise remains a plausible explanation (see Discussion). In contrast, background noise could not explain the high WCS contamination observed for the last four samples (Table 4). For two of these four samples, two different samples were identified by the same barcode: two samples were identified by barcode 65 in run No. 18 and two samples were identified by barcode 75 in run No. 19 (WCS of 38.1% and 43.6%, respectively) (see barcode correspondence in Supplementary Figure 3). The third sample (WCS of 49.6%) consisted of a mixture of two DNA samples that were supposed to be distributed to two successive wells but were actually distributed into the same well. Lastly, the fourth sample with high WCS (23.2%) corresponded to DNA extracted from the saliva of an allogeneic bone marrow transplant recipient. Contamination therefore reflected the mixture of lymphocyte DNA from the donor and DNA from the cells of the patient’s mouth.

Overall, 36 (1.6%) of the 2222 diagnostic constitutional DNA samples analyzed in this routine diagnostic setting were contaminated during the presequencing steps.

Discussion

Identification of contamination in NGS analysis is important to avoid erroneous diagnostic results, especially when mosaicism is suspected. In this study, we present an easy method to detect contamination in routine NGS constitutional genetic analysis. The screening test with quantification of the WCS percentage of contamination identified possibly contaminated samples above a defined cutoff. The contaminant was then identified to confirm and precisely quantify contamination. Interestingly, this method can be used for any constitutional NGS workflow and can be customized according to the user’s needs.

SNP selection

SNP selection is of utmost importance for successful implementation of this method. In order to avoid false-positives, poor quality SNPs must be excluded, as they frequently give unexpected ARs. It is the user’s responsibility to define a poor quality SNP for the panel, bearing in mind the consequences in terms of specificity.

The number of SNPs required to ensure satisfactory sensitivity depends on their allele frequency (AF) in the population. A SNP with an AF of 50% would be the most informative for the detection of contamination. We recommend including at least 30 SNPs, in linkage equilibrium, with an AF between 30% and 70%, ensuring a 99% probability of having at least 5 informative SNPs. Rare SNPs and SNPs in linkage disequilibrium included in the panel design should obviously be taken into account to consolidate predicted contaminations. For whole exome or genome sequencing, with a very high number of SNPs, analysis can be restricted to the best-covered positions by adjusting the minimum depth of coverage.
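The order of magnitude of this recommendation can be checked with a binomial calculation. The text does not define an informative SNP, so the sketch assumes a SNP is informative when the sample is homozygous and the contaminant carries a different genotype. It also assumes Hardy-Weinberg proportions, unrelated individuals and independent SNPs, so the authors' own calculation may differ.

```python
from math import comb


def p_informative(af: float) -> float:
    """Probability that one SNP is informative for a sample/contaminant pair:
    the sample is homozygous and the contaminant has another genotype."""
    hom_ref = (1.0 - af) ** 2
    hom_alt = af ** 2
    return hom_ref * (1.0 - hom_ref) + hom_alt * (1.0 - hom_alt)


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    below = sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k))
    return 1.0 - below


for af in (0.3, 0.5, 0.7):
    p = p_informative(af)
    print(f"AF={af:.1f}  P(informative)={p:.3f}  "
          f"P(at least 5 of 30)={prob_at_least(5, 30, p):.3f}")
```

Under these assumptions the probability is highest at an AF of 50% and is the same for AFs of 30% and 70%, which is consistent with the symmetry of the recommended range.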

WCS screening test

Quantification of WCS contamination was performed on all samples of a run. The WCS test is designed to provide the user with a value higher than the actual contamination value (hence the name “worst-case scenario”). The actual contamination value is calculated after identification of the contaminant (see “Localization and quantification of contaminants” below). A p-value could be calculated to highlight a so-called “significantly high” WCS. However, in the event of a highly contaminated run, high WCSs would not be significantly different from one another, with a p-value close to 1, leading the user to miss the contamination.

WCS calculation has two main advantages: firstly, it constitutes a rapid screening test with a customizable contamination cutoff; secondly, this screening test remains effective even in the absence of contaminant in the plate. These two aspects will be discussed successively.

A 1% cutoff was used in our experiments to demonstrate the performance of the method. However, in clinical practice, a 10% cutoff could be more compatible with the sensitivity of Sanger sequencing, as contamination below 10% would not be detected by Sanger sequencing [8]. In addition, index hopping in pooled libraries has been observed at rates of up to 6% [4] across various methods and Illumina sequencers. Such “index contaminations” might lead to low-level contamination predictions, with or without an identified contaminant. In any case, the user can select any critical cutoff depending on the objectives of the study.

This WCS calculation enables contamination to be detected even when the contaminant sample is not present in the run to confirm it. For example, the WCS screening test allowed the detection of the mixture of two different DNAs in the same well before barcoding and of the barcoding of two different DNAs with the same barcode. A particularly interesting example was a sample from a female patient with a high WCS percentage of contamination (23.22%) with no contaminant identified in the run and no experimental explanation. Surprisingly, an X-linked gene included in the panel showed that this DNA sample more closely resembled a male sample than a female sample. This sample corresponded to that of a woman who underwent allogeneic bone marrow transplant for acute lymphocytic leukaemia 12 years previously. Our method suggested that the donor was likely a man and that the tested DNA sample, extracted from saliva, was composed of a maximum of 23.22% of patient DNA and a minimum of 76.78% of donor DNA. This result was not surprising, as saliva is known to contain lymphocytes [9]. This finding highlights the importance of providing laboratories with relevant clinical information to ensure reliable interpretation of the results.

In 154 samples with a WCS between 1% and 4%, no contaminant could be identified. This could be explained by high background noise, by a contaminant level that was too low (possibly resulting from index hopping [4]), or by the absence of the contaminant from the plate. An AR background noise cutoff of 0.005 was used, so the theoretical lower limit of detection of contamination was 1%. However, in practice, because of the normal distribution of the heterozygous SNP AR, a low level of contamination is associated with a great number of SNPs with AR in the background noise, preventing confirmation of low levels of contamination.

Background noise determines the contamination detection cutoff, which is why detection depends on sequencing protocols used and must be adapted to the user’s specific needs.

Localization and quantification of contaminants

After localizing the contaminant, a refined contamination percentage was calculated, taking into account the genotype of the variants of the contaminant and the contaminated sample.

Thirty-three of the 2222 diagnostic constitutional DNA samples tested were contaminated by another sample on the plate: 6 (0.3%) presented clinically relevant contamination ≥10% and 27 (1.2%) presented contamination <10%, deemed to be negligible for constitutional genetic analysis.

As expected, most of the contaminants were located in the adjacent lateral well (87.9%), which is highly suggestive of projection of droplets during library preparation prior to the barcoding step. Many library protocols, including SureSelect QXT, comprise washes that require up-and-down pipetting in the wells of the plates, which sometimes generates droplets that fall onto the plate or into an adjacent well. An understanding of this most common mechanism of contamination is of utmost importance to ensure increased vigilance and optimized practices by the user. Automation of library preparation might reduce contamination. In any case, the effect of optimized practices can be monitored by measuring contamination rates over time, which should theoretically decrease.

Other methods of detection of contamination have already been described in the literature. However, most of these methods were developed for tumor analyses and can require supplemental SNP array data, e.g. the ContEst tool [10]. Alternatively, the Conpair tool [11] does not need SNP array data, but is based on tumor-constitutional pair analyses. Interestingly, Sehn et al. described a haplotype-based tool for tumor analyses [12], which should theoretically also be suitable for constitutional analyses, but with several design constraints: loci with SNPs in low-linkage disequilibrium are needed to ensure reliable contamination detection, as this tool was developed for the frequent rearrangements found in tumors. Our method is simpler with no such constraints, as rearrangements are not frequently found in constitutional analysis. Lastly, the methods described by Jun et al. and Flickinger et al. [13, 14] also detect contamination from sequence reads, but rely on larger amounts of data (at least 1000 SNPs) provided by massive sequencing such as genome-wide analysis or whole-exome sequencing. Even if such analyses are starting to be performed more routinely, gene panels such as hereditary cancer panels are still widely used for routine diagnostics.

As the proposed method is based on a standard gene panel commonly used in routine constitutional genetic testing, it constitutes a powerful and easy-to-use quality tool with educational benefits, as it also highlights the weaknesses of the process. We therefore believe it should be implemented in diagnostic pipelines as part of the accreditation process. Importantly, it can be used in any NGS workflow, from gene panel to genome-wide analyses.