Introduction

Noninvasive prenatal testing (NIPT) of cell-free fetal DNA (cffDNA) is commonly used in clinical diagnostics for Rh-D blood typing, paternity testing, and the detection of fetal chromosomal aneuploidies such as trisomy 13, trisomy 18, and trisomy 21 [1,2,3]. An earlier study discovered that a fraction of cffDNA is detectable in maternal peripheral blood, which initiated the clinical use of NIPT [4]. However, several technical limitations and a lack of robust analytical methods currently limit the broader application of NIPT.

Noninvasive fetal diagnosis relies on distinguishing or quantifying fetus-specific DNA within the mixed cell-free DNA (cfDNA) in maternal peripheral blood. The biological characteristics of circulating cffDNA have been investigated using polymerase chain reaction (PCR) [5,6,7,8]. In addition, cffDNA concentrations have been successfully quantified from maternal plasma cfDNA by exploiting DNA-methylation patterns [9], placenta-specific mRNA expression [10], Y-chromosome coverage [4], and single-nucleotide polymorphism (SNP) genotype information [11]. With improved technology and the falling cost of deeper sequencing, the scope of next-generation sequencing (NGS)-based NIPT of maternal plasma DNA is expanding to the identification of fetal copy-number variations (CNVs) and monogenic mutations [12,13,14,15,16,17,18,19,20,21].

Deep sequencing of maternal plasma DNA through NGS has recently been shown to be feasible for detecting CNVs as well as single-gene alterations. Lo et al. [18] developed an algorithm named “relative haplotype dosage analysis” (RHDO) that infers the local fetal haplotype from parental haplotypes through linkage analysis at sites where one parent is homozygous and the other is heterozygous. Kitzman et al. [21] determined the maternal haplotype through large fragment cloning and sequencing and detected paternal-specific alleles through a site-by-site strategy (SBSS). Different reads matching the same paternal-specific allele were counted as evidence of transition from A to B. Fan et al. [19] deduced fetal genomic information by counting the paternal-specific haplotypes following whole-genome sequencing of plasma cfDNA. The prediction of fetal genomic information was mainly based on the allelic imbalance principle.

To obtain parental haplotypes, Chen et al. [20] developed a combined strategy using trios and unrelated individuals. The fetal haplotype was noninvasively recovered using a hidden Markov model (HMM) and the Viterbi algorithm, with 95.37, 98.57, and 98.45% of the maternal autosomal alleles, paternal autosomal alleles, and chromosome X alleles, respectively, recovered in the fetal haplotype. However, all of these studies relied on haplotypes from both parents, which raises the cost and hinders clinical application.

Obtaining haplotype information using whole-genome sequencing remains expensive and requires complex analytical procedures. As demonstrated by Liao et al. [22], target region capture sequencing (TRCS) achieves deep sequencing of specific regions at reasonable cost. Here, we developed a novel algorithm, pseudo tetraploid genotyping (PTG), which directly deduces the combined genotype of mother and child. To confirm the results, we evaluated PTG through exome sequencing of fetal and maternal blood cells. Using our PTG algorithm, we deduced from maternal plasma the genotype of a miscarried fetus with osteogenesis imperfecta (OI), and a de novo mutation in COL1A1 was identified. These results suggest that our newly developed algorithm can correctly infer the fetal genotype from maternal plasma.

Materials and methods

Sample collection and DNA extraction

The pregnant mother was recruited with informed consent and the approval of the Ethics Committee of the Peking University People’s Hospital. The fetus was suspected to have OI according to echographic imaging, as shown in Fig. 1. After obtaining fully informed consent from the patient, we collected the mother’s peripheral blood as well as fetal umbilical cord blood through umbilical vein puncture at week 30.

Fig. 1

Real-time graphs of B-mode ultrasonography of the fetus at 26 weeks of gestation. A The yellow arrow indicates the skull region. Echo in this region is reduced, which suggests that the fetal skull is too thin. B The dashed line indicates the shortened bilateral femur, which was only 3.27 cm. C, D The curved tibia and fibula. Yellow arrows indicate the corresponding bone defects. Color figure online

Maternal peripheral blood (5 mL) was collected and centrifuged for 10 min at 4 °C at 1,600 × g. The blood cell portion was centrifuged again at 2,500 × g for 10 min and the plasma portion at 16,000 × g for 10 min. The blood cell and plasma samples were immediately stored at −80 °C until further processing [23].

Total genomic DNA from the maternal and fetal blood cells was isolated using the Amp Genomic DNA Kit (TIANGEN, China). For analysis, DNA was extracted from 1 mL of plasma using the QIAamp Blood Kit (Qiagen, Germany), yielding ~10 ng DNA for library preparation, as measured by Qubit 3.0 (Thermo Fisher Scientific).

Library preparation and DNA sequencing

The gDNA samples from maternal blood cells and fetal blood cells were fragmented using the Covaris S220 (Covaris, Woburn, MA, USA) according to the manufacturer's protocol. Fragmented gDNA and plasma DNA were then used to construct sequencing libraries. A standard protocol was followed, including blunt-ending, “A” tailing, and adapter ligation. A total of 17 cycles of PCR were performed to amplify the adapter-ligated DNA fragments. Sequencing-ready DNA templates were subsequently hybridized to the SeqCap EZ Probes oligo pool (Roche Nimblegen, USA), with a custom-designed 1-M target capture panel covering genetic hotspots of various known diseases. We evaluated the amplified libraries using the Agilent 2100 Bioanalyzer (Agilent, USA) and by quantitative PCR. The three libraries with different barcodes were pooled and sequenced (paired-end 2 × 100 bp) using an Illumina HiSeq 2500 sequencer (Illumina, USA) following standard instructions [24].

To eliminate low-quality reads, reads containing more than 5% N bases, or in which at least 50% of the bases had a quality score of 30 or lower, were filtered out of the raw data with our in-house scripts, and sequencing adaptors were removed. Using the Burrows-Wheeler alignment (BWA) tool [25], the remaining reads were aligned to the human genome reference sequence (Hg19, NCBI build 37) with fixed parameters (aln -o 1 -e 63 -i 15 -L -l 31 -k 1 -t 6). To reduce the bias of PCR amplification, alignment files were further processed with SAMtools [26].
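The read-level filter described above can be sketched as follows. This is an illustration only, not the in-house script used in the study: the function name `passes_quality_filter` and its arguments are hypothetical, and base qualities are assumed to be already decoded to integer Phred scores.

```python
def passes_quality_filter(seq, quals, max_n_frac=0.05, min_qual=30, max_low_frac=0.5):
    """Return True if a read survives both filters described in the text.

    seq   -- read sequence as a string
    quals -- per-base Phred scores as integers (assumed already decoded)
    """
    if not seq or len(seq) != len(quals):
        return False
    # Fraction of ambiguous bases in the read
    n_frac = seq.upper().count("N") / len(seq)
    # Fraction of bases whose quality is "not larger than 30"
    low_frac = sum(q <= min_qual for q in quals) / len(quals)
    # Discard reads with more than 5% N, or with at least 50% low-quality bases
    return n_frac <= max_n_frac and low_frac < max_low_frac
```

Adapter trimming and duplicate handling are separate steps and are not covered by this sketch.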

Pseudo tetraploid genotyping

To deduce the fetal genotype from the maternal plasma, we developed a novel algorithm, pseudo tetraploid genotyping (PTG), which jointly considers the combined genotype of the mother and the fetus. The probability of such a pseudo tetraploid genotype can be calculated by multiplying the probabilities of the maternal and fetal genotypes (two different genotypes, or one genotype with itself). However, out of the nine possible combinations, only seven (“AAAA”, “AAAB”, “ABAA”, “ABAB”, “ABBB”, “BBAB”, and “BBBB”) are possible for a mother–child genotype mixture recovered from the maternal plasma used for NIPT. Here, “A” stands for the reference allele and “B” for the most frequent alternative allele. The first two letters of each four-letter set represent the maternal genotype, while the last two letters represent the fetal genotype. Among the seven combinations, only the four in which the maternal and fetal genotypes differ (“AAAB”, “ABAA”, “ABBB”, and “BBAB”) can theoretically be used to calculate the relative concentration of cffDNA, also known as the fetal concentration (FC).
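The count of seven admissible combinations, and the four that carry information about FC, can be reproduced with a short enumeration. The sketch below assumes biallelic sites and that the fetus inherits one allele from the mother; the names `is_consistent`, `VALID`, and `INFORMATIVE` are illustrative and not taken from the PTG implementation.

```python
from itertools import product

GENOTYPES = ("AA", "AB", "BB")


def is_consistent(mother, fetus):
    """The fetus must carry at least one allele that the mother can transmit."""
    return any(allele in mother for allele in fetus)


# Four-letter codes: maternal genotype first, fetal genotype second
VALID = [m + f for m, f in product(GENOTYPES, repeat=2) if is_consistent(m, f)]

# Only combinations where mother and fetus differ shift the allele ratio with FC
INFORMATIVE = [g for g in VALID if g[:2] != g[2:]]

print(VALID)        # the seven admissible mother-fetus combinations
print(INFORMATIVE)  # the four combinations usable for FC estimation
```

The two excluded combinations are “AABB” and “BBAA”, in which the fetus would carry no maternal allele.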

The algorithm borrows information from the 1000 Genomes Project to empirically learn the population-wide minor allele frequency (MAF), θ, and the prior probabilities of the three genotypes (AA, AB, BB) for any given point mutation.

$$P({\rm{AA}}) = (1 - \theta )^2;P({\rm{AB}}) = \theta (1 - \theta );P({\rm{BB}}) = \theta ^2.$$
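As a minimal sketch, the priors can be evaluated for a given θ and multiplied to obtain the prior of a pseudo tetraploid genotype. The code follows the expressions exactly as printed above; note that under strict Hardy-Weinberg proportions the heterozygote term would be 2θ(1 − θ), so the three values as printed do not sum to one. Function names are illustrative, and treating the maternal and fetal priors as independent is a simplifying assumption of this sketch.

```python
def genotype_priors(theta):
    """Prior probabilities of AA, AB, BB for minor allele frequency theta (as printed)."""
    return {
        "AA": (1 - theta) ** 2,
        "AB": theta * (1 - theta),
        "BB": theta ** 2,
    }


def pseudo_tetraploid_prior(combo, theta):
    """Prior of a four-letter mother+fetus code, e.g. 'AAAB' (independence assumed)."""
    priors = genotype_priors(theta)
    return priors[combo[:2]] * priors[combo[2:]]


theta = 0.1  # hypothetical MAF for illustration
print(genotype_priors(theta))
print(pseudo_tetraploid_prior("AAAB", theta))
```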

We considered the following relationship between FC and the observed MAF and used an expectation maximization (EM) algorithm to obtain the optimal FC estimate and the most likely pseudo tetraploid genotype at each mutation point. We hypothesized that the probability density of the observed alternative allele coverage at a given site j follows a negative binomial distribution, \(NB_{gj}(C_{aj} + C_{bj},f_g(FC))\), with the size parameter equal to the total coverage at that site and the probability parameter given by a function of FC for each of the seven pseudo tetraploid genotypes,

$$\begin{array}{c}f_{\rm{aaaa}} = \varepsilon ,\quad f_{\rm{aaab}} = \frac{{\rm{FC}}}{2},\quad f_{\rm{abaa}} = 0.5 - \frac{{\rm{FC}}}{2},\\ f_{\rm{abab}} = 0.5,\quad f_{\rm{abbb}} = 0.5 + \frac{{\rm{FC}}}{2},\quad f_{\rm{bbab}} = 1 - \frac{{\rm{FC}}}{2},\quad f_{\rm{bbbb}} = 1 - \varepsilon \end{array}$$

where ε is a numeric constant close to zero to represent an infinitesimal effect. The pseudo tetraploid genotype \({\mathrm{G}}_{j}\) with the highest density is assigned to that site. With the updated genotypes at all positions, a new estimation of the FC can be made using the four mother–child genotype combinations (“AAAB”, “ABAA”, “ABBB”, and “BBAB”).

$$\begin{array}{l}{\rm{FC}}_{\rm{aaab}} = 2 \times \mathop{\rm{median}}\limits_{{\rm{G}}_{j} = {\rm{aaab}}}\left( \frac{C_{bj}}{C_{bj} + C_{aj}} \right),\\ {\rm{FC}}_{\rm{abaa}} = 1 - 2 \times \mathop{\rm{median}}\limits_{{\rm{G}}_{j} = {\rm{abaa}}}\left( \frac{C_{bj}}{C_{bj} + C_{aj}} \right),\\ {\rm{FC}}_{\rm{abbb}} = 1 - 2 \times \mathop{\rm{median}}\limits_{{\rm{G}}_{j} = {\rm{abbb}}}\left( \frac{C_{aj}}{C_{bj} + C_{aj}} \right),\\ {\rm{FC}}_{\rm{bbab}} = 2 \times \mathop{\rm{median}}\limits_{{\rm{G}}_{j} = {\rm{bbab}}}\left( \frac{C_{aj}}{C_{bj} + C_{aj}} \right)\end{array}$$

The EM process stops when the estimated FC converges to a constant or when a maximum number of iterations is reached. The pseudo tetraploid genotype assigned to each site in the final iteration represents the mother–child genotype combination at that site.
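The alternation between genotype assignment and FC re-estimation can be sketched as below. This is a simplified illustration under several stated assumptions rather than the PTG implementation: a plain binomial likelihood stands in for the density described above, genotype priors are omitted, ε is set to an arbitrary small value, the four per-genotype FC estimates are pooled by their mean, and the maternal-heterozygous fractions are written as 0.5 ∓ FC/2, the form consistent with the FC update FC_abaa = 1 − 2 × median(·). Every site is assumed to have non-zero coverage, and all function names are hypothetical.

```python
import math
from statistics import mean, median

EPS = 1e-3  # assumed value for the small constant epsilon


def expected_b_fraction(fc, eps=EPS):
    """Expected alternative-allele (B) fraction for each mother+fetus genotype."""
    return {
        "AAAA": eps, "AAAB": fc / 2, "ABAA": 0.5 - fc / 2, "ABAB": 0.5,
        "ABBB": 0.5 + fc / 2, "BBAB": 1 - fc / 2, "BBBB": 1 - eps,
    }


def log_binom(k, n, p):
    """Binomial log-likelihood of k B-reads out of n reads (stand-in density)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def call_sites(sites, fc):
    """Assign each (c_a, c_b) site the genotype with the highest likelihood."""
    f = expected_b_fraction(fc)
    return [max(f, key=lambda g: log_binom(cb, ca + cb, f[g])) for ca, cb in sites]


def update_fc(sites, calls):
    """Per-genotype median FC estimates, pooled by their mean (an assumption)."""
    frac = {"AAAB": [], "ABAA": [], "ABBB": [], "BBAB": []}
    for (ca, cb), g in zip(sites, calls):
        if g in frac:
            # B fraction for AAAB/ABAA, A fraction for ABBB/BBAB
            count = cb if g in ("AAAB", "ABAA") else ca
            frac[g].append(count / (ca + cb))
    estimates = []
    for g, values in frac.items():
        if values:
            m = median(values)
            estimates.append(2 * m if g in ("AAAB", "BBAB") else 1 - 2 * m)
    return mean(estimates) if estimates else None


def estimate_fc(sites, fc=0.1, tol=1e-4, max_iter=50):
    """Alternate genotype calling and FC updates until FC stabilises."""
    calls = call_sites(sites, fc)
    for _ in range(max_iter):
        new_fc = update_fc(sites, calls)
        if new_fc is None:  # no informative sites were called
            break
        new_fc = min(max(new_fc, 0.01), 0.99)  # keep probabilities inside (0, 1)
        converged = abs(new_fc - fc) < tol
        fc = new_fc
        calls = call_sites(sites, fc)
        if converged:
            break
    return fc, calls
```

Calling `estimate_fc` on a list of `(C_a, C_b)` coverage pairs returns the final FC estimate together with one four-letter genotype code per site. The starting value, tolerance, and iteration cap are arbitrary choices for this sketch.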

Sanger sequencing

To validate the genotype of the potentially pathogenic candidate detected by NGS via PTG, a 534-bp target amplicon was designed using two primers, 7237F (5′-CCCAACCTAGAGCAGTGGAC-3′) and 7771R (5′-CTCCCCCAGGTAGTGGAAAC-3′). The PCR system was set up as follows: 25 µL of total reaction volume, 20 ng of genomic DNA, 2 µL of dNTP (10 mM), 0.3 µL of each primer (10 pmol), 2.5 µL of 10 × reaction buffer, 0.3 µL of DNA Taq polymerase (5 U/µL), and 18.6 µL of PCR-grade H2O. The PCR reaction was programmed as follows: 95 °C for 2 min; 30 cycles of 95 °C for 10 s, annealing at 60 °C for 30 s, and extension at 70 °C for 1 min; 70 °C for 10 min; and hold at 4 °C. The reaction was performed using a Veriti Thermal Cycler (Thermo Fisher Scientific, USA). PCR products were tested using 1% agarose gel electrophoresis and purified using a QIAquick PCR Purification Kit (Qiagen, Germany). The amplicon was sequenced using a 3730xl DNA analyzer (Thermo Fisher Scientific, USA).

Results

Case description

The patient is a 28-year-old female with a history of three miscarriages. She scheduled a series of prenatal examinations during the 2nd trimester, including a TORCH test at the 13th week, nuchal translucency (NT) test at week 14, serum screen at week 17, and a B-mode echography at week 26 of gestational age (Table 1).

Table 1 Obstetric examinations of the affected gestation in chronological order

B-mode echography of the fetus demonstrated developmental defects and structural anomalies of several fetal bones, including a thin skull, curved tibia and fibula, and shortened bilateral femur (Fig. 1), suggesting a high possibility of OI or thanatophoric dysplasia. Following confirmation through prenatal diagnostic testing at 30 weeks of gestation, the mother elected to undergo induced termination. After termination, the male fetus exhibited clear signs of OI, including fractured tibia and fibula of the bilateral lower limbs, as well as a softened skull. The karyotype of the fetus was 46, XY with no visible chromosomal abnormalities (Supplementary Figure 1).

Pseudo tetraploid genotyping

Using the maternal plasma sample, we mapped ~98 M clean reads to the target regions, resulting in 97.26% target coverage and a mean sequencing depth of 118×. In addition, we mapped 69 M clean reads to the flanking regions, with 95.60% coverage and a mean sequencing depth of 59×. After joint genotype calling with PTG, 111,407 loci were detected (Fig. 2B), and the final fetal fraction was estimated to be 11.59701%. Given that both biological parents are phenotypically healthy, the disease-related locus most likely carries one of two maternal–fetal combined genotypes, AAAB or ABBB (the first two letters for the mother and the last two for the fetus), depending on whether the disease-causing variant is a de novo autosomal dominant mutation or an autosomal recessive mutation. Accordingly, 24,416 loci with genotype AAAB or ABBB were selected for further analysis. We annotated these 24,416 candidate SNPs with ANNOVAR [27] and filtered them by mutant allele, variant location, dbSNP annotation [28], population frequency, and predicted variant impact. Only seven SNPs remained after filtering. Finally, one candidate mutation located in the protein-coding region of COL1A1 (c.2596G>A, p.Gly866Ser) was considered the most likely causal mutation for the symptoms. COL1A1 has been frequently reported as a genetic cause of OI [29, 30].

Fig. 2

Workflow of the pseudo tetraploid genotyping (PTG) algorithm. A (1) Sequencing of the maternal plasma and joint genotype calling with PTG; (2) Sequencing of maternal blood cells and fetal blood cells from umbilical venous blood and individual genotype calling with GATK; (3) Sanger sequencing of fetal blood cell DNA for validation. B 111,407 variants were jointly called by PTG on the maternal plasma sample. After genotype selection of “AAAB” or “ABBB”, 24,416 loci remained. After extensive filtering, 7 SNPs were identified. The filtering rules were as follows: (1) Located in UTRs, exon regions, or splicing regions. (2) Excluding mutations present in dbSNP (Build135) that are not flagged as clinically associated in dbSNP (Build137). (3) Population frequency from the 1000 Genomes Project < 0.01; AVSIFT < 0.05; SIFT > 0.9; PhyloP > 0.95; PolyPhen2 > 0.15

PTG evaluation

To evaluate the overall accuracy of PTG, we used the combined genotype assembled from the GATK-called variants of maternal and fetal blood cells as the truth set [31]. The accuracy rate was defined as the ratio of matched genotypes to the total number of GATK-derived genotypes. In comparison, the detection rate was defined as the ratio of matched genotypes to the total number of PTG-derived genotypes. In our specific OI case, “AAAB” and “ABBB” are the two possible combinations of pathogenic genotypes. Taking “AAAB” as an example, the combined genotypes of 10,157 loci are consistent between GATK and PTG, whereas the combined individual GATK calls yielded 11,965 loci and PTG jointly identified 15,811 loci for this mother–fetus combination. Therefore, “AAAB” has an accuracy rate of 84.89% (10,157/11,965) and a detection rate of 64.24% (10,157/15,811) (Fig. 3A). Similarly, “ABBB” has an accuracy rate of 54.77% and a detection rate of 63.02%. A detailed summary of all seven genotypes is shown in Fig. 3B.
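The two rates for “AAAB” follow directly from the three counts given above. The short check below reproduces the arithmetic; the function name `ptg_rates` is illustrative.

```python
def ptg_rates(matched, gatk_total, ptg_total):
    """Return (accuracy rate, detection rate) as defined in the text.

    accuracy  = matched genotypes / GATK-derived genotypes
    detection = matched genotypes / PTG-derived genotypes
    """
    return matched / gatk_total, matched / ptg_total


# Counts reported for the "AAAB" combination
accuracy, detection = ptg_rates(matched=10157, gatk_total=11965, ptg_total=15811)
print(f"accuracy rate:  {accuracy:.2%}")
print(f"detection rate: {detection:.2%}")
```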

Fig. 3

Evaluation of the pseudo tetraploid genotyping (PTG) algorithm. A Detection rate and accuracy rate of “AAAB” and “ABBB”. For “AAAB”, we identified 10,157 matched genotypes, 11,965 combined individual calls made with GATK on the two blood cell samples, and 15,811 joint calls of the plasma sample with PTG. The accuracy rate was 84.89% and the detection rate was 64.24%. Similarly, the accuracy rate for “ABBB” was 54.77% and the detection rate was 63.02%. B The detailed summary of the detection rate and accuracy rate of all seven genotypes. * The two pathogenic candidate genotypes combinations considering a healthy parents–affected child scenario

Verification of COL1A1 mutation by Sanger sequencing

To confirm the mutation in COL1A1 (c.2596G>A, p.Gly866Ser), we performed Sanger sequencing using parental and fetal blood cells (Fig. 4). We amplified the genomic region including c.G2596A using primers 7237F and 7771R, yielding a 534-bp amplicon in the three samples. However, only the fetal blood cell sample harbored the 5′-G/A-3′ mutation (forward direction, 7237F) or the 3′-C/T-5′ mutation (reverse direction, 7771R) (Fig. 4). The mutation in COL1A1 is therefore a de novo mutation not present in the genome of either parent. Together, the Sanger sequencing results confirmed that the de novo fetal mutation was correctly identified by PTG.

Fig. 4

Mutation validation by Sanger sequencing. COL1A1 (c.2596G>A, p.Gly866Ser) was validated by Sanger sequencing in fetal and parental blood cells. A When sequencing with the forward primer, the G/A mutation was detected only in the fetal blood cells. B When sequencing with the reverse primer, the C/T mutation was also identified. Green, red, blue, and black curves represent A, T, C, and G, respectively. Color figure online

Discussion

In the current work, we developed an intuitive method (PTG) to noninvasively detect fetal single-gene defects using only maternal plasma. Using this method, we examined the possibility of noninvasive prenatal diagnosis of monogenic diseases.

The overall genotyping performance of PTG was evaluated using genotype calls made individually with maternal and fetal blood cells. When looking at different genotype combinations separately, the algorithm clearly performs better when the maternal allele is homozygous, in which case an overall accuracy rate of 84.89% was achieved, whereas in the cases of maternal heterozygous groups lower accuracy of 54.77% was obtained. Thus, our proposed method performs well for cases of autosomal dominant diseases inherited from the father or pathogenic mutation which occurred de novo.

On the other hand, a drawback is expected in the case of autosomal recessive diseases, where the lower genotyping accuracy makes it difficult to differentiate a carrier fetus (ABAB) from an affected fetus (ABBB) if the mother is a heterozygous carrier. Since the method assumes a strict 1:1 ratio for the allele coverage at a heterozygous site, any significant deviation from an MAF of 0.5 is attributed to the contribution of a homozygous fetal genotype at the same locus. This issue becomes even more problematic when the fetal fraction is low, because the expected deviation from 0.5 becomes even smaller and the true signal becomes difficult to distinguish from random variation around 0.5.
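A rough calculation illustrates why this case is hard. The sketch below assumes that a homozygous fetus shifts the allele fraction at a maternally heterozygous site by FC/2 (the form implied by the FC update equations) and that read counts follow simple binomial sampling around 0.5. Both are simplifying assumptions, and the function names are hypothetical.

```python
import math


def het_shift_z(fc, depth):
    """Expected shift from 0.5, in units of binomial standard deviations."""
    shift = fc / 2                # expected deviation of the allele fraction
    sd = math.sqrt(0.25 / depth)  # binomial SD of an allele fraction at p = 0.5
    return shift / sd


def depth_for_z(fc, z):
    """Smallest depth at which the expected shift reaches z standard deviations."""
    return math.ceil((z / fc) ** 2)


fc = 0.1159701  # fetal fraction estimated in this study
print(het_shift_z(fc, depth=118))  # signal-to-noise at the mean target depth
print(depth_for_z(fc, z=3))        # depth needed for a 3-SD separation
```

With the values reported here (FC of about 11.6% and a mean depth of 118×), the expected shift is only a little more than one standard deviation of the sampling noise, which is in line with the lower accuracy observed for maternal heterozygous sites.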

Such unexpected deviation from equilibrium is commonly observed in short-read sequencing and is likely due to selection bias during library preparation as well as insufficient sequencing depth. Although not directly related to cfDNA sequencing, a known problem commonly referred to as allele dropout (ADO) occurs in single-cell genomics [32, 33]. There, extreme ADO can result in complete failure to amplify one of the two alleles at a heterozygous locus. ADO can affect up to 40% of amplifications in extreme circumstances and lead to misdiagnosis in preimplantation genetic diagnosis (PGD) of single-gene disorders [34]. A report by Zhong et al. [34] also indicated that allele dropout due to amplification bias can result in a low detection yield for SNVs. In addition, DNA damage could contribute to such imbalance [35, 36]. A report by Chen et al. showed that mutagenic damage is the dominant factor in the erroneous identification of low-frequency variants (1 to 5%). Such damage-induced sequencing errors were found in widely used databases, such as the 1000 Genomes Project [35].

Osteogenesis imperfecta (OI) is a genetically heterogeneous group of connective tissue disorders characterized by increased bone fragility, low bone mass, short stature, and other connective tissue manifestations [37, 38]. OI was originally classified into four types according to clinical and histological manifestations, of which the perinatally lethal form (type II) is the most severe. The OI classification has since been expanded to 15 types based on the gene affected and the severity of the OI phenotype. Previous studies have shown that the primary cause of OI is mutations in the COL1A1 and COL1A2 genes, which encode the procollagen type I α1 and α2 chains, respectively [39,40,41,42,43]. In addition, one-third of the mutations that cause glycine substitutions in COL1A1 result in the lethal subtype of OI [37, 44]. Here, we report a de novo variant of the COL1A1 gene (c.2596G>A, p.Gly866Ser), which we consider the potential cause of the symptoms observed in the fetus. This de novo variant, together with the complete genotype–phenotype report, provides yet another unique case for future clinical research into the pathogenesis of OI.

The discovery of fetal cell-free DNA in maternal plasma led to the invention and development of noninvasive prenatal testing. Furthermore, noninvasive diagnosis of single-gene disorders has been reported for several monogenic diseases in clinical research [45,46,47,48,49,50,51,52]. Nevertheless, to our knowledge, detection of de novo OI mutations in the COL1A1 gene by noninvasive prenatal diagnosis (NIPD) has not been reported previously. Our computational method should assist noninvasive diagnostic procedures in identifying novel mutations without sequencing the parents, and should enable earlier detection of similar de novo pathogenic mutations compared with current echography-based screening. Importantly, our proposed method is cost-efficient, as it requires only a single round of plasma sequencing. In comparison, other approaches rely on haplotyping [16, 20], whereby haplotype blocks must be constructed locally around the known pathogenic mutation derived from the proband. In addition, expanding the panel with target regions evenly distributed along the genome should make it possible to unify standard NIPT for common fetal aneuploidies with noninvasive screening for monogenic disease.