PKD1 Duplicated regions limit clinical Utility of Whole Exome Sequencing for Genetic Diagnosis of Autosomal Dominant Polycystic Kidney Disease

Autosomal dominant polycystic kidney disease (ADPKD) is an inherited monogenic renal disease characterised by the accumulation of clusters of fluid-filled cysts in the kidneys and is caused by mutations in PKD1 or PKD2 genes. ADPKD genetic diagnosis is complicated by PKD1 pseudogenes located proximal to the original gene with a high degree of homology. The next generation sequencing (NGS) technology including whole exome sequencing (WES) and whole genome sequencing (WGS), is becoming more affordable and its use in the detection of ADPKD mutations for diagnostic and research purposes more widespread. However, how well does NGS technology compare with the Gold standard (Sanger sequencing) in the detection of ADPKD mutations? Is a question that remains to be answered. We have evaluated the efficacy of WES, WGS and targeted enrichment methodologies in detecting ADPKD mutations in the PKD1 and PKD2 genes in patients who were clinically evaluated by ultrasonography and renal function tests. Our results showed that WES detected PKD1 mutations in ADPKD patients with 50% sensitivity, as the reading depth and sequencing quality were low in the duplicated regions of PKD1 (exons 1–32) compared with those of WGS and target enrichment arrays. Our investigation highlights major limitations of WES in ADPKD genetic diagnosis. Enhancing reading depth, quality and sensitivity of WES in the PKD1 duplicated regions (exons 1–32) is crucial for its potential diagnostic or research applications.

(2019) 9:4141 | https://doi.org/10.1038/s41598-019-40761-w www.nature.com/scientificreports www.nature.com/scientificreports/ causing a frameshift after position 12655 (c.12627_12655dup; Phe4219Cysfs). In family 2, we detected a novel missense mutation in exon 41 caused by substitution of thymidine to cytosine at position 11524 that results in tryptophan substitution with arginine at position 3842 of the protein (c11524T > C; p.Trp3842Arg). The mutation showed a score of 1 using Polyphen-2 and 0 using SIFT, which suggests pathological significance. In family 3, we detected a novel PKD1 mutation in exon 41 caused by duplication of the cytosine at position 11863, resulting in a frameshift at position 3955 that leads to a premature stop codon because of the substituted amino acid at  www.nature.com/scientificreports www.nature.com/scientificreports/ position 6 (c.11863dupC, p.Gln3955Profs*6). As this mutation is predicted to cause a frameshift, we define it as pathogenic. In family 4, a novel deletion of 9 bases was detected in exon 40 of PKD1 (c.11339_11347del9), resulting in the deletion of 3 amino acids from the protein (p.Asp3780_D3782del). In family 5, we detected a two-base deletion mutation in exon 15 (c.5014_5015delAG). This deletion caused a frameshift from position 1672, leading to a premature stop codon at position 98 after the substitution (p.Arg1672fs98*). This particular mutation is the most frequent mutation detected in ADPKD families 9,15 . In family 6, a known nonsense mutation was detected in exon 15. This truncating mutation resulted from a single base change (c.6727 C > T), causing a premature stop codon p.Gln2243* 10,28 . Identification of PKD1 and PKD2 variants using WES. WES of the samples from the 26 ADPKD patients (known to have ADPKD before study enrolment) identified 27 variants in PKD1 (including 6 novel variants) and 5 variants in PKD2. Of the 27 PKD1 variants, 12 were synonymous changes, 12 were non-synonymous changes and the remaining 3 were insertions/deletions. For PKD2 variants, only 1 synonymous and 4 non-synonymous changes were identified ( Table 2). Five of the 6 ADPKD mutations identified using Sanger sequencing were identified by WES, whereas the remaining mutations (c.5014_5015delAG, found in family 5) could not be detected in the 8 patients included in WES analysis from this family (Tables 1 and 2). In addition, the mutation found in family 6 (16:2158441-SNV, c.6727 C > T) was detected only in one patient out of 6 included in the WES analysis from this family. While 4 of the 5 ADPKD mutations detected using WES were frameshift insertions and deletions, the remaining reported mutation was a missense mutation (16:2141795-SNV, c.11524 T > C; www.nature.com/scientificreports www.nature.com/scientificreports/ p.Trp3842Arg), and its pathological impact was evaluated and confirmed using the pathological prediction and conservation scores (Supplementary Tables 1,2).
Overall, the sensitivity of WES for detecting PKD1 mutations throughout the length of the gene, which reflects the ability of the test to correctly identify the true disease-causing mutations, was 50%; however, for mutations in exons 1-32, it was only 7.14% and for exons 33-46, it was 100%. As there were no false-positive results recorded, the specificity of WES for detection of PKD1 mutations was 100% for the entire length of the gene. As all the ADPKD mutations in enrolled patients were found in PKD1, sensitivity and specificity for PKD2 could not be calculated ( Table 3).
WES of exons 1-32 of PKD1 showed low reading depth, genotype quality and quality. All variants detected in exons 1-32 of PKD1 showed low mean reading depth (RD) ranging from 2 to 5 (Figs 2 and 3). Despite a low RD of 2, the mutation found in family 1 (16:2158441-SNV) was called and detected, but only in one of the 6 patients from this family (Tables 2 and 4). We also calculated the total number of reads per PKD1 exon in each sample, and all samples showed a similar trend (Fig. 4). Exons 1-32 had a noticeably lower total number of reads in comparison to exons 33-46. Mean Genotype Quality (GQ) and mean quality were also low for variants called in exons 1-32 in comparison to variants called in exons 33-46. However, the mean Mapping Quality (MQ) was roughly the same for all variants called across the whole PKD1 gene. When the quality was normalised by depth, called variants in exons 1-32 showed a relatively similar range of scores to those variants called in exons 33-46 ( Fig. 2).
For PKD2, all exons were properly covered, and all called variants showed a mean RD > 10, mean GQ was >60, mean MQ was close to 60 and mean quality ranged between 179.42 and 1576.08. Exon 3 showed the lowest total number of reads in all samples (Figs 2, 4 and 5).
WGs detected PKD1 mutations located in the duplicated region. DNA samples of 4 ADPKD patients were selected for WGS analysis: 2 patients with PKD1 mutations in the duplicated region of PKD1 and 2 patients with mutations outside the duplicated region of PKD1 (mutations were previously identified using LR-PCR). The coverage analysis of WGS for PKD1 and PKD2 showed good and uniform coverage, including exons 1-32 of PKD1 (Figs 3 and 4). Three out of 4 mutations were successfully detected using WGS, including the 2 mutations located in the PKD1 duplicated region (exon 15), c.5014_5015delAG and c.6727 C > T, whereas c.12627_12655dup29 (located in exon 46) was missed (Table 5).

Targeted enrichment system detected mutations in PKD1 duplicated regions. Samples with
PKD1 mutations from families 5 (c.5014_5015delAG) and 6 (c.6727 C > T) both located in exon 15, which were found using LR-PCR and Sanger sequencing but could not be detected with WES, were reanalysed using targeted enrichment arrays. These specific mutations were identified in all the samples analysed using targeted enrichment of PKD1. Coverage maps of targeted enrichment of PKD1 and PKD2 showed good coverage and depth in the coding regions of both genes (Figs 3 and 5). The newer version of SureSelect capture arrays shows significant improvement in the representation of exons 15-32 of the PKD1 gene (Fig. 3). However, capture of exons 1-14 remains poor.

Discussion
Mutations in either PKD1 or PKD2 may cause ADPKD. While genetic analysis of PKD2 is relatively easy, genetic analysis of PKD1 is more complex because of its highly polymorphic nature, large size and pseudogene regions. LR-PCR and Sanger sequencing are the current gold standard methods for genetic analysis of PKD1. However, these methods are labour intensive and require a substantial amount of time to analyse the large number of amplicons. In the current study, we evaluated the efficiency of WES, WGS and target enrichment as genetic diagnostic tools for ADPKD and as potential replacement methods for the LR-PCR and Sanger sequencing techniques.
ADPKD is diagnosed using ultrasonographic age-related cyst number criteria 29 . It has also been shown that computed tomography and magnetic resonance imaging may be effectively utilised for the same purpose 30 . These image-based diagnostic approaches are highly reliable in older patients (aged >30 years) but not in younger adult patients, which limits their utility in kidney donations to exclude disease status in young kidney donors 11 . Genetic analysis of ADPKD genes provides an alternative reliable diagnosis tool for younger patients. In the current study, these methods demonstrated high sensitivity and specificity, as they detected all the pathogenic mutations that segregated with the disease in all patients.
Recent advances in genetic sequencing, including NGS platforms such as WES, are considered revolutionising tools for genetic diagnostic research. The ability of WES to generate accurate, efficient, fast and cost-effective genetic data is important when considering implementation in medical practice to improve diagnosis and disease treatments in general. For ADPKD, WES is considered an effective tool to explore and identify potential disease modifier genes, particularly where our current knowledge of these modifiers is poor [31][32][33] . In this study, we assessed the effectiveness of utilising WES as a diagnostic tool for ADPKD and whether it can replace the current gold  www.nature.com/scientificreports www.nature.com/scientificreports/ standard methods for ADPKD genetic diagnosis: LR-PCR and Sanger sequencing. To achieve this objective, we avoided filtering the called variants and analysed the quality scores for all variants called in PKD1 and PKD2. WES showed major limitations when used to detect mutations in PKD1 exons 1-32, as test sensitivity for this particular region was only 7.14%, whereas it demonstrated high sensitivity for the remaining coding regions of PKD1. This was also reflected by the low RDs observed, particularly over PKD1 exons 1-32 (Figs 2 and 3). The low RD concurred with the significantly lower number of total reads per exon in this region compared to the PKD1 exons 33-46 and PKD2 (Fig. 5). Moreover, the mean GQ for the called variants in PKD1 exons 1-32 was correlated with the read depth pattern, as they scored low in comparison with the variants called elsewhere in PKD1 and PKD2. GQ indicates the likelihood of the called genotype being correct; the higher the value, the more accurate the genotype calls. Normally, a GQ > 20 is considered acceptable 33 ; however, in our WES results, the mean GQ for called variants in PKD1 exons 1-32 ranged from 6 to 72 (Table 4). Despite the low GQ score of 6 and a RD of only 2 for the called mutation found in family 1 (16:2158441-SNV, c.6727 C > T), the WES results were confirmed by LR-PCR and Sanger sequencing, indicating its validity. However, the mean MQ scores across PKD1 and PKD2 were high in general, indicating that all the reads were informative and correctly mapped. Although there is no clear cut off MQ value 34 , a score > 20 is usually acceptable according to the GATK tool 35 (Table 4). These poor results may be attributed to the PKD1 pseudogenes, as designing capture oligonucleotide probes (65 www.nature.com/scientificreports www.nature.com/scientificreports/ nt) that specifically target PKD1 genuine regions and avoid its pseudo-regions poses a considerable challenge. As a result, pseudogenes are captured in parallel with the genuine PKD1 regions, complicating data analysis, compromising result reliability and often yielding false-positive results, as mutations in pseudogenes are detected, resulting in reduced test specificity 21,27,36 . However, it is possible that readings mapped to multiple genome sites tend to be avoided, as demonstrated by our results. In general, the low coverage and RD of the PKD1 exons 1-32 could represent a major diagnostic limitation, as some mutations may be missed and consequently compromise the sensitivity and efficacy of the test. In both WES and WGS, mutation detection sensitivity can be improved by increasing the depth of sequencing and by the utilization of newer capture arrays. The newer version of SureSelect V6 capture arrays shows significant improvement in the representation of the exons 15-32 of the PKD1 gene (Fig. 3). For example, SureSelect V6 arrays covers all exons with a target size of 60 MB and 758,086 probes, while V4 of SureSelect capture arrays have a target size of 51 MB and covers 334,378 exons.
WGS showed a more uniform representation of entire PKD1 and PKD2, including exons 1-32 of PKD1. It has been suggested that longer read lengths and avoidance of capture bias enhanced the ability of WGS to detect pathological ADPKD mutations, including those in the duplicated region of PKD1 37,38 . However, detecting small to medium size indels (10-1000 bases) remains a challenge (Table 5). We are currently utilizing three algorithms in parallel; Pindel, UnifiedGenotyper and HaplotypeCaller to call indels 39 .
Target enrichment of PKD1 resulted in successful detection of mutations located in the duplicated regions (c.6727 C > T and c.5014_5015delAG, both located in exon 15), which could not be detected efficiently using WES. This coincides with the good coverage and RD produced using target enrichment over all coding regions of PKD1, including the problematic exons 1-32 and PKD2. One of the key differences between WES and target enrichment is gDNA library preparation and enrichment. For WES, sequence enrichment is achieved through www.nature.com/scientificreports www.nature.com/scientificreports/ an oligonucleotide probe-based capture strategy, whereas a PCR-based method is used for target enrichment [40][41][42][43] . While the oligonucleotide probe-based capture strategy for larger DNA regions is preferred over PCR-based methods for time and cost reasons, its capture efficiency is between 70-80%, as exons with high GC or AT content have reduced hybridisation and amplification efficiency 19,44 . In addition, a common limitation for capture-based enrichment methods is the presence of pseudogenes, which results in complications in variant calling and data interpretation of the targeted regions 19 . However, PCR enrichment methods are more efficient and reliable when analysing genes with pseudo-regions such as PKD1, as they have shown high specificity, sensitivity and reproducibility 45 . In addition, the ability to modify the design of probes in the PCR method provides another advantage, as this allows specific targeting of the functional gene rather than the pseudogenes, avoiding diagnostic limitations, including false negative results caused by low coverage of PKD1 duplicated regions (as seen in WES) and false-positive results caused by the inability to avoid pseudogenes and mutations in these regions. Target enrichment of PKD1 appears to overcome the limitations of WES and perhaps is more suitable, now, for ADPKD diagnostic applications.
In conclusion, the ability to effectively implement WES in current medical practices will improve care of patients by enhancing disease diagnosis and treatment planning. WES provides a rapid diagnostic tool for many genetic diseases and disorders, as it allows identification of common and novel genetic variants that may then be evaluated for their pathological impact. As for ADPKD, although the WES platform successfully identified novel PKD1 mutations, it showed low sensitivity and RD of the duplicated regions, which represent a challenge for effective and reliable genetic diagnosis of ADPKD. WES enrichment strategies must be improved to solve the low sensitivity problem in the PKD1 duplicated regions. Such enhancements would allow more rapid and accurate genetic analysis of ADPKD patients, which in turn would contribute to better disease management and improve our understanding of the molecular pathology underlying the disease.   www.nature.com/scientificreports www.nature.com/scientificreports/ Methods Patient inclusion criteria. Six families with a history of ADPKD were selected from the Nephrology unit database at Mubarak Al-Kabeer Hospital, Kuwait for inclusion in the current study. ADPKD patients showed typical clinical presentations of ADPKD, including multiple renal cysts and impaired kidney function. Total 51 individuals from 6 families with typical ADPKD, including 26 ADPKD patients and 25 at-risk individuals were enrolled in the current study, which was reviewed and approved by the Joint Committee for The Protection of Human Subjects in Research of the Health Sciences Center (HSC) at Kuwait University and the Kuwait Institute for Medical Specialization (KIMS) (Reference: VDR/JC/690). Written informed consent was obtained from all patients before their enrollment in the study. All methods were performed in accordance with the guidelines of the joint HSC and KIMS ethical committee.

Clinical evaluation.
Clinical evaluation of all 51 subjects from the 6 families was performed. Individuals were evaluated using ultrasonographic analysis and renal function tests to confirm their disease status, including those showing negative results in mutation screening. Healthy individuals were later used as negative controls for comparative analysis with ADPKD patients.
Abdominal ultrasound. All subjects enrolled in the current study except those who had undergone kidney transplants were instructed to fast for 6 h prior to abdominal ultrasound examinations, which were performed using a Logic 7 GE ultrasound with a curvilinear 3.5 MHZ probe. Focused ultrasound was performed to assess the kidneys, liver, pancreas and ovaries in female subjects. Each kidney was assessed in multiple views. Renal cysts in each kidney were examined and counted for diagnostic purposes according to the unified criteria for ultrasonographic diagnosis of ADPKD 22 . Each total kidney volume (TKV) was calculated using the ellipsoid formula: volume = length × lateral diameter × anterior-posterior diameter × π/6. TKV was calculated automatically by the machine in cubic centimetres (cc) and adjusted for height (htTKV expressed as cc/m) 46 . Each patient's liver and pancreas were also screened for the presence or absence of cysts.
Renal function test. From each subject, 5-mL blood samples were taken and used to conduct renal function test (RFT), which were performed at the Mubarak Al-Kabeer Hospital, Jabriya, Kuwait. Serum creatinine levels www.nature.com/scientificreports www.nature.com/scientificreports/ were determined for each patient and expressed as µ mol/L. Estimated glomerular filtration rates (eGFR) were calculated using the CKD-EPI creatinine equation (2009) developed by Levey et al. 47 . Calculations were performed using the GFR calculator provided on The National Kidney Foundation website www.kidney.org.
The RFT was not performed for ADPKD patients who reached ESRD, were on dialysis or had undergone a kidney transplant. For these patients, only abdominal ultrasound and genetic testing were performed to confirm ADPKD diagnosis, and their eGFR and htTKV were not included in the analysis. DNA isolation. From each subject, 10-mL blood samples were collected at the Nephrology Department of the Mubarak Al-Kabeer Hospital, Jabriya, Kuwait and genomic DNA was isolated from blood samples using a Gentra Puregene Blood Kit (Qiagen, 158467) following the manufacturer's protocol.
Mutation screening and classification of variants. Long-Range PCR. Mutations were screened in the genomic DNA of all subjects using locus-specific amplification and direct sequencing of exonic and flanking intronic regions of PKD1 and PKD2 11 . Segregation was tested in family members using sequence analysis of the relevant genomic fragments. The significance of missense variants was assessed using the ADPKD Mutation Database (http://pkdb.mayo.edu), multi-sequence alignments and substitution assessment tools (SIFT, PolyPhen2 and Align GVGD), as previously described 11,20 Novel frame-shifting insertions and deletions were defined as pathogenic.   www.nature.com/scientificreports www.nature.com/scientificreports/ Whole exome sequencing. DNA samples from 26 ADPKD patients from the 6 families were prepared and enriched using TruSeq v3, SureSlect v4 or v6 following the manufacturer's protocol. Agilent's QPCR NGS Library Quantification Kit (G4880A) was used to determine the DNA concentration of each library prepared. Enriched samples were pooled at a final concentration of 10 nM. Exome sequencing was performed using the Illumina HiSeq2000 platform. For mapping and alignment, read files (Fastq) were generated and obtained from the HiSeq2000 platform using the manufacturer's proprietary software. Burrows-Wheeler Aligner package version 0.6.1 48 was used to locate reads in the most recent map of the human genome (hg38/GRCh38). To ensure a minimum number of mismatching bases across the reads obtained, which in turn reduces false-positive SNP calls (indels), we utilised Genome Analysis Tool Kit (GATK) version 1.6 35 , which locally realigned mapped reads around potential insertion/deletion (indel) sites. Picard version 1.62 was used to label duplicate reads so as to remove those likely resulting from PCR bias. Generated BAM files were further manipulated using Samtools version 0.1.18 49 . To improve the quality of variants calls, GATK's covariance recalibration was used to recalibrate base quality (Phred scale) scores. GATK Unified Genotyper was used to call SNP and indel variants in each sample 50 . Variant novelty was determined using dbSNP.
For variant calling and analysis, we used Golden Helix SNP & Variation Suite version 8.3.4 for Win64. We performed multiple sample variant calling to reduce calling sequencing errors and to enhance the accuracy and sensitivity of calling [51][52][53] . Filtered and unfiltered VCF files of all samples were uploaded and Homo sapiens (Human), GRCh37 g 1k (Feb2009) was used as the default genome assembly. PKD1 (NM_001009944.2) and PKD2 (NM_000297.3) variant annotations and analyses were performed using the dbNSFP Functional Predictions and Scores 2.9, GHI database [54][55][56] , which provided pathological prediction scores from prediction algorithms, including SIFT, Polyphen2, LRT, FATHMM and MetaLR; and conservation scores from phyloP100way_vertebrate, phastCons100way vertebrate, GERP++ and SiPhy and other function annotations, which were used to assess pathological impact. All non-synonymous and insertion/deletion variants were tested and validated in healthy and affected members of the ADPKD families using Sanger sequencing.
No filters were applied when calling variants to assess the quality of called variants in PKD1 and PKD2. The quality was assessed by analysing the reading depth (RD), mapping quality (MQ), genotype quality (GQ), phred scale quality and quality by depth for each called variant. When the same variant was called in more than one sample, the mean of the quality measure was obtained and the standard deviation (STD) was calculated.
The PKD1 and PKD2 regions captured using WES were obtained by plotting BAM files against RefSeq genes 105v2, NCBI to highlight the gene regions. In addition, the total reads per exon in both genes were calculated for each sample and presented graphically.
Whole genome sequencing and analysis. DNA was extracted from blood samples using Qiagen Kits according to the manufacturer's protocol. DNA quality and concentration were determined, and then 1 ug of the DNA was used for WGS using the Illumina TruSeq DNA sample preparation guide to obtain a final library of 300-400 bp average insert size. Covaris systems, which produces dsDNA fragments with 3′ and 5′ overhangs, was used to fragment genomic DNA that was then converted to have blunt ends using an end repair mix; the 3′ overhangs were removed using 3′ to 5′ exonuclease, and the 5′ overhangs were filled by the polymerase. The library size was selected using different ratios of the sample purification beads. Ligation was prevented by adding a single adenine nucleotide to the 3′ end of the blunted fragments, and the corresponding thymine nucleotide provided a complementary overhang on the 3′ end of the adapter. Multiple indexing adapters were ligated to both ends of the fragments in preparation for hybridisation in the Illumina flow cell (Illumina Hiseq2500). Isaac aligner software was used to align NGS sequencing data 57 . Resulting VCF files were analysed using golden helix software for the WES data.
Targeted enrichment arrays. Four samples with mutations in exons 1-32 of PKD1 that were detected by LR-PCR but not WES were selected for this analysis. HaloPlex Custom kits (Agilent Genomics) were used to design panels covering PKD1 and PKD2 coding exons, such as 5′ UTR and 3′ UTR. The design was made for the Illumina platform using RefSeq, Ensembl, CCDS, Gencode, VEGA and SNP databases, and the designed panels were used according to the manufacturer's protocol (G9900-90020). Generated BAM and VCF files were analysed using Golden Helix SNP & Variation Suite version 8.3.4 for Win64. Homo sapiens (Human), GRCh37 g 1k (Feb2009) was used as the default genome assembly. We eliminated all filters for the analysis to evaluate the quality of all called variants, which was assessed by analysing the RD, MQ, GQ, phred scale quality and quality by depth for each called variant.

Statistical analysis.
To test the hypothesis that the slope for predicting htTKV levels from subject age is greater for ADPKD patients than for healthy subjects, linear regression analyses were performed on htTKV-age data from ADPKD patients and healthy subjects. This was followed by a comparison of the resulting regression slopes using Student's t-test to evaluate the difference between the slopes. Similar analyses were also performed on the eGFR-age data.
Kaplan-Meier renal survival analysis. Survival times were calculated as the time of onset of ESRD, and a Kaplan-Meier product-limit survival curve was constructed using MATLAB and Statistics Toolbox, Release 2012b (The MathWorks, Inc., Natick, MA, USA). Twenty-three patients had already reached ESRD by the time of the analysis (some of the ESRD patients were part of the 6 families enrolled in the current study, but were excluded from the genetic testing as they died prior to screening). The median survival time was calculated as the smallest survival time for which the estimated probability of renal survival was ≤0.5. The mean survival time was calculated as the area under the Kaplan-Meier survival curve 58 .