Clinical diagnosis of genetic disorders at both single-nucleotide and chromosomal levels based on BGISEQ-500 platform

Most variations in the human genome are single-nucleotide variations (SNVs), small insertions and deletions, or genomic copy number variations (CNVs). Many human diseases, including genetic disorders, are associated with variations in the genome. These disorders are often difficult to diagnose because of their complex clinical presentations; therefore, an effective detection method is needed to facilitate clinical diagnosis and prevent birth defects. With the development of high-throughput sequencing technology, targeted sequence capture chips have been used extensively owing to their high throughput, high accuracy, fast speed, and low cost. In this study, we designed a chip that can capture the coding regions of 3043 genes associated with 4013 monogenic diseases, together with 148 chromosomal abnormalities that can be identified by targeting specific regions. To assess its efficiency, we combined the BGISEQ-500 sequencing platform with the designed chip to screen for variants in 63 patients. In total, 67 disease-associated variants were found, 31 of which were novel. The evaluation results also show that this combined strategy meets the requirements of clinical testing and has clinical application value.


INTRODUCTION
Monogenic inherited diseases usually involve multiple disciplines and complex clinical symptoms. They are difficult to diagnose precisely by conventional clinical tests because of their underlying molecular mechanisms, and most of them are fatal, disabling, or teratogenic 1. Traditional testing techniques carry a greater risk of false-negative diagnosis and misdiagnosis; as a result, clinicians may miss the critical window for treating patients. In comparison, genetic testing enables early detection, early intervention, and early treatment of single-gene genetic diseases. Large-scale discovery of novel genes and validation of monogenic diseases can be quickly implemented and widely applied clinically. People with a family history of genetic disorders can undergo pre-marital, pre-pregnancy, and prenatal genetic screening 2,3 to avoid birth defects. Therefore, genetic testing is important for clinical diagnosis and prevention of birth defects.
Next-generation sequencing technology has been widely used in detecting genetic disease. The major sequencing approaches are targeted region sequencing, whole exome sequencing, whole genome sequencing, and mitochondrial DNA sequencing. However, whole genome and exome sequencing are not only costly and time consuming, but also make it challenging to screen for specific disease-causing variants across a large span of genomic regions 4. The combination of regional capture and high-throughput sequencing technology can effectively capture disease-associated regions and quickly locate disease-causing variants. With its high throughput, low cost 5, high speed, and high accuracy, high-throughput sequencing technology is widely used in clinical practice 6,7 for genetic disease detection and carrier screening 8. However, most currently available genetic testing products detect a limited range of diseases and have a compromised detection rate 9. Moreover, besides monogenic variants, recent studies have found that chromosome microdeletions or microduplications are important causes of developmental delay and intellectual disability 10. Therefore, a highly efficient and sensitive screening method that can detect all types of variants is urgently needed to allow one-step detection of a variety of monogenic genetic diseases and common chromosomal abnormalities.
Therefore, this study used BGISEQ-500 as the sequencing platform and developed a chip that focuses on coding regions with known associations with genetic diseases. Variants that affect gene function are detected more cost-effectively than with whole genome sequencing or whole exome sequencing. Currently, 4013 known single-gene diseases can be detected (Table 1). In addition, we can detect 148 common chromosomal abnormalities by targeting specific regions (Table 2). Compared with traditional gene detection methods, the combined strategy integrates known single-gene diseases with common chromosomal abnormalities, and therefore achieves a "one-step" solution to detecting genetic variants. The improved detection rate of diseases, along with the benefits of high throughput, high accuracy, fast speed, and low cost, indicates that this combined strategy is a powerful tool for clinical diagnosis and prenatal prevention of birth defects.

MATERIALS AND METHODS
Sample information
A total of 100 samples were gathered for this study. Because we designed the chip to capture almost all disease-causing genes (Table 1), samples were collected from hospital patients who agreed to participate in this study and are essentially unbiased. To assess the stability of the chip, we selected two samples, S77 and S78, for inter-batch and intra-batch stability evaluation. In addition, samples S79, S80, S81, and S82 were selected to evaluate the coverage and depth of the target area on the BGISEQ-500 platform. 86 patients were selected from the clinical 0.76% CDS regions of S81 with 17.45% coverage, and 0.48% CDS regions of S82 with 12.59% coverage, respectively (Fig. 1). When the sequencing depth reached 30× or more, the coverage was greatly improved. Between 30× and 100× sequencing depths, the coverage of CDS regions of S79, S80, S81, and S82 was 93.84%, 92.29%, 92.75%, and 91.81%, respectively. When the depth was greater than 100×, the coverage of all samples exceeded 99%. A total of 45,527 CDS regions were analyzed, of which 43,192 reached 100% coverage; the remaining 2335 did not achieve 100% coverage, but their coverage increased with depth. Among the 45,527 captured regions, 154 CDSs had zero coverage regardless of read depth. Of these 154 CDSs, CDS1, the first coding DNA sequence, accounts for 87%. The GC content of human genes is known to decrease gradually from the 5′ untranslated region (UTR) to the 3′ UTR 20. Because CDS1 is adjacent to the 5′ UTR, the higher GC content of the 5′ UTR may have impaired capture of CDS1. Based on these results, we recommend that, on the BGISEQ-500 sequencing platform, the average sequencing depth of samples captured with the customized chip of this study should reach 100× or more after duplicate removal.
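The idea of coverage at a given depth threshold can be illustrated with a minimal sketch. It assumes that per-base depths for one CDS region are already available as a list of integers; the function name `coverage_at_depth` and the toy depth values are hypothetical and are not taken from the study's pipeline or data.

```python
from typing import Sequence


def coverage_at_depth(depths: Sequence[int], min_depth: int) -> float:
    """Return the fraction of bases in a region whose depth is >= min_depth."""
    if not depths:
        raise ValueError("region has no bases")
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


# Hypothetical per-base depths for one short CDS region (not study data).
toy_region = [0, 12, 35, 48, 110, 150, 150, 140, 95, 20]

for threshold in (1, 30, 100):
    fraction = coverage_at_depth(toy_region, threshold)
    print(f"fraction of bases at >= {threshold}x: {fraction:.2f}")
```

A region with zero coverage at every threshold, like the 154 CDSs described above, would return 0.0 for any `min_depth` of 1 or more.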

Inter-batch and intra-batch stability assessment
In this project, samples S77 and S78 were sequenced in three batches to evaluate stability among batches, and each sample was sequenced three times to evaluate stability within a batch. We used the parameter -out_mode EMIT_ALL_SITES to output detection information for all loci in the capture region. We then analyzed the genotypic consistency of loci across different batches of the same sample and across repeats of the same sample within a batch. In this experiment, we defined stability as the proportion of sites identified in all three technical replicates. For batch-to-batch stability, the total number of loci for sample S77 was 9,903,792 and the intersection of the three batches was 9,881,645, giving a stability of 99.78% (Fig. 2a); for sample S78, the total number of loci was 9,874,160 and the intersection of the three batches was 9,852,762, giving a stability of 99.78% (Fig. 2b). For intra-batch stability, the total number of loci in sample S77 was 9,904,450 and the intersection of the three replicates in the same batch was 9,882,238, giving a stability of 99.78% (Fig. 2c); for sample S78, the total number of loci was 9,877,841 and the intersection of the three replicates in the same batch was 9,857,175, giving a stability of 99.79% (Fig. 2d). These data indicate that the customized chip is highly stable both among and within batches on the BGISEQ-500 sequencing platform. To evaluate the accuracy of this technique, we compared the SNPs of YH cell line samples tested using targeted NGS with the genotyping results obtained using Illumina's Human Zhonghua-8 BeadChips (SNP array). We selected the loci common to the SNP array and the chip designed in this experiment for accuracy analysis. A total of 3664 SNPs were detected in the YH cell line, and 99.54% (3647/3664) of the genotypes at the selected loci were consistent with the SNP array results, demonstrating the high accuracy of this method.
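The stability and concordance percentages follow directly from the reported counts. The sketch below recomputes them as the number of shared loci divided by the total number of loci; the function name `stability` is a hypothetical helper, and only the counts come from the text.

```python
def stability(shared_loci: int, total_loci: int) -> float:
    """Percentage of loci that are consistent across all three replicates."""
    if total_loci <= 0:
        raise ValueError("total_loci must be positive")
    return 100.0 * shared_loci / total_loci


# (shared loci, total loci) as reported in the text.
reported = {
    "S77 inter-batch": (9_881_645, 9_903_792),
    "S78 inter-batch": (9_852_762, 9_874_160),
    "S77 intra-batch": (9_882_238, 9_904_450),
    "S78 intra-batch": (9_857_175, 9_877_841),
    "YH SNP array concordance": (3_647, 3_664),
}

for label, (shared, total) in reported.items():
    print(f"{label}: {stability(shared, total):.2f}%")
```

The same ratio is used for the SNP array comparison, where the numerator is the number of concordant genotypes rather than the size of a three-way intersection.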

Variant information in clinical samples
Using targeted next-generation sequencing (NGS), we obtained high-quality sequences for 86 samples. Variant information was obtained after reference sequence alignment and variant detection. In this study, 67 disease-related variants were identified in 52 patients, including 49 missense variants, 8 frameshift variants, 5 splicing variants, 3 intragenic deletions or duplications, and 2 whole-gene deletions. Of the 67 variants, 36 have been reported previously and 31 are reported here for the first time. Table S1 summarizes the disease-related variant information for the 52 samples.

Chromosome abnormality detection
This study used CNVkit software to detect chromosomal abnormalities. The software detects CNVs using a read-depth method. Therefore, in addition to the original depth, 10,000,000 reads and 20,000,000 reads were randomly extracted to simulate different sequencing depths for detection of CNV copy number and breakpoint position. Even at the original depth of 613×, one region remained undetected. This region was chr7:69,783,279-69,952,448, with a segment length of 169.17 kb, and it was not detected at any of the three depths, namely 613× (original depth), 140× (20,000,000 reads), and 70× (10,000,000 reads). We therefore speculate that the resolution of the customized chip is insufficient to detect a deletion or duplication of about 200 kb. In addition, the recommended detection resolution of CNVkit software is 1 Mb, and all deletions and duplications above 1 Mb were detected. The results also confirmed that the number of missed regions increases as depth decreases, so adequate sequencing depth is recommended to reduce the rate of missed detection. Table 3 shows details of the CNV results for the samples.
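The arithmetic behind this evaluation can be sketched in a few lines. The example below is an illustration of the general read-depth idea only and is not CNVkit's implementation: the helper names (`segment_length_kb`, `downsample_reads`, `log2_depth_ratio`) and the toy read set are hypothetical, and only the chr7 coordinates come from the text.

```python
import math
import random
from typing import List, Sequence


def segment_length_kb(start: int, end: int) -> float:
    """Interval length in kilobases, computed as end minus start."""
    return (end - start) / 1000.0


def downsample_reads(read_ids: Sequence[int], n_reads: int, seed: int = 0) -> List[int]:
    """Randomly draw n_reads reads without replacement to mimic a shallower run."""
    rng = random.Random(seed)
    return rng.sample(read_ids, n_reads)


def log2_depth_ratio(sample_depth: float, reference_depth: float) -> float:
    """Log2 ratio of sample depth to reference depth for one region."""
    return math.log2(sample_depth / reference_depth)


# Region reported as undetected in the text.
print(f"{segment_length_kb(69_783_279, 69_952_448):.2f} kb")

# Toy read set (hypothetical): keep 20 of 100 reads.
subset = downsample_reads(list(range(100)), 20)
print(len(subset))

# Idealised heterozygous deletion: half the reference depth.
print(log2_depth_ratio(50.0, 100.0))
```

In this idealised picture a heterozygous deletion gives a log2 ratio of -1; with fewer reads the per-region depth estimate becomes noisier, which is one plausible reason why more regions are missed at lower depth.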

Protein structure prediction and stability results
We performed protein modeling analysis on all genes with variants classified as of uncertain significance, of which only six genes were modeled completely with the mutant amino acid included in the modeled sequence (Table 4). The six genes were ANLN, CNGB1, UMOD, DSTYK, UNC45B, and COL4A3. In the structure of ANLN, Asp1021 is located at the carboxy terminus of the Anillin protein and belongs to the PH (pleckstrin homology) domain, which is necessary for all targeted events 21. The PH domain is a 120-amino-acid protein module that is thought to interact with lipids to mediate protein recruitment to the plasma membrane, and studies have shown that the PH domain is electrostatically polarized 22. To examine how the p.D1021V variant affects protein structure, we compared the structures of the wild type and the mutant and found that the conformation was essentially unchanged. In addition, the Gibbs free energy calculated by FoldX also indicates that the variant does not affect the stability of the protein.
For the CNGB1 variant p.M974R, the UMOD variant p.V550I, and the COL4A3 variant p.A1555V, FoldX calculated changes in Gibbs free energy (ΔG) of 4.07063 kcal/mol, 4.01864 kcal/mol, and 2.46126 kcal/mol, respectively. This indicates that these variants affect the stability of the protein.

DISCUSSION
The study of monogenic hereditary diseases is a typical field of precision medicine. The complex clinical symptoms of monogenic diseases make diagnosis difficult, and most of the pathogenic mechanisms are unclear. Owing to the lack of effective treatments, these diseases are often fatal, disabling, or teratogenic. Diseases such as intellectual disability and growth retardation are often caused by chromosomal abnormalities in addition to the single-gene variants responsible for monogenic genetic diseases. Therefore, an effective detection method that can detect both monogenic variants and chromosome aberrations is urgently needed to facilitate clinical diagnosis and prevention of birth defects. This study designed a chip that can detect up to 4013 single-gene diseases. Compared with previous panel designs 6,7, we included more genes related to Mendelian diseases when designing the chip to improve the diagnosis rate. In addition, the chip can identify 148 common chromosomal disorders by targeting the key genes as well as random, non-critical genes in the abnormal chromosomal regions. In this study, we used the MGIEasy Exome Capture V5 Probe to bridge the cost gap between the panel and WES. At an average depth of 200×, the cost of the panel is approximately 1700 RMB, while the cost of WES is approximately 2300 RMB. The primary reason for the cost difference between the two tests is the expense of sequencing. Because the panel generates a modest amount of data, the time and personnel costs of bioinformatics processing and interpretation further widen the cost difference between the two tests, although we did not quantify them in this study. Because the panel generates less data, the cost of data storage is also reduced; once the sample size reaches a certain threshold, storage can become rather costly. This project uses a strategy that combines the BGISEQ-500 sequencing platform with the chip. Given its low cost and the evaluation results, this combination has potential for clinical testing and carrier screening applications.
Sequencing analysis is effective for the diagnosis of rare genetic diseases, but the balance between effectiveness and cost-effectiveness of comprehensive analyses such as whole genome sequencing and whole exome sequencing remains controversial. Target capture analysis enriches genes or regions of interest and balances cost and effectiveness. The chip designed in this study encompasses the majority of currently known disease-causing genes and can be considered a clinical-grade whole exome. The panel can target disease-related regions of the human genome more effectively and, more importantly, achieve higher sequencing coverage when targeting a group of genes associated with a particular disease phenotype. In this study, CDS coverage reached 99.66% when sequencing depth exceeded 100×, and coverage increased as sequencing depth increased.
Nevertheless, a high-resolution assessment of various WES datasets reveals unequal coverage along the length of exons 23. Studies show that regions with inadequate WES coverage account for around 10% of all CDS regions 24. We also analyzed the coverage of genes recommended by the American College of Medical Genetics and Genomics (ACMG) for pathogenic variant detection and clinical reporting 25. Among the 59 genes analyzed, APOB CDS1, DSC2 CDS1, PRKAG2 CDS5, RET CDS1, and TGFBR1 CDS1 had no coverage regardless of how much the sequencing depth was increased (Table S1). According to additional research 26, six genes, namely KCNH2, KCNQ1, SDHD, TNNI3, VHL, and WT1, have low-coverage regions in one or more samples. These results imply that low-coverage regions inside functionally significant genes may influence variant detection and subsequent clinical diagnosis. Moreover, with the same amount of data, the chip can obtain higher-depth sequencing data than WES, which is advantageous for detecting structural variation at the exon level; certain diseases, particularly neurological diseases such as DMD, can be caused by structural variation at the exon level. The clinical application of WGS is still limited at this time for two reasons: first, the interpretation of non-coding regions is extremely limited and relies on scientific research, and second, the cost is prohibitive for the subject. Taking the cost-performance ratio into account, this chip, which contains nearly all genes with distinct molecular mechanisms, continues to be an excellent option. Diseases such as McCune-Albright syndrome are caused by variants in early embryonic somatic cells. Conventional WES analysis, particularly in the clinical setting, may not detect somatic variants. However, this chip has some remaining limitations. In the era of clinical genomics, where reverse phenotyping has become commonplace 27, WES can provide early diagnosis and drive treatment options. WES is selected to expedite potential diagnoses and reduce the costs associated with multiple tests. Overall, the panel lacks two advantages offered by WES: a larger number of candidate genes and the ability to reevaluate data on a regular basis.
For the 86 clinical cases, we first identified candidate pathogenic genes from the list of 4013 diseases based on clinical diagnosis and used targeted NGS to find pathogenic variants in the candidate genes. If a variant was indeterminate based on the information analysis and database annotations, we plotted the reads and aligned them to the reference sequence at the variant site at single-base resolution. If the variant remained unresolved, Sanger sequencing or real-time PCR was performed. In some cases, however, the pathogenic variant was not in a candidate gene. We then searched for candidate variants in other genes in the target region and inferred the disease in reverse.
In this study, we performed homology modeling on some proteins in the hope of explaining how the variants change protein structure. Sample S32, from a 7-year-old patient, shows clinical manifestations of hematuria and C3 glomerulopathy. A heterozygous missense variant, c.3062A>T (p.D1021V), was detected in the coding region of the ANLN gene (NM_018685.4) in this sample. ANLN variants can cause focal segmental glomerulosclerosis type 8 (OMIM#: 616032), an autosomal dominant disorder whose main clinical manifestations are glomerular segmental sclerosis, proteinuria, decreased glomerular filtration rate, and progressive decline in renal function. Both SIFT and PolyPhen-2 predict the variant to be deleterious. No frequency information for c.3062A>T was found in the dbSNP database, the HapMap database, the 1000 Genomes database, or the local database, and there is no documented pathogenicity. In the structure of ANLN, the variant p.D1021V is located in the PH (pleckstrin homology) domain. Anillin is an actin-binding protein involved in cytokinesis. It interacts with GTP-bound Rho proteins and inhibits their GTPase activity. The PH domain has multiple functions, but it is generally involved in targeting the protein to an appropriate cellular location or interacting with a binding partner. The PH domain is electrostatically polarized; aspartic acid is charged and polar and is often involved in forming protein active sites or binding sites, whereas valine is a non-polar amino acid. Comparing the wild-type and mutant conformations revealed no changes, but there were some differences in the hydrophobic surface. We speculate that the variant affects the electrostatic polarity of the PH domain, resulting in a change in protein function. Therefore, we speculate that ANLN c.3062A>T is a disease-causing variant in the subject.
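The correspondence between the coding change c.3062A>T and the protein change p.D1021V can be checked with a short worked example. The sketch assumes the standard genetic code and 1-based coding positions; the reference codon itself is not given in the text, so both aspartate codons (GAT and GAC) are tried, and the helper names `codon_position` and `substitute` are hypothetical.

```python
from typing import Tuple

# Only the standard-genetic-code entries needed for this example.
CODON_TABLE = {"GAT": "D", "GAC": "D", "GTT": "V", "GTC": "V"}


def codon_position(cdna_pos: int) -> Tuple[int, int]:
    """Map a 1-based coding position to (codon number, position in codon)."""
    if cdna_pos < 1:
        raise ValueError("coding positions are 1-based")
    return (cdna_pos - 1) // 3 + 1, (cdna_pos - 1) % 3 + 1


def substitute(codon: str, pos_in_codon: int, alt: str) -> str:
    """Return the codon with the base at pos_in_codon (1-based) replaced by alt."""
    return codon[: pos_in_codon - 1] + alt + codon[pos_in_codon:]


codon_number, pos_in_codon = codon_position(3062)
print(codon_number, pos_in_codon)

# The reference codon is an assumption: both aspartate codons are tried.
for ref_codon in ("GAT", "GAC"):
    alt_codon = substitute(ref_codon, pos_in_codon, "T")
    print(ref_codon, CODON_TABLE[ref_codon], "->", alt_codon, CODON_TABLE[alt_codon])
```

Position 3062 falls on the second base of codon 1021, and an A>T change at that base turns either aspartate codon into a valine codon, which is consistent with the reported p.D1021V.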
CNVs are widely distributed in the human genome and are an important pathogenic factor in human diseases. Pathogenic CNVs can cause intellectual disability, growth retardation, autism, various birth defects, leukemias, and tumors. Determining the copy number and the breakpoint positions of the variant region are two crucial aspects of CNV detection. With the advancement of technology, more and more techniques have emerged for CNV detection, but different technology platforms and their corresponding computing strategies differ greatly in the accuracy of the detected CNV copy number and breakpoint position. The CNVseq method uses genome-wide data, whereas this study uses data from genomic target regions. Although both methods for detecting CNVs are based on the circular binary segmentation algorithm, they differ in data correction and comparison. For these reasons, the breakpoint positions obtained by the two methods are not very consistent; in our study, the breakpoint positions identified by the two methods all differ at kilobase-level resolution. This study uses CNVkit software, which detects CNVs using a read-depth method. Therefore, in addition to using the original data, we also simulated different sequencing depths for detection of CNV copy number and breakpoint position. As the depth decreases, the number of missed regions increases, and a sufficient number of reads helps to reduce the rate of missed detection. Different depths have no significant effect on the detection of breakpoint locations. Although exome sequencing and target capture sequencing are based on similar capture sequencing technology, the differences between them in experiments and bioinformatics analysis are usually still significant. Factors such as the GC content of the probes, the initial DNA concentration, and even the temperature of chip hybridization may affect the number of reads captured by each probe and lead to differences in capture efficiency, depth, and coverage. WES can indeed accurately detect CNVs above 1 Mb, but our approach, which uses a specific panel to detect these common chromosomal CNVs, is extremely cost-effective.

CONCLUSION
In summary, we provide a diagnostic detection tool that combines capture arrays and NGS to capture the coding regions of 3043 genes associated with 4013 diseases and to detect 148 chromosomal abnormalities by targeting specific regions. The evaluation results suggest that our method has high accuracy and stability. Compared with traditional genetic testing methods, it integrates known data about single-gene diseases and frequent chromosomal abnormalities to achieve a "one-step" solution for detecting genetic variants. In our study, perhaps because of high GC content, missing enrichment probes, and other reasons, 154 CDS regions still could not be covered at all. This incomplete coverage may be improved by using a high concentration of capture probes that cover difficult-to-enrich regions 28,29. This technology can potentially be used in diagnostic testing to provide an effective basis for clinical diagnosis and genetic counseling and to improve the detection rate of diseases.

Fig. 1 Relationship between sequencing depth and coverage in CDS regions. The columns indicate the proportional distributions of CDS regions with different sequencing depths for sample S79 (157× average), sample S80 (231× average), sample S81 (277× average), and sample S82 (380× average), respectively (refer to the left coordinate). The solid dots (circles) represent the average coverage in CDS regions with different sequencing depths (refer to the right coordinate).

Fig. 2 Evaluation of the stability of our method. Venn diagram of S77 (a) and S78 (b) sequenced three times in the same batch. Venn diagram of S77 (c) and S78 (d) sequenced three times in three batches.

Table 1. List of 4013 diseases that can be detected by the designed chip.

Table 2. List of 148 chromosomal abnormalities that can be detected by the designed chip.

Table 3. Details of CNV detection results at different sequencing depths.

Table 4. Structural analysis of four mutant proteins.