Medium-coverage DNA sequencing in the design of the genetic association study

Xu, Chao; Zhang, Ruiyuan; Shen, Hui; Deng, Hong-Wen

doi:10.1038/s41431-020-0656-2

Article
Published: 26 May 2020

European Journal of Human Genetics volume 28, pages 1459–1466 (2020)

Abstract

DNA sequencing is a widely used tool in genetic association studies. Sequencing cost remains a major concern in sequencing-based studies, although next generation sequencing has dramatically decreased the cost and increased the efficiency of sequencing. The choice of sequencing depth and sample size largely determines the final study investment and performance. Many studies have sought a cost-effective sequencing depth that achieves a given sequencing accuracy at minimal cost. The strategies previously studied fall into two groups: (1) single-stage designs, which sequence all samples at either high (>~30×) or low (<~10×) depth; and (2) two-stage designs, which sequence an affordable number of individuals at high coverage and then a large sample at low coverage. However, few studies have examined the performance of medium-coverage (10–30×) sequencing for genetic association studies, where the optimum sequencing depth may lie. In this study, using a published simulation framework, we comprehensively compared medium-coverage sequencing (MCS) with single- and two-stage high-/low-coverage sequencing in terms of the power and type I error of variant discovery and association testing. We found that, for a given sequencing effort, MCS yielded comparable discovery power and better type I error control than the best (highest power) scenarios of the other high- and low-coverage single-stage or two-stage designs. However, MCS was less competitive than the other designs in association power, especially for rare variants and when the sequencing investment was limited.

Introduction

DNA sequencing is a widely used tool in genetic association studies, which aim to identify genes or regions that contribute to specific diseases and traits. Sequencing cost remains a major concern in DNA sequencing studies [1], although next generation sequencing (NGS) has dramatically decreased the cost and increased the efficiency of sequencing [2]. The main factor that determines sequencing cost is the sequencing depth/coverage, defined as the average number of times a nucleotide in the genome is sequenced in one experiment [1]. Here, we use the terms coverage and depth interchangeably in this sense, although coverage has also been used to indicate the breadth of a sequenced genome.

A number of studies have sought a cost-effective sequencing design that achieves a given sequencing accuracy at minimal cost [3,4,5,6]. The strategies previously studied fall into two groups: single-stage designs and two-stage designs. A single-stage design sequences all samples at either high (>~30×) or low (<~10×) depth. A two-stage design first sequences an affordable, small number of individuals at high coverage to create a reference panel, then sequences a large sample at low coverage and performs dense imputation based on the locally constructed reference. Using a local reference panel has been shown to help identify rare variants that may be missed in other populations [7,8,9]. Many studies have focused on the single-stage design. For example, Flannick et al. [3] developed a statistical framework to compare the sensitivity and specificity of low-coverage (0.5–4×) sequencing and SNP (single nucleotide polymorphism) arrays (100 k to 2.5 M), individually and jointly. Pasaniuc et al. [4] studied the performance of extremely low-depth (0.1–0.5×) sequencing in genome-wide association studies. In addition, Li et al. [10] systematically compared the performance of low- and high-depth (2–30×) sequencing for studies of complex traits. Regarding the two-stage design, Xu et al. [11] simulated the entire process of a sequencing-based association study and showed that two-stage sequencing was an effective approach under certain conditions, such as for the discovery of rare variants.

However, few studies have examined the performance of medium-coverage (10–30×) sequencing for genetic association studies, where the optimum sequencing depth may lie. Several previous studies have suggested that medium coverage may perform well in NGS studies. Shen et al. [5] demonstrated that medium depth could achieve optimal detection power under a case-only sequencing design. Nielsen et al. [12] concluded that, with a large sample size, medium or low coverage might be more cost-effective than other study designs. Xu et al. [11] noted a similar trend when comparing low-coverage, high-coverage, and two-stage sequencing in NGS studies: medium depth may be the optimal design in real applications. Recently, Gilly et al. [13] compared genotype accuracy at depths of 15×, 22.5×, and 30× by downsampling reads from a cohort of 100 samples. Their results demonstrated that 15× sequencing could achieve near-perfect sensitivity and quality for rare SNP calling and genotyping compared with 30× sequencing.

Given the lack of research on the performance of medium-coverage sequencing (MCS) in genetic association studies, we comprehensively compared medium-depth (10–30×) sequencing with high-/low-depth sequencing in terms of variant discovery and association testing power through a simulation study.

Materials and methods

Simulation data

We generated sets of MCS data using a simulation framework published in a previous study [11]. Independent case-control scenarios with equally distributed samples of European populations were considered under an additive genetic model. Given a sequencing coverage and sample size, samples of sequencing data with SNPs in a 100 kb region were generated based on chromosome 22 of the reference genome. A length of 100 kb is long enough to cover a linkage disequilibrium (LD) block in European populations. Commonly used Illumina paired-end sequencing reads with a read length of 125 bp were generated by ART [14]. SNPs with minor allele frequency (MAF) ≥ 0.05, 0.01 ≤ MAF < 0.05, and MAF < 0.01 were defined as common, low-frequency, and rare SNPs, respectively. Among the simulated SNPs, 15 causal SNPs were randomly selected, with the effect size (deleterious only) of each selected SNP set so that it explained ~1% of the variance. In the generated simulation data, the actual variance explained by all causal variants was 10.9%. We set the disease prevalence to 9.3%, which is similar to the prevalence of type 2 diabetes in the United States [15]. The disease associations for the simulated SNPs were simulated by HAPGEN2 [16] and tested by PLINK (v1.07) [17], PLINK/SEQ (v0.10, https://atgu.mgh.harvard.edu/plinkseq/), and SKAT [18]. The procedure was repeated 1000 times for each scenario. The complete simulation code is available at https://github.com/xu1912/spS-Gas.
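The three frequency classes defined above can be written as a small helper. This is only an illustrative sketch: the function name `classify_snp` is hypothetical and is not taken from the published simulation code; only the MAF thresholds (0.05 and 0.01) come from the text.

```python
def classify_snp(maf: float) -> str:
    """Return the frequency class of a SNP from its minor allele frequency.

    Thresholds follow the definitions in the text:
    common: MAF >= 0.05, low-frequency: 0.01 <= MAF < 0.05, rare: MAF < 0.01.
    """
    # A minor allele frequency of a polymorphic site lies in (0, 0.5].
    if not 0.0 < maf <= 0.5:
        raise ValueError("MAF of a polymorphic site must lie in (0, 0.5]")
    if maf >= 0.05:
        return "common"
    if maf >= 0.01:
        return "low-frequency"
    return "rare"


# Hypothetical MAF values chosen to sit on or near the class boundaries.
example_classes = [classify_snp(m) for m in (0.20, 0.05, 0.03, 0.01, 0.004)]
```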

We considered 16 MCS scenarios, which are presented in Table 1. We selected four levels of sequencing depth to cover the typical settings for medium coverage: 12×, 16×, 20×, and 24×. The sample size ranged from 200 to 4000 so that the sequencing effort was comparable with the previously published results, which included 12 high-coverage, 14 low-coverage, and 51 two-stage (high coverage used for imputation in low coverage) scenarios.

Table 1 Medium-coverage sequencing scenarios considered.

The generated data were compared with publicly available results for high-/low-coverage sequencing scenarios, which were produced by the same simulation framework and settings. We kept all parameters the same except for the coverage and sample size. More details on the simulation framework and settings can be found in the previous article [11].

Model evaluation

We evaluated the scenarios with respect to the power, type I error, and false discovery rate (FDR) of variant discovery and association under the constraint of study-wide sequencing investment. For a fixed length of sequenced genome content, the sequencing investment can be represented by the sequencing effort, defined as the product of the sequencing coverage and the sample size of the study [10, 19]. The unit of sequencing effort is the person depth (pd), defined as 1× coverage for one person; one kpd is a sequencing effort of 1000 pd. Among all the scenarios, we picked several levels of increasing sequencing effort: 4, 8, 12, 16, 24, 32, and 48 kpd. Each effort level included multiple scenarios with different combinations of coverage and sample size. At each level, we compared the highest power among the medium-coverage scenarios with the best (highest power) among the other (low-coverage, high-coverage, and two-stage sequencing) scenarios using the optimum power ratio (\(\mathrm{OPR} = \frac{\max\{\text{power of MC designs}\}}{\max\{\text{power of other designs}\}}\)). For the same highest-power scenarios, we computed the type I error ratio (TR) of MCS to the other sequencing options.
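The sequencing effort, OPR, and TR defined above can be sketched in a few lines of Python. The class and function names (`Scenario`, `opr_and_tr`) are hypothetical and are not part of the published simulation code. The two example scenarios use the low-frequency variant association figures reported in the Results for 48 kpd (MCS 4000@12 and low-coverage 6000@8); a real comparison would include every scenario at that effort level.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Scenario:
    design: str       # "MC" for medium coverage; any other label counts as "other"
    coverage: float   # sequencing depth (x)
    n_samples: int    # number of sequenced individuals
    power: float      # power, as a proportion
    type1: float      # type I error, as a proportion

    @property
    def effort_kpd(self) -> float:
        # Sequencing effort = coverage x sample size, in units of 1000 person depth.
        return self.coverage * self.n_samples / 1000.0


def opr_and_tr(scenarios: List[Scenario], effort_kpd: float) -> Tuple[float, float]:
    """Return (OPR, TR) at one effort level.

    OPR = max power of MC designs / max power of other designs.
    TR  = type I error of the best MC scenario / that of the best other scenario.
    """
    at_level = [s for s in scenarios if math.isclose(s.effort_kpd, effort_kpd)]
    mc = [s for s in at_level if s.design == "MC"]
    other = [s for s in at_level if s.design != "MC"]
    if not mc or not other:
        raise ValueError("need at least one MC and one other scenario at this effort")
    best_mc = max(mc, key=lambda s: s.power)
    best_other = max(other, key=lambda s: s.power)
    opr = best_mc.power / best_other.power
    # TR is undefined when the comparison scenario has zero type I error.
    tr = best_mc.type1 / best_other.type1 if best_other.type1 > 0 else math.nan
    return opr, tr


# Values quoted in the Results for low-frequency variant association at 48 kpd.
scenarios = [
    Scenario("MC", 12, 4000, power=0.6413, type1=0.1718),
    Scenario("low-coverage", 8, 6000, power=0.5761, type1=0.2171),
]
opr, tr = opr_and_tr(scenarios, effort_kpd=48)
print(f"OPR = {opr:.2f}, TR = {tr:.2f}")
```

An OPR above 1 means that the best medium-coverage scenario had higher power than the best alternative at that effort, and a TR below 1 means that it did so with a lower type I error.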

We evaluated two aspects of power, the variant discovery power and the association testing power, and the corresponding two aspects of type I error. The variant discovery power was defined as the proportion of identified variants among the total variants. The variant discovery type I error was defined as the proportion of falsely identified variants among the total non-variant sequences. The association power was defined as the proportion of statistically significant variants among the total true causal variants. A logistic regression model was used to test the association of the variants, with Bonferroni correction for multiple testing. In addition, the rare variants were examined by the region-based association test SKAT as follows. We divided each simulated 100 kb sequence into nine regions of 20 kb using a sliding window with a step of 10 kb. Only rare variants were considered in each region. Multiple testing was adjusted by Bonferroni correction. The SKAT result for a specific region was checked against whether the region contained any rare causal variants.
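The region layout used for the SKAT analysis can be reproduced with a short sketch: 20 kb windows moved in 10 kb steps across a 100 kb sequence give nine regions. The function names, the nominal level `alpha = 0.05`, and the causal positions below are assumptions made for illustration; the text does not state the nominal significance level, and the positions are hypothetical.

```python
from typing import List, Sequence, Tuple

Region = Tuple[int, int]  # half-open interval [start, end) in base pairs


def sliding_windows(seq_len: int, window: int, step: int) -> List[Region]:
    """Return all windows of length `window` that fit entirely in [0, seq_len)."""
    return [(start, start + window)
            for start in range(0, seq_len - window + 1, step)]


def contains_causal(region: Region, causal_positions: Sequence[int]) -> bool:
    """True if any causal variant position falls inside the region."""
    start, end = region
    return any(start <= pos < end for pos in causal_positions)


regions = sliding_windows(seq_len=100_000, window=20_000, step=10_000)

# Bonferroni correction over the region-based tests.
alpha = 0.05  # assumed nominal level, not stated in the text
bonferroni_threshold = alpha / len(regions)

# Hypothetical rare causal variant positions, for illustration only.
causal_positions = [15_000, 72_500]
truth = [contains_causal(r, causal_positions) for r in regions]
print(len(regions), bonferroni_threshold, truth)
```

Because adjacent windows overlap by 10 kb, a causal variant away from the two ends of the sequence falls in two regions, so each such variant marks two regions as truly associated.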

Results

We first assessed the performance of MCS in SNP discovery using the simulation methods described above and compared it with single-stage high-coverage, low-coverage, and two-stage scenarios. For MCS, at the same sequencing effort, sequencing more subjects at a lower coverage usually resulted in a higher discovery power, especially for rare variants (Table 2 and Fig. 1). For example, at an effort of 48 kpd, the rare variant discovery power increased from 76.82% to 81.98% when the sample size increased from 2000 to 4000 and the coverage decreased from 24× to 12× (Table 2). Relative to the other types of sequencing coverage at the same sequencing effort, MCS did not show a clear advantage in variant detection power. MCS led only at the effort level of 12 kpd, and by a narrow margin (Fig. 1). Even so, a two-stage scenario (100@30 + 4000@2) reached a higher power with less sequencing effort than MCS at 12 kpd. At the other effort levels, medium coverage resulted in detection power close to the best (highest power) of the other scenarios, most of which were low-coverage, as shown in Table 3. We used the optimum power ratio (OPR) to compare the best power of MCS with that of the other sequencing designs at the same sequencing effort. In Table 3, most of the OPRs are > 0.9, except at 4 kpd. Similar results were observed for the detection of rare variants (Supplementary Table 1). At 4 kpd, the maximum detection power of MCS was only 68.30% of that of the other coverages (38.35% vs 56.15% from low-coverage). On the other hand, MCS had better type I error control at most effort levels. In particular, when the effort exceeded 10 kpd, MCS outperformed the other sequencing options, with a lower type I error and comparable power in variant detection (Supplementary Table 1). For instance, among the 48 kpd scenarios, the optimum power of MCS was very close to that of the other coverages (87.52% vs 88.01% from low-coverage), with a lower corresponding type I error (0.01% vs 0.02%). At the other effort levels with OPR > 0.9, the type I error of MCS was also much lower than that of the other scenarios with a higher power, such as at 8, 16, and 24 kpd, where the type I error ratio (TR) was 18.45%, 33.33%, and 1.36%, respectively (Table 3). Further, at 12 kpd, MCS yielded not only higher power but also a much lower type I error, with a TR of ~0.01 (0.01% vs 0.77%).

Table 2 Power and type I error rate of variant discovery and association in medium-coverage scenarios.

**Fig. 1: Sequencing investment and discovery power of variants.**

Table 3 Optimum detection and association power comparison between MCS and other sequencing designs.

For the association analysis, we first evaluated the power of MCS in testing the association of common variants. For designs using medium coverage, the sample size played a more important role in association than the choice of sequencing depth. At a given sequencing effort, the larger sample size always produced higher power (Supplementary Table 1). When the coverage axis was collapsed (as depicted in Fig. 2a), an increase in sample size led to a power gain (Fig. 2a, c). However, when the sample size axis was collapsed (as depicted in Fig. 2b), the benefit of using a higher coverage was very limited (Fig. 2b, c). In addition, MCS was less competitive than the other sequencing coverages in identifying disease-associated variants. The OPR was only 24.71% (7.42% vs 30.04% from low-coverage) when the sequencing effort was fixed at 4 kpd. As the sequencing effort increased from 4 to 48 kpd, the OPR rose from 0.25 to 0.92 (Table 3). We observed a similar trend for the type I error of association testing using MCS relative to the other sequencing coverages: the higher the sequencing effort, the closer the type I error of MCS was to that of the other sequencing options, although the type I error of MCS was always lower (Table 3). In Fig. 3a, we plotted the common variant association power against the type I error for the optimum-power scenarios of MCS and of the other sequencing depths at the same sequencing efforts. The points from the optimum MCS and the other sequencing coverages lie approximately on the same line, which also gives a general idea of how a sequencing-based genetic association study performs under our model assumptions. Similar results were observed for the association test of low-frequency variants (Table 3 and Fig. 3b). However, MCS was able to achieve high power to identify low-frequency variants with a lower type I error than the other sequencing options. When the sequencing effort was large, at 48 kpd, MCS yielded higher power (64.13% from 4000@12 vs 57.61% from low-coverage 6000@8) and a 21% lower type I error in the comparison of the same scenarios (17.18% vs 21.71%).

**Fig. 2: Common variant association power versus sequencing depth and sample size.**

**Fig. 3: Common and low-frequency variant association power and type I error.**

Further, we examined the performance of MCS in rare variant association testing using SKAT. For the MCS scenarios at the same sequencing effort, the larger the sample size, the higher the association power (Table 4). Increasing the sample size was more beneficial than increasing the sequencing coverage, not only for testing common and low-frequency variants but also for rare variants. In the optimum-power scenarios of the other sequencing coverages, the testing power rose quickly to ~90% once the sequencing effort exceeded 8 kpd. However, the best power of MCS was far lower than that of the other designs at the same investment, most of which were low-coverage and two-stage designs, as shown in Supplementary Table 2. The OPRs were all less than 0.5 until the sequencing effort increased to 32 kpd (Supplementary Table 2). Unlike in the general association test for common and low-frequency variants, the rare variants were grouped into regions and tested by region. As Fig. 4 shows, medium coverage may reach the same power as other designs with a lower type I error; compare, for instance, the top-right end of the MCS designs (4000@12) with the bottom-left end of the best-power other designs (low-coverage 2000@2). Nevertheless, MCS was not efficient in rare variant association studies with respect to the sequencing investment.

Table 4 Power and type I error rate of region-based rare variant association in MCS.

**Fig. 4: Rare variant association power and type I error.**

Discussion

In this study, we investigated the efficiency of sequencing studies using medium coverage compared with single-stage low-/high-coverage and two-stage sequencing designs. Because we assumed a disease prevalence of ~9.3% and 15 causal variants, our results are generalizable to the study of complex diseases, such as type 2 diabetes, but not to rare diseases or single-gene disorders. Our results showed that, for a given sequencing investment, the variant discovery power of MCS was close to, but did not exceed, the best power of the other schemes. However, medium coverage yielded fewer type I errors in variant discovery than the best (highest power) of the other designs. Further, in most cases the variant association testing power of medium coverage was not comparable with the best power of the other designs.

Based on our findings, we can add to existing guidelines on designing a cost-effective sequencing study for a given sequencing investment. If the study goal is only to detect variants/mutations and there is sufficient sequencing effort (≥12 kpd), medium coverage is applicable in addition to low-coverage and appropriate two-stage designs. Medium coverage can reach nearly the same power and is less error-prone than low-coverage and two-stage designs. However, if the study goal is not only to detect variants but also to identify disease-associated variants, MCS may not be appropriate; low-coverage and two-stage schemes should be adopted instead. In summary, we provide a recommendation table for the selection of low-/medium-/high-coverage and two-stage sequencing when designing a genetic association study (Supplementary Table 3).
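As a rough illustration, the two rules stated in the paragraph above can be encoded as a lookup. This sketch covers only those two rules and is not a reproduction of Supplementary Table 3; the function name, the goal labels, and the treatment of discovery-only studies below 12 kpd (where only the low-coverage and two-stage options are returned) are assumptions.

```python
from typing import List


def candidate_designs(goal: str, effort_kpd: float) -> List[str]:
    """Return candidate designs for a study goal and sequencing effort (in kpd).

    goal: "discovery" (variant detection only) or "association"
          (detection plus identification of disease-associated variants).
    """
    if goal not in {"discovery", "association"}:
        raise ValueError("goal must be 'discovery' or 'association'")
    # Low-coverage and two-stage designs are recommended for both goals.
    designs = ["low-coverage", "two-stage"]
    # Medium coverage is added only for discovery with sufficient effort (>= 12 kpd).
    if goal == "discovery" and effort_kpd >= 12:
        designs.append("medium-coverage")
    return designs


print(candidate_designs("discovery", 48))
print(candidate_designs("association", 48))
```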

Unlike the study of SNPs, the detection of structural variations (SVs), such as copy number variations (CNVs) and indels, using NGS relies mostly on the sequencing depth. Gilly et al. found that genotype accuracy is substantially more dependent on sequencing depth for indels than for SNPs [13]. In a recent study, the performance of several CNV detection tools varied with the sequencing depth, with high coverage resulting in high sensitivity and specificity [20]. We expect sequencing coverage to play a more important role for SVs than it did in the present study of SNPs. High-coverage sequencing may be more appropriate for studies targeting SVs. The cost-efficiency of various sequencing coverages in the study of indels, CNVs, and other SVs needs further investigation.

By comparing MCS scenarios with single-stage low-/high-coverage and two-stage sequencing scenarios, we have investigated nearly all the sequencing depths used in practice. Some general guidelines can be drawn for designing a sequencing-based association study. First, the sample size is more important than the sequencing depth for the association test. Second, increasing the sequencing coverage does help to decrease the error of variant detection. One limitation of our study is that the default parameter settings in the simulation were used without fine-tuning, so the observed type I error may not represent the true value in real-world analysis. Nevertheless, our study still faithfully presents the relative performance of MCS and the other sequencing coverages.

References

1. Sims D, Sudbery I, Ilott NE, Heger A, Ponting CP. Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet. 2014;15:121–32.
2. Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet. 2016;17:333–51.
3. Flannick J, Korn JM, Fontanillas P, Grant GB, Banks E, Depristo MA, et al. Efficiency and power as a function of sequence coverage, SNP array density, and imputation. PLoS Comput Biol. 2012;8:e1002604.
4. Pasaniuc B, Rohland N, McLaren PJ, Garimella K, Zaitlen N, Li H, et al. Extremely low-coverage sequencing and imputation increases power for genome-wide association studies. Nat Genet. 2012;44:631–5.
5. Shen Y, Song R, Pe’er I. Coverage tradeoffs and power estimation in the design of whole-genome sequencing experiments for detecting association. Bioinformatics. 2011;27:1995–7.
6. Wu Y, Zheng ZL, Visscher PM, Yang J. Quantifying the mapping precision of genome-wide association studies using whole-genome sequencing data. Genome Biol. 2017;18:86.
7. Danjou F, Zoledziewska M, Sidore C, Steri M, Busonero F, Maschio A, et al. Genome-wide association analyses based on whole-genome sequencing in Sardinia provide insights into regulation of hemoglobin levels. Nat Genet. 2015;47:1264.
8. Deelen P, Menelaou A, van Leeuwen EM, Kanterakis A, van Dijk F, Medina-Gomez C, et al. Improved imputation quality of low-frequency and rare variants in European samples using the ‘Genome of The Netherlands’. Eur J Hum Genet. 2014;22:1321–6.
9. Kreiner-Moller E, Medina-Gomez C, Uitterlinden AG, Rivadeneira F, Estrada K. Improving accuracy of rare variant imputation with a two-step imputation approach. Eur J Hum Genet. 2015;23:395–400.
10. Li Y, Sidore C, Kang HM, Boehnke M, Abecasis GR. Low-coverage sequencing: implications for design of complex trait association studies. Genome Res. 2011;21:940–51.
11. Xu C, Wu K, Zhang JG, Shen H, Deng HW. Low-, high-coverage, and two-stage DNA sequencing in the design of the genetic association study. Genet Epidemiol. 2017;41:187–97.
12. Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011;12:443–51.
13. Gilly A, Suveges D, Kuchenbaecker K, Pollard M, Southam L, Hatzikotoulas K, et al. Cohort-wide deep whole genome sequencing and the allelic architecture of complex traits. Nat Commun. 2018;9:4674.
14. Huang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing read simulator. Bioinformatics. 2012;28:593–4.
15. Centers for Disease Control and Prevention. National Diabetes Statistics Report: estimates of diabetes and its burden in the United States. Atlanta, GA: US Department of Health and Human Services; 2014.
16. Su Z, Marchini J, Donnelly P. HAPGEN2: simulation of multiple disease SNPs. Bioinformatics. 2011;27:2304–5.
17. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–75.
18. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89:82–93.
19. Wetterstrand KA. DNA sequencing costs: data from the NHGRI Genome Sequencing Program (GSP); 2016. https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost.
20. Zhang L, Bai W, Yuan N, Du Z. Comprehensively benchmarking applications for detecting copy number variation. PLoS Comput Biol. 2019;15:e1007069.

Acknowledgements

The work was partially supported by grants from the National Institutes of Health (R01 AR059781, R01 MH104680, R01 AR069055, U19 AG055373, and P20GM109036), Edward G. Schlieder Endowment, and startup funds from Tulane University. This research was supported in part using high performance computing (HPC) resources and services provided by Technology Services at Tulane University, New Orleans, LA.

Author information

Authors and Affiliations

Center for Bioinformatics and Genomics, Department of Biostatistics and Data Science, School of Public Health and Tropical Medicine, Tulane University, New Orleans, LA, 70112, USA
Chao Xu, Ruiyuan Zhang, Hui Shen & Hong-Wen Deng
Department of Biostatistics and Epidemiology, The University of Oklahoma Health Sciences Center, Oklahoma City, OK, 73104, USA
Chao Xu
School of Basic Medical Science, Central South University, 410013, Changsha, China
Hong-Wen Deng

Corresponding author

Correspondence to Hong-Wen Deng.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Xu, C., Zhang, R., Shen, H. et al. Medium-coverage DNA sequencing in the design of the genetic association study. Eur J Hum Genet 28, 1459–1466 (2020). https://doi.org/10.1038/s41431-020-0656-2

Received: 20 December 2019
Revised: 29 April 2020
Accepted: 12 May 2020
Published: 26 May 2020
Issue Date: October 2020
DOI: https://doi.org/10.1038/s41431-020-0656-2
