Human whole-exome genotype data for Alzheimer’s disease

The heterogeneity of whole-exome sequencing (WES) data generation methods presents a challenge to joint analysis. Here we present a bioinformatics strategy for joint-calling 20,504 WES samples collected across nine studies and sequenced using ten capture kits in fourteen sequencing centers in the Alzheimer’s Disease Sequencing Project. The joint-genotyped variant call format (VCF) file contains only positions within the union of the capture kits. The VCF was then processed specifically to account for batch effects arising from the use of different capture kits across studies. We identified 8.2 million autosomal variants. 96.82% of the variants are high quality and are located in 28,579 Ensembl transcripts; 41% of the variants are intronic, and 1.8% have CADD scores > 30, indicating high predicted pathogenicity. Here we show that our new strategy can generate high-quality data from these diversely generated WES samples. The improved ability to combine data sequenced in different batches benefits the whole genomics research community.


Tables and Figures:
Table 1: Can you provide the number of cases and controls per study/capture to understand better how the study is distributed across captures?
New figures: Because the study describes a modified calling pipeline, can a figure be added showing the step-by-step workflow of processing the data? Because the pipeline largely uses the existing "VCPA" pipeline, it would help disentangle the genome- and exome-specific components. I don't think the details of the VCPA pipeline should be presumed.
More importantly, it would be helpful to have a PCA plot highlighting diversity within this sample. Separately, it would be helpful to see a plot showing the relatedness within this sample as an additional data set descriptor.

Analysis:
From Figure 1, it appears that the overall overlap between captures is very low (0.586) when they should all target the exome. Can you provide more context as to why that is and how it affects the overall use of the data? Furthermore, what does this mean for capturing the coding region? Rather than focusing on the capture kit, we can look at the coverage of the coding region instead.
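For concreteness, a base-level overlap figure like the 0.586 cited above could be computed roughly as follows — a sketch over intervals on a single chromosome (real capture BEDs span the genome, and a tool such as `bedtools jaccard` does this directly); the toy kit coordinates are invented:

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) half-open intervals on one chromosome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def total_bases(intervals):
    """Total number of bases covered by a set of intervals."""
    return sum(e - s for s, e in merge_intervals(intervals))

def jaccard(a, b):
    """Base-level Jaccard index between two capture designs."""
    a, b = merge_intervals(a), merge_intervals(b)
    inter = 0
    for s1, e1 in a:
        for s2, e2 in b:
            inter += max(0, min(e1, e2) - max(s1, s2))
    union = total_bases(a) + total_bases(b) - inter
    return inter / union if union else 0.0

# Toy example: two "capture kits" sharing part of their targets.
kit1 = [(100, 200), (300, 400)]
kit2 = [(150, 250), (300, 400)]
print(round(jaccard(kit1, kit2), 3))  # 0.6
```

A low Jaccard between two kits does not by itself imply poor coverage of the coding region, which is why looking at CDS coverage directly, as suggested, is the more informative check.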
Is there a reason why 20x coverage is used to determine the quality of these data? Can you compare the distribution of coding variants (syn, mis, PTVs) in these data to other published data (like gnomAD)?
As this document is a "Data Descriptor," the overall text is light on analyses. Given the observation of batch effects (by capture), can you highlight how you would recommend that these data be analyzed for gene discovery? A useful "QC" metric compares the rates of synonymous variants between cases and controls on a gene-by-gene basis. This ensures that the QC is performed adequately. Can you provide this figure as an additional bit of analysis?
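The gene-by-gene synonymous comparison asked for here could be tabulated along these lines — a sketch assuming per-variant annotations with gene, consequence, and per-group allele counts are at hand; the record keys and the toy numbers are hypothetical, not from the manuscript:

```python
from collections import defaultdict

def synonymous_rates(variants, n_cases, n_controls):
    """Per-gene synonymous allele rates in cases vs controls.

    `variants` is an iterable of dicts with hypothetical keys:
    'gene', 'consequence', 'ac_case', 'ac_control'.
    Rates are normalized by allele number (2 * sample count).
    """
    case_ac = defaultdict(int)
    ctrl_ac = defaultdict(int)
    for v in variants:
        if v["consequence"] == "synonymous_variant":
            case_ac[v["gene"]] += v["ac_case"]
            ctrl_ac[v["gene"]] += v["ac_control"]
    genes = set(case_ac) | set(ctrl_ac)
    return {
        g: (case_ac[g] / (2 * n_cases), ctrl_ac[g] / (2 * n_controls))
        for g in genes
    }

# Toy example: balanced per-gene rates suggest adequate QC.
variants = [
    {"gene": "APOE", "consequence": "synonymous_variant", "ac_case": 10, "ac_control": 12},
    {"gene": "APOE", "consequence": "missense_variant", "ac_case": 3, "ac_control": 1},
]
rates = synonymous_rates(variants, n_cases=100, n_controls=120)
print(rates["APOE"])  # (0.05, 0.05)
```

Since synonymous variants are expected to be neutral with respect to case status, systematic case/control differences in these rates would flag residual batch effects rather than biology.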
In that vein, the absolute numbers of coding variants (nhets, nhoms, nhomvar) in each cohort are useful metrics to see how comparable the sample is before and after QC. Can you provide that as a separate figure in the text (split by cohort, capture)?
Code: Usually, I do not over-emphasize code in a review of a paper, but because this is a resource paper, I took a deeper look. The code and the Bitbucket repo should be directly highlighted in the text (https://bitbucket.org/NIAGADS/vcpa-pipeline/src/master/VCPA/). From a look at the repo, most scripts are from an earlier iteration focusing on whole-genome sequence data (2018?). It would be helpful to highlight the workflow and scripts that are specific to the exome data and how to use them. The Supplementary Note includes two commands from GATK, and I would prefer those to be directly in the codebase. Furthermore, I cannot access the wiki, which might be helpful in understanding how to run the code.

Minor:
There needs to be some editing and clarification of the acronyms in the text. A few examples include QC'ed and VCPA in the background.
Can a variant-only file with allele frequencies, like gnomAD, be provided to the community?
Reviewer #3 (Remarks to the Author): The Data Descriptor manuscript by Leung et al. describes an expanded cohort of whole-exome sequencing data from over 20,000 individuals with a diagnosis of Alzheimer's disease.
The manuscript is comprehensive, well written and describes an exceptional resource that is important for the research community.
I have a series of minor questions and comments about the manuscript that are focused on the improvement of the manuscript.
1. It would be informative to add the number of samples in the study that used each capture technology to Table 3.

2.
For the comparison of 20x sequence coverage (Figure 3 and text), it is unclear what filtering of reads, if any, was performed prior to assessment (i.e., were duplicate reads, unpaired reads, and/or reads failing a Q30 mapping-score threshold excluded?).
3. One area that is unclear in the current manuscript is the quality of genotype calls across sites. I think the manuscript would benefit from a per-variant-site assessment of the genotype call rate when genotype calls with GQ < 20 (or some justifiable threshold) are set to missing; a histogram across variant sites would be informative (potentially also stratified by capture platform). This will really get to the crux of the challenge with this call set, which is that, due to the differences in capture technologies, one would not expect high-quality data for all samples at all sites.
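The GQ-based missingness assessment suggested here could be sketched as follows, assuming per-site genotype-quality values have already been extracted from the VCF; the threshold and the toy matrix are illustrative only:

```python
import numpy as np

def call_rate_after_gq_filter(gq_matrix, gq_threshold=20):
    """Per-site call rate after setting genotypes with GQ < threshold to missing.

    gq_matrix: (n_sites, n_samples) array of genotype qualities,
    with np.nan marking genotypes that are already missing.
    Returns the per-site fraction of samples retaining a call.
    """
    called = ~np.isnan(gq_matrix) & (gq_matrix >= gq_threshold)
    return called.sum(axis=1) / gq_matrix.shape[1]

# Toy example: 2 sites x 4 samples.
gq = np.array([[99, 15, 30, np.nan],
               [99, 99, 99, 99]], dtype=float)
print(call_rate_after_gq_filter(gq))  # [0.5 1. ]
```

A histogram of these per-site rates (e.g. via `matplotlib.pyplot.hist`), drawn separately per capture platform, would surface the site classes where some kits simply cannot deliver confident calls.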
4. The statement in the discussion that the dataset is 'free of batch effects' is an overstatement, and I am sure that it would be possible to identify batch effects by call rate (see above comment). The authors may want to revise this statement to reflect that they have employed a harmonisation approach to alignment and variant calling to minimise the impact of different analytical pipelines.

Response:
We thank the reviewer for the comments. We have included results from these statistical tests in "Data quality - WES CRAMs" under RESULTS and in Supplementary Tables 1-10.
Sequencing platforms or centers have a bigger effect on "Quality of reads, Q30", as this metric reflects the quality of the reads from the sequencers, which is directly dependent on the sequencing methods/protocols. The drastic differences observed for Q30 indicate that there are indeed differences at the sequencing-methods level; yet after processing all the data using the same bioinformatics approach proposed in this paper (VCPA-WES), the variability contributed by sequencing centers and platforms is much reduced (as seen from the metrics at the mapped, duplicated, or paired reads level).
We next performed the same analyses on 20X coverage instead of the CRAM metrics, stratified by either the sequencing centers (Figure 5a) or the sequencing platforms (Figure 5b); all pairwise tests (except one) were statistically significant (Bonferroni-corrected p < 0.05). This is as expected, because the 20X coverage was calculated on the capture regions per sample, and the differences are due to the selection of various sequencing methods/protocols.
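Pairwise comparisons of this kind could be run along the following lines — a sketch only, since the text above does not name the test used; a two-sided Mann-Whitney U test with Bonferroni correction stands in, and the group labels and coverage values are invented:

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

def pairwise_bonferroni(groups, alpha=0.05):
    """All pairwise Mann-Whitney U tests between batches, with a
    Bonferroni-corrected significance cutoff.

    `groups` maps a batch label (e.g. a sequencing center) to a list
    of per-sample 20x coverage values.
    Returns {(label_a, label_b): (p_value, significant)}.
    """
    pairs = list(combinations(sorted(groups), 2))
    cutoff = alpha / len(pairs)  # Bonferroni correction
    results = {}
    for a, b in pairs:
        p = mannwhitneyu(groups[a], groups[b], alternative="two-sided").pvalue
        results[(a, b)] = (p, bool(p < cutoff))
    return results

# Toy example: two well-separated coverage distributions.
res = pairwise_bonferroni({"centerA": [85.0, 86.0, 84.5, 85.5] * 5,
                           "centerB": [95.0, 96.0, 94.5, 95.5] * 5})
print(res[("centerA", "centerB")][1])  # True
```

With many centers the number of pairs, and hence the Bonferroni cutoff, changes automatically with `len(pairs)`.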

Reviewer #2 (Remarks to the Author):
This paper describes a joint calling strategy for the meta-analysis of Alzheimer's Disease exomes, considering the capture kits used and detecting batch effects during the QC process. The data has been deposited. Although the generation of genomic resources for the community is appreciated, joint calling and sample QC is now standardized through existing tools. This is especially true for a GATK-based pipeline and is usually presented as part of an analysis paper rather than a standalone paper.
Examples of such datasets from the last five years include the gnomAD project, the Autism Sequencing Consortium, Deciphering Developmental Disorders, TOPMed, and UK10K studies, all of which handle multiple capture kits or include whole-genome and whole-exome sequencing data. Therefore, while this paper can still be valuable (as shared community resources are), a few areas need additional clarity and detail to maximize value for the scientific community.
Tables and Figures: 1. Table 1: Can you provide the number of cases and controls per study/capture to understand better how the study is distributed across captures?
Response: We thank the reviewer for this suggestion. We have added to Table 1 the number of cases and controls per study/capture as requested. The total numbers of samples are shown, with the proportion of cases displayed in percentages.

2.
New figures: Because the study describes a modified calling pipeline, can a figure be added showing the step-by-step workflow of processing the data? Because the pipeline largely uses the existing "VCPA" pipeline, it would help disentangle the genome- and exome-specific components. I don't think the details of the VCPA pipeline should be presumed.

Response:
We thank the reviewer for this comment. We have included the step-by-step workflow of the pipeline in Figure 1. The parts highlighted with a "star" are exome-specific components (VCPA-WES).
3. Demographic information, especially related to genetic ancestry, should be reworded slightly in consideration of https://nap.nationalacademies.org/resource/26902/interactive/. More importantly, it would be helpful to have a PCA plot highlighting diversity within this sample. Separately, it would be helpful to see a plot showing the relatedness within this sample as an additional data set descriptor.

Response:
We understand the reviewer's concern and have addressed it through a detailed description of our population substructure assessment of the datasets and the assignment of ancestry group labels to distinct subsets of the data. We refer specifically to the guidelines set forth by the NASEM report, and we verified that our approach met the standards set forth. It should be noted that, as these data have already been made available for public use, the ancestry group labels used reflect the data that have been released and cannot practically be changed without a significant secondary release of the data; however, the nomenclature used to identify these ancestry groups has been discussed and justified in accordance with the approach set forth in the report.
The description of how the ancestry subgroups were identified is given in "Quality control (QC) protocol for WES samples" under METHODS, and the suggested PCA plot is shown in Figure 1 (RESULTS: "Population substructure analysis").
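For readers reproducing this kind of substructure check, a minimal PCA on hard-call genotypes can be sketched as follows. This is not the manuscript's pipeline (tools such as PLINK or smartpca would normally be used); it shows only the standard allele-frequency scaling followed by SVD:

```python
import numpy as np

def genotype_pca(genotypes, n_components=2):
    """PCA on a genotype matrix (samples x variants, values 0/1/2).

    Columns are centered at 2p and scaled by sqrt(2p(1-p)), the usual
    normalization for population-structure PCA; sample coordinates
    come from the SVD of the scaled matrix.
    """
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0                       # allele frequency per variant
    scale = np.sqrt(2.0 * p * (1.0 - p))
    keep = scale > 0                               # drop monomorphic variants
    z = (g[:, keep] - 2.0 * p[keep]) / scale[keep]
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]  # sample PC coordinates
```

On a matrix with two genetically distinct sample groups, the first component separates the groups, which is exactly what the requested diversity plot would display.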
Analysis: 4. From Figure 1, it appears that the overall overlap between captures is very low (0.586) when they should all target the exome. Can you provide more context as to why that is and how it affects the overall use of the data? Furthermore, what does this mean for capturing the coding region? Rather than focusing on the capture kit, we can look at the coverage of the coding region instead.
Response: Though the capture kits were in theory designed to target exomes, some of the manufacturing companies included miRNA or lncRNA sequences in the captures. This can be seen in an independent effort by UCSC, which curates the diverse contents and file sizes of captures, summarized under "Exome Capture Probesets and Targeted Region": https://genome.ucsc.edu/cgi-bin/hgTrackUi?g=exomeProbesets&hgsid=1066518897_QJL7hsBNGEhTnw6DgqcZaMG4YFB2 .
To show the fundamental differences in design across the captures, we included the original number of capture regions as well as the number of lifted-over capture regions per capture (Table 3, new columns 1 and 2). As can be seen, one capture (Roche_SeqCap_EZ_Exome_Probes_v3.0_Target_Enrichment_Probes) has 1.5 times the number of probes compared to most of the others.
Coverage of the coding region (exons), as suggested by the reviewer, was computed on 100 randomly selected samples (covering different study-sequencing_center-capture combinations). These values were compared against the original 20x coverage calculated on the capture regions in a scatter plot (Supplementary Fig. 1). The 20x coverage at the capture regions (88.5±7.8) is similar to that of the coding regions (88.2±2.5) across samples, though the 20x coverage values in coding regions are more uniform across samples. We have added these results in "Data quality - WES Compressed Reference-oriented Alignment Maps (CRAMs)" under RESULTS.
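The capture-region vs coding-region comparison boils down to the same per-sample statistic evaluated over two different interval sets. A sketch of that statistic — in practice per-base depths would come from mosdepth or `samtools depth`, and the toy contig below is invented:

```python
import numpy as np

def pct_at_20x(depths, regions):
    """Percentage of bases within `regions` covered at >= 20x.

    depths: per-base depth array for one contig (hypothetical input).
    regions: list of (start, end) half-open intervals, e.g. the
    capture kit's targets or the Ensembl CDS intervals.
    """
    covered = total = 0
    for start, end in regions:
        seg = depths[start:end]
        covered += int((seg >= 20).sum())
        total += end - start
    return 100.0 * covered / total

# Toy contig: first half well covered, second half shallow.
depth = np.array([25] * 50 + [10] * 50)
print(pct_at_20x(depth, [(0, 100)]))  # 50.0
```

Running the same function with the capture BED and with the CDS BED for each sample yields the two axes of the scatter plot described above.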

5.
Is there a reason why the 20x coverage is used to determine the quality of these data?
Response: This is standard/common practice for WES. We have included in the text (under "Processing WES using VCPA at the individual sample level" in METHODS) multiple citations supporting the 20x coverage threshold chosen for assessing the quality of the WES data.
In the released dataset, gnomADv2.1 (GRCh38) contains 17.2 million variants in 125,748 WES samples, as compared to 8.2 million variants in 20,504 WES samples in the current dataset.
10.6 million (62%) and 4.0 million (49%) are coding variants in gnomADv2.1 and the current WES data, respectively. We next compared the percentage of variants annotated as 'high impact' in each set. There are 711,024 (6.7% of coding) and 220,987 (5.5% of coding) high-impact coding variants in gnomADv2.1 and the current dataset, respectively.
We have added these into "Annotation results" under RESULTS.
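The proportions quoted above are internally consistent, as a quick back-of-the-envelope check shows (all numbers taken directly from the text):

```python
# Counts as reported: total variants, coding variants, high-impact coding variants.
gnomad_total, gnomad_coding, gnomad_high = 17.2e6, 10.6e6, 711_024
adsp_total, adsp_coding, adsp_high = 8.2e6, 4.0e6, 220_987

print(round(100 * gnomad_coding / gnomad_total))    # 62  (% coding, gnomAD)
print(round(100 * adsp_coding / adsp_total))        # 49  (% coding, this dataset)
print(round(100 * gnomad_high / gnomad_coding, 1))  # 6.7 (% high impact, gnomAD)
print(round(100 * adsp_high / adsp_coding, 1))      # 5.5 (% high impact, this dataset)
```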
7. As this document is a "Data Descriptor," the overall text is light on analyses. Given the observation of batch effects (by capture), can you highlight how you would recommend that these data be analyzed for gene discovery? A useful "QC" metric compares the rates of synonymous variants between cases and controls on a gene-by-gene basis. This ensures that the QC is performed adequately. Can you provide this figure as an additional bit of analysis?
Response: As part of the documentation for data releases to ADSP, we typically include a "Quick Start Guide" that provides a detailed overview of recommended post-processing and analysis of the data. We have included this text in a new section in METHODS entitled "Recommended Post-QC Processing for Analysis", where we describe these steps. We hope that this adds more analysis-relevant detail, so that the paper is no longer light on analysis.
In addition, in response to this and other concerns of the reviewers, we have added more extensive analyses of the shared data to better characterize the resulting quality of the data from QC. We have taken the reviewer's suggestion of comparing the rates of synonymous variants between cases and controls and added a new table summarizing these values exome-wide (Table 4). We have also included a figure showing the gene-by-gene synonymous variant frequency differences between cases and controls for the studies/QC subsets with the biggest sample sizes (Figure 7). We have added text to "Data quality - variants" (RESULTS) describing what we observed. We have also constructed a table of genotype counts before and after QC for each QC subset (Supplementary Table 11) and shown in Figure 8 the "Ratio of Post-QC to Pre-QC Genotype Counts" across QC subsets for all cohort-capture combinations. The ratio of all genotypes is fairly consistent across different QC subsets (0.922-0.956).
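The ratio reported in Figure 8 reduces to a simple computation per subset — a sketch with invented counts (per the response, the real table sums REF/REF, REF/ALT, and ALT/ALT genotypes over coding variants):

```python
def postqc_ratio(pre_counts, post_counts):
    """Ratio of post-QC to pre-QC genotype counts per QC subset.

    Each argument maps a QC subset (or cohort-capture combination) to a
    (n_refref, n_refalt, n_altalt) tuple summed over coding variants.
    """
    return {
        k: sum(post_counts[k]) / sum(pre_counts[k])
        for k in pre_counts
    }

# Toy example with hypothetical counts for one cohort-capture combination.
pre = {"cohortA_kit1": (1_000_000, 50_000, 5_000)}
post = {"cohortA_kit1": (960_000, 47_000, 4_700)}
print(round(postqc_ratio(pre, post)["cohortA_kit1"], 3))  # 0.959
```

Ratios falling in a narrow band across subsets, as in the 0.922-0.956 range reported, indicate that QC removed comparable fractions of genotypes from each batch.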

This paper describes a joint calling strategy for the meta-analysis of Alzheimer's Disease exomes, considering the capture kits used and detecting batch effects during the QC process. The data has been deposited. Although the generation of genomic resources for the community is appreciated, joint calling and sample QC is now standardized through existing tools. This is especially true for a GATK-based pipeline and is usually presented as part of an analysis paper rather than a standalone paper.
Examples of such datasets from the last five years include the gnomAD project, the Autism Sequencing Consortium, Deciphering Developmental Disorders, TOPMed, and UK10K studies, all of which handle multiple capture kits or include whole-genome and whole-exome sequencing data. Therefore, while this paper can still be valuable (as shared community resources are), a few areas need additional clarity and detail to maximize value for the scientific community.
Tables and Figures: 1. Table 1: Can you provide the number of cases and controls per study/capture to understand better how the study is distributed across captures?
Response: We thank the reviewer for this suggestion. We have added to Table 1 the number of cases and controls per study/capture as requested. The total numbers of samples are shown, with the proportion of cases displayed in percentages.
Reviewer response: This table is indeed very useful for understanding the dataset structure. However, I am really sorry, but I do not understand what the denominator is for the proportion of cases displayed in percentages. All percentages are very close to 100% and therefore do not sum to 100% either by row or by column, and they are too often close to 100% to represent the % of cases among the sequenced samples within a given study/capture combination. This should be clarified.
2. New figures: Because the study describes a modified calling pipeline, can a figure be added showing the step-by-step workflow of processing the data? Because the pipeline largely uses the existing "VCPA" pipeline, it would help disentangle the genome- and exome-specific components. I don't think the details of the VCPA pipeline should be presumed.
Response: We thank the reviewer for this comment. We have included the step-by-step workflow of the pipeline in Figure 1. The parts highlighted with a "star" are exome-specific components (VCPA-WES).
More importantly, it would be helpful to have a PCA plot highlighting diversity within this sample. Separately, it would be helpful to see a plot showing the relatedness within this sample as an additional data set descriptor.
Response: We understand the reviewer's concern and have addressed it through a detailed description of our population substructure assessment of the datasets and the assignment of ancestry group labels to distinct subsets of the data. We refer specifically to the guidelines set forth by the NASEM report, and we verified that our approach met the standards set forth. It should be noted that, as these data have already been made available for public use, the ancestry group labels used reflect the data that have been released and cannot practically be changed without a significant secondary release of the data; however, the nomenclature used to identify these ancestry groups has been discussed and justified in accordance with the approach set forth in the report.
The description of how the ancestry subgroups were identified is given in "Quality control (QC) protocol for WES samples" under METHODS, and the suggested PCA plot is shown in Figure 1 (RESULTS: "Population substructure analysis").
Reviewer response: Thank you for these added explanations. The legend of Figure 1, panels a to c, is difficult to understand, because it seems that for panels b and c the targeted population is represented by black dots, but for panel a the EUR samples are not in black; instead, CHB samples are highlighted, which is confusing. Is it possible to keep the targeted population highlighted in black for each panel? Besides, on panel b, could you explain why some samples that clearly colocalize with the JPT or CHB populations (plus two others very close) are included in the AA study? Do they match the +/-3SD rule? If so, is this rule too lenient?
From the figures, it is hard to understand whether the ancestry analysis defines a partition of non-overlapping groups, or whether these ancestry groups are potentially embedded within each other. Could you clarify this point?
Analysis: 4. From Figure 1, it appears that the overall overlap between captures is very low (0.586) when they should all target the exome. Can you provide more context as to why that is and how it affects the overall use of the data? Furthermore, what does this mean for capturing the coding region? Rather than focusing on the capture kit, we can look at the coverage of the coding region instead.
Response: Though the capture kits were in theory designed to target exomes, some of the manufacturing companies included miRNA or lncRNA sequences in the captures. This can be seen in an independent effort by UCSC, which curates the diverse contents and file sizes of captures, summarized under "Exome Capture Probesets and Targeted Region": https://genome.ucsc.edu/cgi-bin/hgTrackUi?g=exomeProbesets&hgsid=1066518897_QJL7hsBNGEhTnw6DgqcZaMG4YFB2 .
To show the fundamental differences in design across the captures, we included the original number of capture regions as well as the number of lifted-over capture regions per capture (Table 3, new columns 1 and 2). As can be seen, one capture (Roche_SeqCap_EZ_Exome_Probes_v3.0_Target_Enrichment_Probes) has 1.5 times the number of probes compared to most of the others.
Coverage of the coding region (exons), as suggested by the reviewer, was computed on 100 randomly selected samples (covering different study-sequencing_center-capture combinations). These values were compared against the original 20x coverage calculated on the capture regions in a scatter plot (Supplementary Fig. 1). The 20x coverage at the capture regions (88.5±7.8) is similar to that of the coding regions (88.2±2.5) across samples, though the 20x coverage values in coding regions are more uniform across samples. We have added these results in "Data quality - WES Compressed Reference-oriented Alignment Maps (CRAMs)" under RESULTS.
Reviewer response: ok, this is very helpful.
5. Is there a reason why the 20x coverage is used to determine the quality of these data?
Response: This is standard/common practice for WES. We have included in the text (under "Processing WES using VCPA at the individual sample level" in METHODS) multiple citations supporting the 20x coverage threshold chosen for assessing the quality of the WES data.
Reviewer response: ok, although all those references are getting old. With new sequencing standards and lower prices, the depth of sequencing is a lot higher now than it was in 2011. 20x is still rather low to call ultra-rare variants or singletons with confidence. It could be noisy if the allelic balance is not a perfect 50%. It is possible that asking for 30x (or more) would be more informative and more discriminant across platforms and capture kits.
10.6 million (62%) and 4.0 million (49%) are coding variants in gnomADv2.1 and the current WES data, respectively. We next compared the percentage of variants annotated as 'high impact' in each set. There are 711,024 (6.7% of coding) and 220,987 (5.5% of coding) high-impact coding variants in gnomADv2.1 and the current dataset, respectively.
We have added these into "Annotation results" under RESULTS.
Reviewer response: ok, this answers the comment.
7. As this document is a "Data Descriptor," the overall text is light on analyses. Given the observation of batch effects (by capture), can you highlight how you would recommend that these data be analyzed for gene discovery? A useful "QC" metric compares the rates of synonymous variants between cases and controls on a gene-by-gene basis. This ensures that the QC is performed adequately. Can you provide this figure as an additional bit of analysis?
Response: As part of the documentation for data releases to ADSP, we typically include a "Quick Start Guide" that provides a detailed overview of recommended post-processing and analysis of the data. We have included this text in a new section in METHODS entitled "Recommended Post-QC Processing for Analysis", where we describe these steps. We hope that this adds more analysis-relevant detail, so that the paper is no longer light on analysis.
In addition, in response to this and other concerns of the reviewers, we have added more extensive analyses of the shared data to better characterize the resulting quality of the data from QC. We have taken the reviewer's suggestion of comparing the rates of synonymous variants between cases and controls and added a new table summarizing these values exome-wide (Table 4). We have also included a figure showing the gene-by-gene synonymous variant frequency differences between cases and controls for the studies/QC subsets with the biggest sample sizes (Figure 7). We have added text to "Data quality - variants" (RESULTS) describing what we observed.
Reviewer response: ok, this answers the comment.
We have also shown in Figure 8 the "Ratio of Post-QC to Pre-QC Genotype Counts" across QC subsets for all cohort-capture combinations (see also Supplementary Table 11). The ratio of all genotypes is fairly consistent across different QC subsets (0.922-0.956).
Reviewer response: ok, this answers the comment.
9. Code: Usually, I do not over-emphasize code in a review of a paper, but because this is a resource paper, I took a deeper look. The code and the Bitbucket repo should be directly highlighted in the text (https://bitbucket.org/NIAGADS/vcpa-pipeline/src/master/VCPA/). From a look at the repo, most scripts are from an earlier iteration focusing on whole-genome sequence data (2018?). It would be helpful to highlight the workflow and scripts that are specific to the exome data and how to use them. The Supplementary Note includes two commands from GATK, and I would prefer those to be directly in the codebase. Furthermore, I cannot access the wiki, which might be helpful in understanding how to run the code.
Response: We thank the reviewer for checking this. We have indicated in the text (INTRODUCTION, DISCUSSION, CODE AVAILABILITY) where the code can be accessed. Public access is available for both the code repository (https://bitbucket.org/NIAGADS/vcpa-pipeline/src/master/) and the wiki. A diagram summarizing the steps, with the scripts that are new in the VCPA-WES pipeline highlighted with "stars", has been included in Figure 1. VCPA-WES-specific scripts are labeled with "_WES" in their file names for distinction.
Reviewer response: ok, this answers the comment.
11. Can a variant-only file with allele frequencies like gnomAD be provided to the community?
Response: We thank the reviewer for asking this question. Unfortunately, due to NIH policy, we are not allowed to share the allele frequencies publicly. Users can obtain such data if they apply for the data via DSS; the full VCF files with complete QC information are behind the firewall for users with approved applications. Research has shown that it is possible to use aggregated allele frequencies to infer an individual's identity (Homer et al., 2008; Liu et al., 2018). gnomAD is a database containing aggregated allele frequencies for subjects with multiple phenotypes, whereas in our study all subjects share the same phenotype. Leaking a subject's information would therefore directly imply that the subject is part of an Alzheimer's Disease-related study and has a significant likelihood of being an Alzheimer's disease patient.
Reviewer response: ok, this answers the comment.

6.
Can you compare the distribution of coding variants (syn, mis, PTVs) in these data to other published data (like gnomAD)?
Response: We thank the reviewer for this suggestion. We have downloaded gnomADv2 (https://gnomad.broadinstitute.org/downloads#v2-liftover) and performed the analyses. In the released dataset, gnomADv2.1 (GRCh38) contains 17.2 million variants in 125,748 WES samples, as compared to 8.2 million variants in 20,504 WES samples in the current dataset.
8. In that vein, the absolute numbers of coding variants (nhets, nhoms, nhomvar) in each cohort are useful metrics to see how comparable the sample is before and after QC. Can you provide that as a separate figure in the text (split by cohort, capture)?
Response: Based on the reviewer's suggestions, we have constructed a table of REF/REF, REF/ALT, and ALT/ALT genotype counts summed across coding variants before and after QC for each QC subset (Supplementary Table 11).