PacBio long-read amplicon sequencing enables scalable high-resolution population allele typing of the complex CYP2D6 locus

The CYP2D6 enzyme is estimated to metabolize 25% of commonly used pharmaceuticals and is of intense pharmacogenetic interest due to the polymorphic nature of the CYP2D6 gene. Accurate allele typing of CYP2D6 has proved challenging due to frequent copy number variants (CNVs) and paralogous pseudogenes. SNP arrays, qPCR and short-read sequencing have been employed to interrogate CYP2D6; however, these technologies are unable to capture longer-range information. Long-read sequencing using the PacBio Single Molecule Real Time (SMRT) sequencing platform has yielded promising results for CYP2D6 allele typing. However, previous studies have been limited in scale and have employed nascent data processing pipelines. We present a robust data processing pipeline, “PLASTER”, for accurate allele typing of SMRT-sequenced amplicons. We demonstrate the pipeline by typing CYP2D6 alleles in a large cohort of 377 Solomon Islanders. This pharmacogenetic method will improve drug safety and efficacy through screening prior to drug administration.

The authors describe a pipeline for accurate allelotyping of the CYP2D6 gene using long-read amplicons. CYP2D6 is an important locus due to its relevance in the metabolism of drugs, and the presence of a pseudogene complicates the process of sequencing this locus. There have been many published papers on genotyping CYP2D6 using long-read sequencing technologies (as summarized by the authors in Table 1). The strengths of this paper, as summarized by the authors, are a robust data processing pipeline and the large number of samples that they genotyped. Overall, the paper is well written and easy to follow. Since this problem has been tackled by several groups, I think that the authors could have done additional analysis in order to distinguish their work from previously published work and demonstrate the broad utility and robustness of their pipeline.
1. There have been several pipelines developed for genotyping CYP2D6 from Illumina WGS data (e.g. Cyrius, https://www.nature.com/articles/s41397-020-00205-5). How does the targeted long-read-based sequencing compare to the Illumina-based allelotyping? One advantage of the Illumina WGS-based genotyping is that copy number does not need to be measured separately using qPCR.
2. In Table 1, the authors list previous studies for CYP2D6 typing. The Buermans et al. study is listed as having the pipeline available. It would be useful to analyze the data generated in this study using this pipeline for direct comparison.
3. Along the same lines as (2) above, the authors could analyze previously published targeted long read data for CYP2D6 to demonstrate that their pipeline works for different types of long read amplicon data and sequencing technologies.
Reviewer #2 (Remarks to the Author):
Dear authors, I wish to congratulate the authors on this interesting work. The manuscript is well written and the experiments well described and executed. The application of long-read sequencing to these challenging parts of the genome is highly suitable. I have only minor comments. Sincerely,

Wouter De Coster
Page 4, line 107 mentions the accuracy of long-read sequencing platforms. These error rates are, at least for ONT, highly outdated, and 2D consensus sequencing has also been deprecated since 2017. The company claims the modal accuracy of the latest base callers and chemistry is at 99%, but conservatively an error rate of ~5% would be a reasonable estimate of currently produced data. Finding citable references for such a moving target is challenging, but https://www.nature.com/articles/s41588-021-00865-4 reports a median accuracy of 11.6%. Better results are obtained with the 1D^2 method: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02255-1
On page 14, line 337, recommendations are made for the number of reads that are necessary for the identification of fusion alleles and variant phasing. It is, however, not clear to me how these numbers were determined, and whether a downsampling analysis was performed to obtain these estimates.
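For illustration, a downsampling analysis of the kind asked about here could look roughly like the sketch below. It is not part of PLASTER and does not reproduce how the manuscript's recommendations were obtained. The function name, the per-read allele labels, the toy read counts and the `min_support` threshold are all hypothetical assumptions. The idea is to subsample reads to several depths without replacement and record how often a minor allele (for example, a fusion allele) is still supported by enough reads.

```python
import random
from collections import Counter


def detection_rate_by_depth(read_labels, target, depths,
                            min_support=3, n_trials=200, seed=0):
    """Estimate how often `target` is detected at each read depth.

    read_labels : list with one allele label per read (hypothetical input)
    target      : label of the allele of interest, e.g. a fusion allele
    depths      : read depths to subsample to
    min_support : reads needed to call the allele (assumed threshold)
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rates = {}
    for depth in depths:
        if depth > len(read_labels):
            raise ValueError(f"depth {depth} exceeds available reads")
        hits = 0
        for _ in range(n_trials):
            # Subsample without replacement, as when downsampling a read set.
            sample = rng.sample(read_labels, depth)
            if Counter(sample)[target] >= min_support:
                hits += 1
        rates[depth] = hits / n_trials
    return rates


# Toy data (an assumption): 1000 reads, 10% of which come from a fusion allele.
reads = ["major"] * 900 + ["fusion"] * 100
rates = detection_rate_by_depth(reads, "fusion",
                                depths=[2, 10, 25, 50, 100, 1000])
for depth, rate in rates.items():
    print(f"{depth:>5} reads: fusion allele detected in {rate:.0%} of trials")
```

The smallest depth at which the detection rate reaches a chosen level (say, 95% of trials) would then give an empirical read-count recommendation for that allele fraction and support threshold.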
The authors used GATK HaplotypeCaller for variant calling, which may be appropriate for high-quality HiFi reads, but I wonder about the reason not to use a variant caller tailored to long-read sequencing, e.g. DeepVariant, Clair, or Longshot. While this probably does not affect the results in a major way, some of these tools additionally leverage the phasing of variants or are more appropriate for the error types in long-read sequencing, and I believe it would be relevant to comment on the choice of variant caller.
The authors describe a pipeline for accurate allelotyping of the CYP2D6 gene using long-read amplicons. CYP2D6 is an important locus due to its relevance in the metabolism of drugs, and the presence of a pseudogene complicates the process of sequencing this locus. There have been many published papers on genotyping CYP2D6 using long-read sequencing technologies (as summarized by the authors in Table 1). The strengths of this paper, as summarized by the authors, are a robust data processing pipeline and the large number of samples that they genotyped. Overall, the paper is well written and easy to follow. Since this problem has been tackled by several groups, I think that the authors could have done additional analysis in order to distinguish their work from previously published work and demonstrate the broad utility and robustness of their pipeline.
1. There have been several pipelines developed for genotyping CYP2D6 from Illumina WGS data (e.g. Cyrius, https://www.nature.com/articles/s41397-020-00205-5). How does the targeted long-read-based sequencing compare to the Illumina-based allelotyping? One advantage of the Illumina WGS-based genotyping is that copy number does not need to be measured separately using qPCR.
We thank the reviewer for this comment, and we agree that Cyrius in particular appears to be a good approach to genotyping CYP2D6 from short-read WGS. While Cyrius has the advantage of not requiring a separate qPCR assay, we note that Cyrius does not have any means of typing novel CYP2D6 alleles, and running short-read WGS in place of multiplexed long-read amplicon sequencing for CYP2D6 genotypes alone is not cost-effective. We have added this information and the reference to Cyrius in the introduction (lines 96-98).
2. In Table 1, the authors list previous studies for CYP2D6 typing. The Buermans et al. study is listed as having the pipeline available. It would be useful to analyze the data generated in this study using this pipeline for direct comparison.
We thank the reviewer for this comment. The "pipeline" used in the Buermans et al. study that we noted as being available is PacBio's long amplicon analysis protocol, which carries out only a portion of the analysis steps that are part of our much more comprehensive pipeline (essentially clustering/phasing and chimera removal, but not allele typing). We have updated Table 1 with the column "End-to-end Pipeline available" to make it clearer that no previous publications have released an end-to-end pipeline. We do agree that it would be useful to have a benchmark of a pipeline used by another study, so we have implemented, as closely as possible, the complete pipeline used by Buermans et al. (which is only a subset of our more complete end-to-end pipeline) and have tested our data using this pipeline. We have updated the results section at lines 261-264 to reflect this benchmark, as well as the methods section at lines 553-562.
3. Along the same lines as (2) above, the authors could analyze previously published targeted long read data for CYP2D6 to demonstrate that their pipeline works for different types of long read amplicon data and sequencing technologies.
We thank the reviewer for this comment. The pipeline as it exists now is specific to PacBio data. In particular, the pre-processing stage uses PacBio-specific tools (ccs, lima). However, the pipeline has been released as free and open-source software, and it would be feasible for a separate pre-processing module to be written for other sequencing technologies. Additionally, no other study to our knowledge has publicly released PacBio CYP2D6 sequencing data.