Cat-D: a targeted sequencing method for the simultaneous detection of small DNA mutations and large DNA deletions with flexible boundaries

We developed a targeted DNA sequencing method capable of detecting a comprehensive panel of DNA mutations, including small DNA mutations and large DNA deletions with unknown or flexible boundaries. The method, Cat-D, directly identifies large DNA deletions without relying on sequencing coverage to make the genotype calls. We applied the method to simultaneously detect 10 small DNA mutations in β-thalassemia and 2 large genomic deletions in α-thalassemia in 10 genomic DNA samples. Cat-D was performed on 8 genomic DNA samples in duplicate. The 18 Cat-D samples were combined in one sequencing run. In total, 216 genotype calls were made, and 215 of them were accurate. No false negative genotype calls were made. One false positive genotype call was made on one target mutation in one experimental duplicate of a genomic DNA sample. In summary, Cat-D can be developed into a robust, high-throughput and cost-effective method suitable for population-based carrier screens.

malfunctions and result in mild or severe anemia. However, the same defect also provides a degree of protection against malaria. The selective survival advantage of heterozygous carriers is believed to be responsible for perpetuating the mutations in human populations 6. Thalassemia is one of the most common genetic disorders worldwide, posing an important public health problem in Southeast Asia, the Mediterranean region, the Middle East and sub-Saharan Africa 5. Approximately 18% of the population in Guangxi province (China) 7 and 3% of the Singaporean population (https://www.kkh.com.sg/HealthPedia/Pages/PregnancyPlanningForBabyThalassaemia.aspx) are carriers of thalassemia mutations. In contrast to the point mutations commonly seen in β-thalassemia 8,9, the common mutations found in α-thalassemia are a series of large DNA deletions (~3-40 kb) 10 (Supplementary Fig. S1). Although the carrier rate for thalassemia mutations is extraordinarily high, a population-based carrier screen is difficult to perform. The experimental techniques used in clinical labs for detecting large DNA deletions in thalassemia 10, such as gap PCR, are low throughput (one test for one patient sample) and not comprehensive (one test for one specific mutation). These techniques are only used for patient DNA diagnosis and are unsuitable for population-based carrier screens. Alternative sequencing approaches, such as Nanopore sequencing 11 and paired-end long-insert Illumina sequencing 12, are capable of detecting large genomic DNA deletions. However, neither is a targeted sequencing method, and both require a suitable target enrichment step if they are to be used for population-based mutation carrier screens. Moreover, neither method is well suited to the clinical detection of small DNA mutations. Illumina paired-end sequencing is not cost-efficient, as paired-end sequencing is not necessary for the detection of small DNA mutations. For Nanopore sequencing, the high sequencing error rate 11 makes it very difficult to apply the method to DNA mutation detection, especially for small DNA mutations.
The strength of padlock capture lies in detecting small DNA mutations such as single-nucleotide polymorphisms (SNPs). It is straightforward to design a padlock probe library targeting a panel of small DNA mutations. However, such a panel cannot include the thalassemia DNA deletions, which are among the most common mutations in human genetic disorders. The length of the DNA region captured by a padlock probe is restricted by the length limits of the synthesized padlock probes 13. For a large DNA deletion with flexible or unknown deletion boundaries, it is difficult and unreliable to design a padlock probe to directly capture the junction region of the deletion (Fig. 1B, the "impossible" design). Alternatively, a series of padlock probes can be designed to cover the deleted region (Fig. 1B, the "Kebab" design). One can imagine that these padlock probes bind to the template DNA and form a "Kebab" shape. Therefore, we named these padlock probes Kebab probes. Kebab probes return negative results from homozygous mutants. However, they cannot distinguish heterozygous mutants from the wild type, which is the most important genotyping information for a population-based carrier screen. Taken together, the large genomic deletions observed in thalassemia represent a special type of mutation that is frequently observed in human genetic disorders but is difficult to detect using conventional sequencing approaches.

Results
Experimental design of Cat-D. We developed a method that uses padlock probes to positively "catch a large deletion" (Fig. 1B, the "Cat-D" design). The method does not rely on a negative readout to "detect" the deletion, nor does it rely on sequencing data to reveal "gene copy number variation". In Cat-D, the first step is a PCR reaction (Fig. 1B, pre-PCR). A pair of PCR primers is designed to amplify the DNA region surrounding the deletion. Because the PCR amplicon length is flexible, designing the PCR primers does not depend on knowing the exact deletion boundaries. Only the mutant allele carrying the large DNA deletion can be amplified. The wild type allele is not amplified because the two primers are too far apart on the wild type allele for the PCR to work. The basic concept of the pre-PCR in Cat-D is the same as that of a commonly used technique called gap PCR. In contrast to gap PCR, one of the two pre-PCR primers in Cat-D carries an adaptor sequence on its 5′-end (Fig. 1B, labeled in red). The adaptor sequence is artificially designed so that it does not exist in the human genome. The adaptor complementary strand is produced only if the PCR works. Because padlock capture is strand-specific, a special padlock probe, the "Cat-D probe" (Fig. 1B), can be designed to capture the pre-PCR product with its extension arm targeting the adaptor complementary strand. The Cat-D probe only works if the PCR works. To avoid detecting the noise associated with non-specific primer binding, which may occur during a PCR reaction, the ligation arm of the Cat-D probe is designed to capture the DNA region immediately downstream of the pre-PCR primer. In summary, genotype calls for large deletions can be made from the padlock capture results of the Cat-D probes together with those of the Kebab probes (Fig. 1C).
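The way the two readouts complement each other can be illustrated with a minimal Python sketch. It assumes, following the Methods, that a positive Cat-D call reports the presence of a deletion and a positive Kebab call reports loss of the commonly deleted region on both alleles. The function name `combine_calls`, the returned labels and the "unexplained" case are hypothetical and are not taken from Fig. 1C.

```python
from typing import Dict


def combine_calls(cat_d_calls: Dict[str, bool], kebab_positive: bool) -> str:
    """Combine per-deletion Cat-D calls with the Kebab call for one sample.

    cat_d_calls maps a deletion name (for example "--SEA") to True when the
    Cat-D probes for that deletion were called positive.
    kebab_positive is True when the Kebab probes report loss of the region.
    """
    detected = sorted(name for name, positive in cat_d_calls.items() if positive)
    if not detected:
        # Hypothetical label: region lost, but no targeted deletion was caught.
        return "unexplained loss of region" if kebab_positive else "wild type"
    if kebab_positive:
        return "both alleles deleted (" + "/".join(detected) + ")"
    return "carrier (" + "/".join(detected) + ")"


# A compound heterozygote: both deletions caught, Kebab region lost.
print(combine_calls({"--FIL": True, "--SEA": True}, kebab_positive=True))
```

Under these assumptions, the Cat-D call alone separates carriers from wild type samples, which is the distinction the Kebab probes cannot make.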
To catch multiple large deletions, multiple primer pairs targeting different deletions can be included in one pre-PCR reaction. Each primer pair targets one deletion and provides one unique adaptor sequence for designing the corresponding Cat-D probe. There is no restriction to the amplicon size of each primer pair. The amplicon sizes of different primer pairs can be similar or different. The pre-PCR product is subjected to the padlock capture of a probe library, which includes Cat-D probes and other padlock probes targeting a comprehensive panel of DNA mutations.
Pre-PCR cycle optimization and test run setup. The pre-PCR product is subjected to padlock capture as a downstream assay. Therefore, the pre-PCR does not have to be completed with full PCR cycles. We first performed gap PCR and successfully detected two thalassemia deletions from patient genomic DNA samples (Fig. 2A). Interestingly, the PCR amplicon sizes from the patient sample (Coriell Biorepository GM10796) were ~1 kb longer than the PCR amplicon sizes estimated based on a previous publication 14 (Fig. 2B). This result further confirmed that the deletion boundaries vary among patient samples. The number of pre-PCR cycles required for Cat-D was then tested. --FIL was successfully detected by Cat-D with a minimum of 16 pre-PCR cycles (Fig. 2C).
We generated a padlock probe library containing 5 padlock probes targeting the Cat-D product of --FIL, 5 padlock probes targeting the Cat-D product of --SEA, 17 Kebab probes targeting the commonly deleted regions in --FIL and --SEA, and 9 padlock probes targeting 10 different small β-thalassemia DNA mutations.
We performed a test run on a collection of 10 human genomic DNA samples (Fig. 2D). This study was approved by the Ethics Committee of Nanyang Technological University. Padlock capture was performed on each sample in duplicate. Two genomic DNA samples from two commonly used human cell lines (293T and HeLa) were regarded as "wild type" samples, as they tested as "wild type" for all the thalassemia mutations included in this study (data not shown). Six α-thalassemia genomic DNA samples and one β-thalassemia genomic DNA sample were included. A special human DNA sample was purchased from Promega (Cat# G304A). The sample was originally included in this study as a wild type control. However, we later realized that Promega G304A is prepared from human whole blood from multiple anonymous donors. The blood samples are only tested as negative for HIV and Hepatitis B. There is no information available regarding the samples' genotypes for thalassemia mutations. Therefore, G304A should be regarded as a special DNA sample without a clear genotype. We included G304A in this study just for the test run. Moreover, our padlock capture duplicates on the sample (G304A.1 and G304A.2) were performed on G304A from two different lots (G304A.1 LOT0000189195; G304A.2 LOT0000219766). Therefore, G304A.1 and G304A.2 should be considered two different DNA samples.
On average, ~184,000 reads were obtained from each sample. To confirm the experimental consistency of the method, we calculated the correlation coefficient between the two duplicates of each sample. The correlation coefficient across the eight experimental duplicates was 0.98 ± 0.01 (Supplementary Fig. S2). This result confirmed the high experimental consistency of the method.
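As an illustration of this consistency check, the following Python sketch computes a correlation coefficient between two duplicates. It assumes that the coefficient is a Pearson correlation calculated on per-probe headcounts, which the text does not state explicitly. The function name `pearson` and the example counts are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long lists of headcounts."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical headcounts of the same five probes in two duplicates.
duplicate_1 = [5200, 310, 4100, 95, 2800]
duplicate_2 = [5050, 290, 4300, 120, 2650]
print(round(pearson(duplicate_1, duplicate_2), 3))
```

A value close to 1 indicates that the probes were captured in similar proportions in both duplicates.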
Large α-thalassemia DNA deletions detected by Cat-D. The raw data (Fig. 3A) clearly showed that the headcounts of the padlock capture products from the Cat-D probes are significantly higher in the samples carrying the corresponding deletions. The headcounts of the Kebab probe capture products are also significantly lower in the samples containing the compound heterozygous deletion (--FIL/--SEA).
To provide a mathematical justification and a computational method for making genotype calls, we established a mathematical method to calculate the genotype scores and make the genotype calls for each sample (Fig. 3B; Methods). The results for --FIL and Kebab are clear-cut (Fig. 3C,E). Negative genotype calls were accurately made on all the wild type samples and the samples expected to be wild type; for example, the β-thalassemia samples (Beta.1 and Beta.2) are expected to be wild type for the α-thalassemia mutations. Positive genotype calls were also accurately made on all the mutant samples. Clear genotype calls were also made for --SEA (Fig. 3D). All the mutant and wild type samples were accurately genotyped. Among the samples "expected" to be wild type, G304A.Lot2 and Beta.1 were genotyped as positive for --SEA (Fig. 3D). G304A is a mixture of genomic DNA isolated from multiple donors, and no information is available regarding the sample's genotype for thalassemia mutations. Based on our genotyping results, it is highly likely that one or more G304A.Lot2 donors are carriers of --SEA. We further confirmed this conclusion by gap PCR (Supplementary Fig. S3). All the genomic DNA samples had been subjected to gap PCR before Cat-D to confirm the samples' genotypes for the α-thalassemia mutations (Supplementary Fig. S3A). In that initial test, each PCR contained 100 ng of genomic DNA and was performed for 35 cycles, and --SEA was not detected in G304A.Lot2. When gap PCR was repeated with 38 cycles and 200 ng of genomic DNA, a clear PCR product for --SEA was detected in G304A.Lot2. This result confirmed the Cat-D genotyping results and showed that Cat-D is more sensitive than gap PCR. For Beta.1, the genotype call is a false positive result. This false positive result can be identified by comparing it with the genotype call made on the duplicate sample (Beta.2).

β-thalassemia point mutations detected by padlock probes. The Cat-D and Kebab probes only occupy a small fraction of the padlock probe library, which also includes other padlock probes targeting small DNA mutations, such as SNPs. In this study, we included padlock probes targeting small β-thalassemia DNA mutations. One of the 10 DNA samples included in this study is a heterozygous mutant in β-thalassemia codon 17 (A > T). The raw data (Fig. 4A) clearly showed that the mutant headcounts are significantly higher in the samples carrying the corresponding mutation. To provide a mathematical justification and a computational method for making the genotype calls, we established a mathematical method to calculate the genotype call (Fig. 4B). In this case, we simply chose 5% as the threshold to make the genotype call for a "minor allele" (Fig. 4B; Methods). The 5% minor allele frequency was determined by analyzing the padlock capture data (Fig. 4C). We calculated the genotype scores and made the genotype calls on all the samples (Fig. 4D). The results show that the method is sensitive and precise for β-thalassemia point mutations. We also included padlock probes targeting other β-thalassemia small mutations in the padlock probe library. Because we do not have mutant genomic DNA samples for these mutations, we expected all the samples included in this study to be wild type for these mutations. Our genotyping results clearly confirmed our expectations (Supplementary Figs S4 and S5).

Discussion
In summary, the test run yielded highly satisfactory results and a strong proof of concept for Cat-D. These results demonstrate that the method is sensitive (0% false negative rate) and precise (very low false positive rate, ~5% for the --SEA mutation). From a clinical point of view, a false positive call is more "acceptable" than a false negative call. When genetic testing is performed on a large population, the majority of the samples are wild type. With a 0% false negative rate, samples that test negative can be reported as wild type, and patients can be informed of their testing results with confidence. For the minority of samples that test positive for a certain mutation, it is feasible for clinical labs to experimentally validate the testing results before "bad news" is released to patients, regardless of the false positive rate of the experimental method. Taken together, Cat-D is a comprehensive (a single test covers a comprehensive panel of genetic disorders) and high-throughput (one sequencing run contains multiple samples) method suitable for population-based carrier screens.

Methods
Primer design. The primer portion of the pre-PCR primers was designed according to the criteria for designing a regular PCR primer. The primers do not bind to repetitive DNA regions in the genome. The primer pairs were confirmed to amplify the target DNA region using a mutant genomic DNA sample carrying the corresponding deletion. For each pre-PCR primer pair, one of the two primers carries the Cat-D adaptor on its 5′-portion. The adaptor sequence does not exist in the human genome. The adaptor sequence was designed to be at
Figure 4 (caption, panels B-D). (B) The mathematical method to calculate the genotype scores and to make the genotype calls on SNPs and other small DNA mutations. (C) Allele frequencies of the padlock capture products. To determine the minor allele frequency used in the data analysis, we calculated the allele frequencies of all the nucleotide positions captured by one padlock probe. The first 20 nucleotides of each sequencing read belong to the ligation arm. The padlock captured region is located between the 21st nucleotide and the 67th nucleotide. For each nucleotide position, we calculated the allele frequency of A, T, C and G. Five percent was selected as the threshold for the minor allele frequency in the data analysis. The position of the β-thalassemia point mutation, codon 17 (A > T), is marked by the red circle. (D) Genotype scores.
Data analysis. We wrote a Perl script to find the exact match between the first 88 nt of a sequencing read and an expected padlock probe capture product. To make the genotype calls on large DNA deletions using the padlock capture data from the Cat-D and Kebab probes, a "standard weight" was calculated for each mutation by taking the average headcount of the four wild type samples (293T.1, 293T.2, HeLa.1 and HeLa.2). The raw genotype score of each sample was then calculated as the headcount of the sample divided by the standard weight. Because Kebab probes "negatively" report the corresponding mutation (homozygous deletion), low headcounts indicate the detection of a mutation. Therefore, the raw genotype scores of the Kebab probes were calculated in reverse (standard weight divided by the headcount of each sample). To make the genotype scores easier to interpret, the sample with the highest raw genotype score in the panel was scored as 100, and the rest of the samples were scored proportionally to their raw genotype scores. The threshold was then calculated (Fig. 3B). A sample with a genotype score higher than the threshold was called positive for the corresponding mutation. For the Cat-D probes, the corresponding mutation is a large DNA deletion. For the Kebab probes, the corresponding mutation is a "homozygous" large DNA deletion.
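The scoring procedure described above can be sketched in Python as follows. This is a minimal illustration rather than the original Perl script. The function names, the `floor` guard against division by zero and the example headcounts are assumptions. The threshold defined in Fig. 3B is not reproduced here, so `call_genotypes` takes it as a parameter.

```python
from collections import Counter
from statistics import mean


def count_headcounts(reads, expected_products, match_len=88):
    """Count reads whose first match_len nt exactly match an expected product."""
    lookup = {seq[:match_len]: probe for probe, seq in expected_products.items()}
    counts = Counter()
    for read in reads:
        probe = lookup.get(read[:match_len])
        if probe is not None:
            counts[probe] += 1
    return counts


def genotype_scores(headcounts, wild_type_samples, reverse=False, floor=1.0):
    """Scale raw genotype scores so that the highest-scoring sample is 100.

    headcounts maps sample name to headcount for one mutation.
    reverse=True is used for Kebab probes (standard weight / headcount).
    floor is an assumption of this sketch: it keeps denominators above zero.
    """
    standard_weight = mean(headcounts[s] for s in wild_type_samples)
    raw = {}
    for sample, count in headcounts.items():
        if reverse:
            raw[sample] = standard_weight / max(count, floor)
        else:
            raw[sample] = count / max(standard_weight, floor)
    top = max(raw.values())
    return {sample: 100.0 * score / top for sample, score in raw.items()}


def call_genotypes(scores, threshold):
    """A sample is positive when its score is higher than the threshold."""
    return {sample: score > threshold for sample, score in scores.items()}


# Hypothetical Cat-D headcounts for one deletion.
wild_type = ["293T.1", "293T.2", "HeLa.1", "HeLa.2"]
cat_d = {"293T.1": 10, "293T.2": 10, "HeLa.1": 10, "HeLa.2": 10, "Mut.1": 1000}
print(call_genotypes(genotype_scores(cat_d, wild_type), threshold=20.0))
```

The threshold of 20.0 in the usage line is a placeholder chosen only to make the example run.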
To make the genotype calls on the point mutations, we used 5% as the threshold to make the genotype call on a "minor allele" (Fig. 4B).
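A minimal Python sketch of such a minor allele call is given below. It assumes that allele frequencies are calculated per read position from the capture products of a single padlock probe, and that an allele counts as a minor allele when its frequency is at least 5%. Whether the cutoff is inclusive is not stated in the text. The function names and the example reads are hypothetical.

```python
from collections import Counter

BASES = "ACGT"


def allele_frequencies(reads, position):
    """Frequencies of A, C, G and T at a 1-based read position."""
    counts = Counter(read[position - 1] for read in reads if len(read) >= position)
    total = sum(counts[base] for base in BASES)
    if total == 0:
        return {base: 0.0 for base in BASES}
    return {base: counts[base] / total for base in BASES}


def minor_alleles(frequencies, threshold=0.05):
    """Alleles other than the most frequent one that reach the threshold."""
    major = max(BASES, key=lambda base: frequencies[base])
    return [base for base in BASES
            if base != major and frequencies[base] >= threshold]


# Hypothetical reads: 60 carry A and 40 carry T at position 3.
reads = ["GGACC"] * 60 + ["GGTCC"] * 40
freqs = allele_frequencies(reads, position=3)
print(freqs, minor_alleles(freqs))
```

With these example reads, T is reported as a minor allele at position 3, which is the pattern expected for a heterozygous sample.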