Supplementary Figure 1: Parameters used by CATCH in default model of hybridization. | Nature Biotechnology

From: Capturing sequence diversity in metagenomes with comprehensive and scalable probe design

CATCH models hybridization between each candidate probe and the target sequences. This lets CATCH decide whether a candidate probe captures (or ‘covers’) a region of a target sequence, and thus find a probe set that achieves a desired coverage of the target sequences under this model. For whole genome enrichment, the desired coverage would typically be 100% of each target sequence.
(a) Relatively conserved regions (for example, a particular gene) in the input sequences can be captured with few probes because, under a model of hybridization, any given probe is likely to capture the observed variation across many or all of the input sequences. Highly variable regions may require many probes because each probe may capture the observed variation across only a small fraction of the input sequences.
(b) By default, CATCH decides whether a probe hybridizes to a region of a target sequence according to the following parameters: a number m of mismatches to tolerate and a length lcf of a longest common substring. CATCH computes the longest common substring with at most m mismatches between the probe and the target subsequence, and decides that the probe hybridizes to the target if and only if the length of this substring is at least lcf. If the parameter i is provided, CATCH additionally requires that the probe and target subsequence share an exact (0-mismatch) match of length at least i. If CATCH decides that the probe hybridizes to a subsequence of the target, it determines that the probe captures the region of the target spanned by the probe (equal in length to the probe) as well as e nt on each side of this region. The cover extension e is a parameter whose value can be specified to CATCH, along with m, lcf, and i. Lower values of m, higher values of lcf, higher values of i, and lower values of e are more conservative and lead to more probe sequences. (For details, see the description of fmap in Online Methods.)
(c) Number of probes required to fully capture 300 genomes of HCV, HIV-1, EBOV, and ZIKV, for varying values of the mismatches and cover extension parameters, with other parameters fixed. Shaded regions are 95% pointwise confidence bands calculated across randomly sampled input genomes.
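The decision rule in (b) can be illustrated with a minimal Python sketch. This is not CATCH's implementation. It assumes the probe is compared with an equal-length target subsequence at a fixed offset with no gaps, and it allows the exact match of length i to lie anywhere in that aligned pair. The function names (longest_run_with_mismatches, hybridizes, covered_range) are hypothetical and chosen only for illustration; the parameter names m, lcf, i, and e follow the caption.

```python
def longest_run_with_mismatches(probe, target, m):
    """Length of the longest aligned stretch of probe/target with at most m mismatches.

    Assumption: probe and target are compared position by position (no gaps).
    """
    best = 0
    left = 0
    mismatches = 0
    for right, (p, t) in enumerate(zip(probe, target)):
        if p != t:
            mismatches += 1
        # Shrink the window from the left until it holds at most m mismatches.
        while mismatches > m:
            if probe[left] != target[left]:
                mismatches -= 1
            left += 1
        best = max(best, right - left + 1)
    return best


def hybridizes(probe, target, m, lcf, i=None):
    """Apply the rule from (b): a stretch of length >= lcf with <= m mismatches,
    plus, if i is given, an exact (0-mismatch) match of length >= i."""
    if len(probe) != len(target):
        raise ValueError("sketch assumes an equal-length, ungapped comparison")
    if longest_run_with_mismatches(probe, target, m) < lcf:
        return False
    if i is not None and longest_run_with_mismatches(probe, target, 0) < i:
        return False
    return True


def covered_range(start, probe_len, e, target_len):
    """Half-open range of the target captured by a probe that hybridizes at
    `start`: the probe-length region plus e nt on each side, clipped to the target."""
    return max(0, start - e), min(target_len, start + probe_len + e)


if __name__ == "__main__":
    probe = "ACGTACGTAC"
    target = "ACGTTCGTAC"  # differs from the probe at one position
    print(hybridizes(probe, target, m=1, lcf=10))       # tolerant setting
    print(hybridizes(probe, target, m=0, lcf=10))       # conservative setting
    print(hybridizes(probe, target, m=1, lcf=10, i=6))  # exact-match requirement
    print(covered_range(100, len(probe), e=5, target_len=1000))
```

In this sketch, lowering m or raising lcf or i makes hybridizes return False for more probe/target pairs, and lowering e shrinks the range returned by covered_range, which matches the caption's statement that such settings are more conservative.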