# Methylation-based enrichment facilitates low-cost, noninvasive genomic scale sequencing of populations from feces

This article has been updated

## Abstract

Obtaining high-quality samples from wild animals is a major obstacle for genomic studies of many taxa, particularly at the population level, as collection methods for such samples are typically invasive. DNA from feces is easy to obtain noninvasively, but is dominated by bacterial and other non-host DNA. The high proportion of non-host DNA drastically reduces the efficiency of high-throughput sequencing for host animal genomics. To address this issue, we developed an inexpensive capture method for enriching host DNA from noninvasive fecal samples. Our method exploits natural differences in CpG-methylation density between vertebrate and bacterial genomes to preferentially bind and isolate host DNA from majority-bacterial samples. We demonstrate that the enrichment is robust, efficient, and compatible with downstream library preparation methods useful for population studies (e.g., RADseq). Compared to other enrichment strategies, our method is quick and inexpensive, adding only a negligible cost to sample preparation. In combination with downstream methods such as RADseq, our approach allows for cost-effective and customizable genomic-scale genotyping that was previously feasible in practice only with invasive samples. Because feces are widely available and convenient to collect, our method empowers researchers to explore genomic-scale population-level questions in organisms for which invasive sampling is challenging or undesirable.

## Introduction

The past decade has witnessed a rapid transformation of biological studies with the continuing development and adoption of massively parallel sequencing technology. This sequencing revolution, however, has thus far had a relatively muted impact on studies of wild nonmodel organisms due largely to the difficulty of obtaining high-quality samples. This problem is particularly salient for endangered animals, cryptic animals, or animals for which it is otherwise difficult, undesirable, or unethical to obtain samples invasively.

Field researchers working with wild animals have explored several noninvasive sample types for DNA analysis including feces, hair, urine, saliva, feathers, skin, and nails1. Of these, feces may be the most readily available in many taxa2. Indeed, since PCR amplification of DNA from feces was first demonstrated in the 1990s3, noninvasive genetic studies from feces have revolutionized our understanding of the evolution, population structure, phylogeography, and behavior of nonmodel organisms. PCR amplification, however, is effective only for short sequences of DNA. The ability to generate cost-effective genomic-scale data of animals from feces using massively parallel sequencing would therefore constitute an important methodological advance towards bringing a greater number of wild organism studies into the genomic age.

Feces presents significant challenges for genetic analysis. DNA in feces is often fragmented and low in quantity. Fecal DNA extractions are further characterized by a frequent presence of co-extracted PCR inhibitors, sometimes complicating PCR detection of genotypes1, particularly with long amplicons. Finally, endogenous (host) DNA in feces constitutes a very low proportion, typically less than 5%4,5,6, of total fecal DNA. Instead, fecal DNA contains a preponderance of DNA from exogenous (non-host) sources such as gut microbes, digesta, intestinal parasites, and environmental organisms. Gut bacteria pose a particular challenge as they account for the highest proportion of DNA in feces4,5.

Because of the high representation of exogenous DNA in feces, shotgun sequencing of fecal DNA would yield only a small proportion of reads matching the host genome. For genomic studies of host organisms, particularly those targeting populations, this represents a crippling obstacle in the presence of typical financial constraints. Without an effective enrichment procedure, sequencing of fecal DNA would be less efficient than that of invasively obtained “high-quality” DNA by at least one order of magnitude regardless of improvements in sequencing throughput or cost.

Attempts to enrich host DNA from feces for genomic analysis5,6 have thus far employed targeted sequence capture methodologies. Sequence capture, like PCR, enriches DNA based on sequence specificity but unlike traditional PCR can work at any scale from a single locus7 to a whole genome6,8,9. This method involves hybridizing DNA or RNA “baits,” either affixed to an array10,11 or to magnetic beads in solution12, to a mixture of target and nontarget sequences, thereby capturing targeted DNA from the mixture. Sequence capture has been used for instance to enrich human exomes13, reduced-representation genomes14,15,16, host DNA from ancient or museum specimens9,17,18,19, and pathogen genomes from human clinical samples8. While the cost of custom oligonucleotide bait synthesis remains high, methods for transcribing custom baits from existing DNA templates8,9 have driven costs significantly down, increasing sequence capture’s appeal.

Perry et al.5 first successfully enriched host DNA from feces at the genomic scale. Using a modified sequence capture employing custom-synthesized baits, they were able to highly enrich 1.5 megabases of chromosome 21, the X chromosome, and the mitochondrial genome from fecal samples of 6 captive chimpanzees. Their protocol, however, remains prohibitively expensive for population-level analysis due to the high cost of bait synthesis. More recently, Snyder-Mackler et al.6 performed whole-genome capture on fecal DNA, using RNA baits transcribed in vitro from high quality baboon samples to enrich host genomes from 62 wild baboons. Resulting libraries were sequenced to low coverage (mean 0.49×), but nevertheless provided sufficient information for reconstructing pedigree relationships.

Despite these methodological advances, targeted sequence capture has distinct drawbacks. To avoid the high cost of bait synthesis, RNA baits must first be transcribed from high-quality genomic DNA that is consumed by the process, limiting its appeal when working with species for which high-quality DNA is difficult to obtain or in short supply. The processes of both bait generation and hybridization with fecal DNA are labor-intensive and time-consuming, with the hybridization including an incubation step that alone takes 1–3 days6. Because both RNA baits and the gDNA used to transcribe them are eventually depleted, the composition of RNA baits varies between bait sets, potentially impeding comparison of samples sequenced using different RNA baits and gDNA templates. Transgenomic captures (i.e., capturing DNA using baits from a different species) may complicate enrichment and introduce at least some capture biases20, which will be a particular impediment for genomic studies for which high-quality DNA from related taxa is not accessible. Sequence capture may also introduce biases toward the capture of low-complexity, highly repetitive genomic regions, as well as an excess of fragments from the mitochondrial genome6,9,21.

We have developed a method that makes noninvasive population genomics economically and practically feasible for the first time, by exploiting natural, evolutionarily ancient differences in CpG-methylation densities between vertebrate and bacterial genomes to enrich the host genome from feces. This method, which we call FecalSeq, uses methyl-CpG-binding domain (MBD) proteins to selectively bind and isolate DNA with high CpG-methylation density. Modified after techniques to enrich the microbiome from vertebrate samples22, our method employs a bait protein created by genetically fusing the human methyl-CpG binding domain protein 2 (MBD2) to the Fc tail of human IgG1. The resulting MBD2-Fc protein is then bound by a paramagnetic Protein A immunoprecipitation bead to create a complex that selectively binds double-stranded DNA with 5-methyl CpG dinucleotides. Because vertebrate DNA contains a high frequency of methylated CpGs23,24 while bacterial DNA does not25,26, this MBD bait complex selectively binds host DNA (Fig. 1). This enrichment method is inexpensive and, crucially, captures target DNA without modification, thereby enabling downstream library preparation techniques including complexity reduction-based sequencing methods such as RADseq, which we validate in this study by preparing and sequencing double-digest RADseq libraries27. Because of these properties, our method is well-suited for population genomic studies requiring high sequencing coverage, including those of nonmodel organisms for which few resources (e.g., high-quality samples or reference genomes) exist.

## Results

Our enrichment approach captures eukaryotic DNA using a methylated CpG binding domain protein fused to the Fc fragment of human IgG (MBD2-Fc) to selectively target sequences with high CpG methylation density22.

To evaluate our approach, we enriched DNA extractions from the feces of 6 captive and 46 wild baboons, which we then used to prepare and sequence ddRADseq libraries. We also prepared ddRADseq libraries from blood-derived genomic DNA of all six captive baboons to facilitate controlled (same-individual) comparisons of blood and fecal libraries. All libraries were sequenced using Illumina sequencing.

Quantitative PCR estimates of starting host DNA proportions in fecal DNA extracts ranged widely, but were substantially lower in samples obtained from the wild (captive samples: mean 5.3%, range <0.01–17.4%; wild samples: mean 0.6%, range <0.01–4.9%; Supplemental Tables S1–S2).

Based on two pilot libraries constructed from MBD-enriched fecal DNA, we found that there was large variation in the proportion of reads mapping to the baboon reference genome (mean 24.8%, range 0.7–81.2%; Supplemental Fig. S1; Supplemental Table S3), with the read mapping proportion correlating with starting host DNA proportions as estimated via qPCR (library A: r² = 0.7338, p = 0.03; library B: r² = 0.9127, p < 0.01). Endogenous DNA proportions on average increased 13-fold as estimated via comparison of pre-enrichment host proportion (from qPCR) and post-enrichment proportion of reads mapped (range 4.4–29.6; two samples removed due to starting proportions too low to quantify).
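The fold increase described above is the ratio of the post-enrichment proportion of mapped reads to the pre-enrichment host proportion from qPCR. The following is a minimal sketch of that calculation; the function name and the sample values are hypothetical and are not taken from the study's data or code.

```python
def fold_enrichment(pre_host_fraction, post_mapped_fraction):
    """Fold increase in host DNA proportion after enrichment.

    pre_host_fraction: host proportion before enrichment (e.g., from qPCR), 0 to 1.
    post_mapped_fraction: proportion of reads mapping to the host genome, 0 to 1.
    Returns None when the starting proportion is too low to quantify (zero or less),
    mirroring the exclusion of such samples in the text.
    """
    if pre_host_fraction <= 0:
        return None
    return post_mapped_fraction / pre_host_fraction


# Hypothetical samples: name -> (pre-enrichment fraction, post-enrichment fraction)
samples = {
    "sample_1": (0.020, 0.30),
    "sample_2": (0.005, 0.10),
    "sample_3": (0.0, 0.01),  # starting proportion too low to quantify
}

folds = {name: fold_enrichment(pre, post) for name, (pre, post) in samples.items()}
quantifiable = [f for f in folds.values() if f is not None]
mean_fold = sum(quantifiable) / len(quantifiable)
print(folds, mean_fold)
```

Samples whose starting proportion cannot be quantified are dropped before averaging, since the ratio is undefined for them.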

While some samples in our pilot libraries had high host DNA proportions following enrichment, these samples tended to already have high host DNA proportions prior to enrichment. Host DNA proportions following enrichment in the pilot libraries averaged only 4%, for instance, when samples with starting host DNA proportions greater than 1% were excluded. Because wild fecal DNA samples in our dataset on average started with less than 1% host DNA, we undertook a series of protocol optimization experiments to maximize the enrichment of these “low-quality” samples (Supplemental Tables S4–S7).

Using a revised protocol based on our optimization experiments (Supplemental Protocol), we created and sequenced a third library from MBD-enriched fecal DNA. After noting substantial improvements in enrichment, we finally sequenced a fourth library with MBD-enriched fecal DNA from 40 wild baboons.

Despite having similar or even lower starting host DNA proportions, read mapping proportions in the third library were substantially higher than the prior two (mean 49.1%, range 8.9–75.3%; Fig. S3; Supplemental Table S3). Endogenous DNA proportions on average increased 318-fold (range 4.3–2632.2; one sample removed due to starting proportion too low to quantify).

The fourth library consisting entirely of fecal DNA from wild animals had the lowest starting concentrations of host DNA (mean 0.3%, range < 0.01–3.1%). Following enrichment, however, host DNA proportions were nonetheless higher than our pilot libraries (mean 28.8%, range 1.5–73.6%; Supplemental Fig. S1; Supplemental Table S3). Endogenous DNA proportions on average increased 195-fold (range 23.7–486.9).

Overall, the revised protocol produced substantially higher enrichment, measured as fold increases in the proportion of host DNA, particularly for samples with very low starting proportions of host DNA (Fig. 2). While we sometimes were forced to use multiple rounds of extraction, thereby introducing variation in starting host proportions across same-individual trials, the revised protocol nonetheless exhibited robust improvement in read mapping proportions even when starting host proportions were substantially lower.

The distribution of blood- and fecal-derived reads did not differ significantly in the length of RADtags, the GC percentage, or the local CpG density, defined as the number of CpG sites in a region ± 5,000 bp from the boundaries of a RADtag (Wilcoxon rank sum tests, p > 0.99 for all three tests; Supplemental Fig. S2).
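Local CpG density as defined above can be computed by counting CpG dinucleotides in a window around each RADtag. The sketch below assumes 0-based, half-open coordinates, a window that includes the RADtag itself, and clipping at the sequence ends; these conventions and the function name are assumptions, since the text does not specify them.

```python
def local_cpg_density(chrom_seq, tag_start, tag_end, flank=5000):
    """Count CpG sites in the window [tag_start - flank, tag_end + flank).

    chrom_seq: chromosome sequence as a string.
    tag_start, tag_end: RADtag boundaries (0-based, half-open).
    flank: number of bases added on each side (5,000 bp in the text).
    The window is clipped to the bounds of the sequence.
    """
    lo = max(0, tag_start - flank)
    hi = min(len(chrom_seq), tag_end + flank)
    window = chrom_seq[lo:hi].upper()
    # "CG" cannot overlap itself, so str.count gives the exact number of sites.
    return window.count("CG")


# Toy sequence with a small flank so the window is easy to inspect by eye.
toy = "ACGTCGGGCGTTACG"
print(local_cpg_density(toy, tag_start=6, tag_end=9, flank=3))
print(local_cpg_density(toy, tag_start=6, tag_end=9))  # default flank covers the whole toy sequence
```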

MBD binding may in principle select for genomic regions with relatively high CpG-methylation density, leading to dropout of other loci. Assessment of the concordance between blood- and feces-derived reads from the same individual was complicated by the correlation in ddRADseq between total reads and expected RADtags recovered and thereby SNPs discovered: a given RADtag is sequenced at a frequency inversely proportional to the deviation of its length from the mean of the size selection. Thus, we had to discern between dropout due to coverage-related stochasticity inherent in ddRADseq27 and that due to MBD enrichment. To perform this comparison, we computed the proportion of unique alleles between blood- and feces-derived RADtags from the same individual. For this test, we controlled for variation in sequencing coverage by randomly sampling reads as necessary in order to equalize total coverage among same-individual samples. Allelic dropout due to MBD enrichment would result in a higher proportion of alleles unique to blood-derived libraries relative to feces-derived libraries. We did not find a significant discrepancy (multi-sample-called SNPs: mean proportion unique alleles in blood = 2.3%, mean proportion unique alleles in feces = 2.3%; Wilcoxon signed rank test, p = 0.97; Fig. 3A).
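The coverage-controlled comparison above has two steps: randomly subsample reads so that paired samples have equal depth, then compute the proportion of alleles seen in only one of the two samples. The sketch below illustrates both steps under stated assumptions: alleles are represented as hypothetical `(locus, base)` tuples, and the proportion is taken relative to all alleles observed in either sample, a denominator the text does not specify.

```python
import random


def downsample(reads, n, seed=0):
    """Randomly sample n reads without replacement (all reads if n >= len(reads))."""
    reads = list(reads)
    if n >= len(reads):
        return reads
    return random.Random(seed).sample(reads, n)


def equalize_coverage(reads_a, reads_b, seed=0):
    """Subsample the deeper read set so both sets have the same number of reads."""
    target = min(len(reads_a), len(reads_b))
    return downsample(reads_a, target, seed), downsample(reads_b, target, seed)


def proportion_unique(alleles_a, alleles_b):
    """Fraction of all observed alleles that occur only in alleles_a."""
    a, b = set(alleles_a), set(alleles_b)
    union = a | b
    if not union:
        return 0.0
    return len(a - b) / len(union)


# Hypothetical allele calls for one individual, keyed by (locus, base).
blood = {("chr1:100", "A"), ("chr1:100", "G"), ("chr2:50", "T")}
feces = {("chr1:100", "A"), ("chr2:50", "T"), ("chr2:50", "C")}
print(proportion_unique(blood, feces), proportion_unique(feces, blood))
```

Dropout caused by enrichment would appear as a consistently higher value for blood than for feces across individuals, which is the contrast the signed rank test evaluates.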

Dropout of entire RADtags is easily detectable given a reference genome or sufficient samples for comparison; dropout of a single allele at heterozygous sites is a more insidious potential bias. Allelic dropout due to MBD enrichment would result in a decrease in heterozygosity in MBD-enriched fecal libraries. Inbreeding coefficients (F) computed from same-individual RADtags exhibited in some cases higher values for feces-derived samples (Fig. 3B). This difference, however, was not statistically significant (mean F in blood = 0.63; mean F in feces = 0.71; Wilcoxon signed rank test, p = 0.47), indicating low allelic dropout attributable to the MBD enrichment. For this test, we also controlled for variation in sequencing coverage as described above.
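To illustrate why allelic dropout would inflate F, the sketch below uses a common method-of-moments estimator, F = (O − E) / (N − E), where O is the observed number of homozygous sites, E the number expected from allele frequencies, and N the number of genotyped sites. Whether the study used exactly this estimator is an assumption; the genotypes and allele frequencies are hypothetical, and no small-sample correction is applied.

```python
def inbreeding_f(genotypes, alt_freqs):
    """Method-of-moments inbreeding coefficient for one individual.

    genotypes: per-site alternate-allele counts (0, 1, or 2); None marks missing.
    alt_freqs: per-site population frequency of the alternate allele.
    """
    observed_hom = 0
    expected_hom = 0.0
    n_sites = 0
    for g, p in zip(genotypes, alt_freqs):
        if g is None:
            continue
        n_sites += 1
        if g != 1:
            observed_hom += 1
        # Hardy-Weinberg expected homozygosity at this site: 1 - 2pq.
        expected_hom += 1.0 - 2.0 * p * (1.0 - p)
    if n_sites - expected_hom == 0:
        raise ValueError("F is undefined when no heterozygosity is expected")
    return (observed_hom - expected_hom) / (n_sites - expected_hom)


freqs = [0.5, 0.5, 0.5, 0.5]
blood = [0, 2, 1, 0]   # one heterozygous site
feces = [0, 2, 0, 0]   # same individual, with the heterozygote called as homozygous
print(inbreeding_f(blood, freqs))  # 0.5
print(inbreeding_f(feces, freqs))  # 1.0
```

Losing one allele at a heterozygous site turns it into an apparent homozygote, which raises O and therefore F.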

As investigations of population structure are one potential application of our method, we visualized the wild and captive baboons’ identity-by-state via multidimensional scaling (MDS) using PLINK28,29, and confirmed that individuals clustered by their known species or ancestry and that blood- and feces-derived reads from the same individual were close together in the MDS space (Fig. 3C). The results of this “sanity check” are unsurprising, as variance in samples encapsulated by the first MDS components is expected to reflect population and species membership.
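The input to an MDS analysis of this kind is a matrix of pairwise identity-by-state distances. The sketch below shows one common definition, one minus the proportion of alleles shared across jointly genotyped sites; it is an illustration with hypothetical genotypes and is not presented as PLINK's exact implementation.

```python
def ibs_distance(g1, g2):
    """1 - proportion of alleles shared identical-by-state between two samples.

    g1, g2: per-site alternate-allele counts (0, 1, or 2); None marks missing.
    Sites missing in either sample are skipped.
    """
    shared = 0
    compared = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        shared += 2 - abs(a - b)  # 2, 1, or 0 alleles shared at this site
        compared += 2
    if compared == 0:
        raise ValueError("no sites genotyped in both samples")
    return 1.0 - shared / compared


def distance_matrix(samples):
    """Symmetric matrix of pairwise IBS distances for a dict of name -> genotypes."""
    names = list(samples)
    return names, [[ibs_distance(samples[i], samples[j]) for j in names] for i in names]


# Hypothetical genotypes: two samples from one individual and one unrelated sample.
samples = {
    "ind1_blood": [0, 1, 2, 1],
    "ind1_feces": [0, 1, 2, 2],
    "ind2_feces": [2, 1, 0, 0],
}
names, matrix = distance_matrix(samples)
print(names)
print(matrix)
```

A matrix like this would then be passed to a scaling routine; same-individual blood and fecal samples should be separated by small distances relative to samples from different individuals.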

Stringent filtration of SNP sets, as would be implemented in a standard population genetic study, reduced the apparent biases attributable to fecal enrichment, measured both as total SNPs with a significant association with sample type (unfiltered: 25,079 out of 591,726, or 4.2%; filtered: 13 out of 7,202, or 0.2%) as well as total SNPs with significant missingness assessed via a chi-square test (unfiltered: 69,753 out of 550,224, or 12.7%; filtered: 0 out of 5,602, or 0%). Though more work is needed to quantify more exactly the extent and causal factors that lead to missingness, many population genetic analyses are robust to the low level of dropout our analyses reveal in addition to that which is inherent in the RADseq family of techniques30.
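A per-SNP test of missingness against sample type can be framed as a chi-square test on a 2×2 table. The sketch below assumes a layout with rows for sample type (blood, feces) and columns for genotype status (called, missing), uses Pearson's statistic without continuity correction, and relies on the identity that the upper tail of a chi-square distribution with one degree of freedom equals erfc(√(x/2)). The table layout and counts are hypothetical; the study's exact procedure is not described in this section.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the table [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom and no continuity correction.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("a row or column total is zero; the test is undefined")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical SNP: called in 6 of 6 blood samples but only 1 of 6 fecal samples.
stat, p = chi_square_2x2(6, 0, 1, 5)
print(stat, p)

# Reported proportions of SNPs flagged for missingness, recomputed from the counts in the text.
print(round(100 * 69753 / 550224, 1), round(100 * 0 / 5602, 1))
```

With counts this small an exact test would usually be preferable; the chi-square form is shown because it is the test named in the text.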

## Discussion

Our methylation-based capture method achieves substantial enrichment of host DNA from fecal samples. Using our revised protocol developed through experimentation, we produced a mean 195-fold enrichment on our final library consisting entirely of fecal DNA obtained noninvasively under remote field conditions, with most samples nearly a decade old. A mean 28.8% of reads mapped to the baboon genome, despite starting with only a mean 0.34% of host DNA. Using fecal and blood DNA obtained from captive animals, we further demonstrate that feces-derived genotyping data following our method are concordant with corresponding data obtained from blood.
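Note that the mean fold enrichment (195) is not the ratio of the two reported means (28.8% / 0.34% ≈ 85). Fold enrichment is computed per sample and then averaged, so samples with very low starting proportions contribute large ratios. The sketch below demonstrates the distinction with hypothetical values that are not drawn from the study's data.

```python
from statistics import mean


def mean_of_ratios(pre, post):
    """Average of per-sample fold enrichments."""
    return mean(b / a for a, b in zip(pre, post))


def ratio_of_means(pre, post):
    """Fold change between the average post- and pre-enrichment proportions."""
    return mean(post) / mean(pre)


# Hypothetical host DNA fractions before and after enrichment for three samples.
pre = [0.0005, 0.002, 0.01]
post = [0.20, 0.30, 0.40]

print(mean_of_ratios(pre, post))   # dominated by the sample that started lowest
print(ratio_of_means(pre, post))   # considerably smaller for the same data
```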

Feces are among the most readily accessible sources of information on wild animals1, and are particularly useful for population-level studies or studies of endangered or elusive species for which obtaining high-quality samples is difficult or undesirable. By exploiting methylation differences rather than sequence differences between host and bacterial DNA, FecalSeq is an enrichment strategy that requires neither prior genome sequence knowledge nor the use of high-quality DNA for preparation of capture baits. This results in enrichment which is both inexpensive and replicable. The enrichment procedure is also relatively rapid and uncomplicated. Using a 96-well plate, we performed two sequential rounds of enrichment on all forty samples in our final library within a day (see Supplemental Protocol).

Relative to comparable experiments using high-quality DNA samples such as blood, our enrichment method adds extremely little cost. After excluding shared costs such as DNA extraction, library preparation, and sequencing, the major costs associated with our method are qPCR reagents for initial quality assessment of fecal DNA samples and enrichment reagents for capturing the host genome. qPCR reagents cost about $0.60 USD per reaction (or $1.20 per sample assuming samples are run in duplicate). For our enrichment protocol, the amount of reagents used will vary based on the starting proportion of host DNA in the sample (see Supplemental Protocol). Assuming fecal DNA samples on average contain 2.5% host DNA, a single enrichment kit will support a total of 240 enrichment reactions at $0.70 per sample. Based on our experience, most fecal DNA samples contain less than 2.5% host DNA and will therefore require fewer reagents, further lowering the cost per sample. Following enrichment, we purified DNA using homemade SPRI beads31, which add about $0.10 per sample.
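The per-sample figures above can be summed to estimate the added cost of a project. The sketch below does this in whole cents to avoid floating-point rounding; the constant and function names are hypothetical, and the enrichment figure assumes the 2.5% average host DNA scenario stated in the text.

```python
# All prices are in US cents, taken from the figures quoted in the text.
QPCR_CENTS_PER_REACTION = 60       # about $0.60 per qPCR reaction
QPCR_REPLICATES = 2                # samples run in duplicate
ENRICHMENT_CENTS_PER_SAMPLE = 70   # enrichment reagents at 2.5% average host DNA
SPRI_CENTS_PER_SAMPLE = 10         # homemade SPRI bead cleanup


def added_cost_cents(n_samples):
    """Added cost of enrichment for n_samples, excluding shared costs
    (DNA extraction, library preparation, sequencing)."""
    per_sample = (
        QPCR_CENTS_PER_REACTION * QPCR_REPLICATES
        + ENRICHMENT_CENTS_PER_SAMPLE
        + SPRI_CENTS_PER_SAMPLE
    )
    return per_sample * n_samples


def format_usd(cents):
    """Format an integer number of cents as a dollar string."""
    return f"${cents // 100}.{cents % 100:02d}"


print(format_usd(added_cost_cents(1)))    # $2.00
print(format_usd(added_cost_cents(40)))   # $80.00
```

Under these assumptions the method adds about $2.00 per sample, or about $80 for a 40-sample library like the final one described here.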

## Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

## About this article

### Cite this article

Chiou, K.L., Bergey, C.M. Methylation-based enrichment facilitates low-cost, noninvasive genomic scale sequencing of populations from feces. Sci Rep 8, 1975 (2018). https://doi.org/10.1038/s41598-018-20427-9
