Biogeography and individuality shape function in the human skin metagenome

Journal name: Nature
Volume: 514
Pages: 59–64
DOI: doi:10.1038/nature13786

Abstract

The varied topography of human skin offers a unique opportunity to study how the body’s microenvironments influence the functional and taxonomic composition of microbial communities. Phylogenetic marker gene-based studies have identified many bacteria and fungi that colonize distinct skin niches. Here metagenomic analyses of diverse body sites in healthy humans demonstrate that local biogeography and strong individuality define the skin microbiome. We developed a relational analysis of bacterial, fungal and viral communities, which showed not only site specificity but also individual signatures. We further identified strain-level variation of dominant species as heterogeneous and multiphyletic. Reference-free analyses captured the uncharacterized metagenome through the development of a multi-kingdom gene catalogue, which was used to uncover genetic signatures of species lacking reference genomes. This work is foundational for human disease studies investigating inter-kingdom interactions, metabolic changes and strain tracking, and defines the dual influence of biogeography and individuality on microbial composition and function.

At a glance

Figures

  1. Figure 1: Multi-kingdom relative abundances are strongly shaped by skin microenvironment.

    a, Boxplots of mean relative abundance of different kingdoms by site. Black lines indicate median; boxes first and third quartiles. Triangles indicate significance (adjusted P < 0.05, Kruskal–Wallis post-hoc test) for over- (up) or under- (down) representation in a majority of pairwise comparisons between sites. Hp (hypothenar palm), Vf (volar forearm), Ac (antecubital crease), Ic (inguinal crease), Id (interdigital web space), N (nares), Pc (popliteal crease), Ph (plantar heel), Tw (toeweb space), Al (alar crease), Ba (back), Ch (cheek), Ea (external auditory canal), Gb (glabella), Mb (manubrium), Oc (occiput), Ra (retroauricular crease), Tn (toenail). b, Kingdoms in HMP body sites. c, Consensus relative abundance plots of major skin taxa by microenvironment. C., Corynebacterium; P., Propionibacterium; S., Staphylococcus. d, Communities cluster primarily by microenvironment with sebaceous regions most distinct in principal components (PC) analysis.

  2. Figure 2: Individual-specific signatures are typically low abundance but shared across most sites.

    Left, variable importance plot of most discriminatory taxa from random forests analysis. For each individual, centre, proportion of the 18 sites in which each taxon is present, and right, mean relative abundance of that taxon across sites.

  3. Figure 3: Propionibacterium acnes and Staphylococcus epidermidis are heterogeneous and multiphyletic at the strain level.

    a, b, Reference genomes used for P. acnes (a) and S. epidermidis (b). Leftmost bar shows subtypes (phylogenetically similar genomes) as colour groups. Adjacent heat map shows mean relative abundance by skin microenvironment. D, dry; M, moist; S, sebaceous; T, toenail. c, d, Select relative abundance plots; strain colours as in a, b. e, f, P. acnes subtypes differ more significantly between individuals than between skin microenvironments, with the converse observed for S. epidermidis. Boxplots of Yue–Clayton theta indices show similarity between (‘inter’) or within (‘intra’) individuals/microenvironments (θ = 1 means identical). Black lines indicate median; boxes show first and third quartiles. P value, Wilcoxon rank-sum test. g, h, Bar charts show P. acnes and S. epidermidis subtypes that differ by microenvironment or individual. Length of bar represents the fraction of post-hoc tests significant for each comparison; 105 comparisons for individual; 6 for microenvironment. *P < 0.05, adjusted Kruskal–Wallis test.

  4. Figure 4: Functional capacity varies by microenvironment.

    a, Shannon diversity of functional pathways and taxonomy by site; P value, Kruskal–Wallis test between microenvironments. Error bars, standard error of the mean. b, Microenvironments possess different core modules; ‘core’ means occurrence in more than 2/3 of samples. Error bars show variation within a class of modules (full version in Extended Data) that may arise from a unique specialization for that microenvironment. c, PCA shows clustering by microenvironment, with strong separation of sebaceous, dry and toenail modules. Heat maps: left, loadings for the first two PCs; right, mean relative abundances for modules with the greatest variation by microenvironment. d, A module’s taxonomic origin can be imputed by Spearman correlation (ρ; adjusted P ≤ 2 × 10^−16) with P. acnes and M. restricta relative abundances. e, Presence of select antibiotic resistance gene families by individual and site.

  5. Figure 5: Reconstruction of metagenomic dark matter with reference-free methods.

    a, Per-sample iterative assembly with variable k-mers (nucleotide words of length k) optimizes assembly quality as assessed by metrics such as % reads mapping back to assembly (left) and the number of bases incorporated (right). Colours are as in c. b, Skin gene catalogue was mapped to the NCBI non-redundant (nr) database and KEGG to identify kingdom and functional category. Density plot compares length of genes with and without homology; gene length was typically larger for unmapped genes. c, Metagenomic clusters represent genes that covary in abundance across samples within a microenvironment; boxplots show cluster sizes; histograms show number of clusters (log10 scale). d, A lowest common ancestor (LCA) was assigned to a cluster with >50% consensus taxonomy. Bar length indicates the total number of ‘genes’ in a cluster; black represents the number of genes mapping to the LCA. Grey represents ambiguous or unannotated genes. ‘Characterized’ indicates that a reference genome exists for that species; for e, ‘Uncharacterized genomes’, no reference exists. Seb, sebaceous; tn, toenail.

  6. Extended Data Fig. 1: The 18 selected skin sites and their location on the human body.

    These sites represent three microenvironments: sebaceous (blue), dry (red), and moist (green). Toenail (black) is a site that does not fall under these major microenvironments and is treated separately. Pie charts represent consensus relative abundance of the kingdoms Bacteria, Eukaryota (Fungi), and virus from multi-kingdom mapping.

  7. Extended Data Fig. 2: Per-sample read statistics.

    Additional samples (bacterial and eukaryotic mock communities) are shown. a, Boxplots (line indicates median; boxes represent first and third quartiles) show, for each site, % reads mapping to human hg19 that are discarded before analysis. Sites are coloured by site characteristic. b, Samples are ordered by label. Lines indicate the median value for that statistic; value is in parentheses. c, Estimate of sequencing coverage. Reads seen is the number of reads sampled from a sample. Reads are then split into 20-mers, compared to a k-mer coverage table and kept only if the median k-mer coverage is below 20×. Curves are grouped by site, coloured by individual as indicated.

  8. Extended Data Fig. 3: Validation of taxonomic classifications.

    a, Bacterial sample community diversity as a function of genome coverage for two diversity metrics, the Shannon index that measures the richness and evenness of the community (left), and number of species observed (right). Genome coverage is defined, for each genome hit, as the % of the genome covered by reads. Boxplots show the range of diversity values for all samples, segregated by microenvironment. Black lines indicate median; boxes represent first and third quartiles. As coverage cut-offs increase, diversity estimates drop sharply. b, Comparisons of bacterial community diversity for Metaphlan-derived classifications versus custom bacterial Pathoscope-derived classifications. Each point represents a different sample, coloured by microenvironment. With no coverage cut-offs (left), Pathoscope may overestimate diversity, which is reduced by setting a minimum 1× coverage requirement. Spearman correlation (ρ) and corresponding P values are shown. Pathoscope-derived relative abundances versus relative abundances derived from c, 16S amplicon sequencing, d, Metaphlan genus-level, e, Metaphlan species-level (ρ and P value are calculated for non-zero abundance taxa), f, Metaphlan, staphylococcal species, g, ITS1 amplicon sequencing, genus (ρ and P value are calculated for non-zero abundance taxa), and h, ITS1 amplicon sequencing, Malassezia species.

  9. Extended Data Fig. 4: Full taxonomic classifications for all healthy volunteers (HV), all sites.

    To aid visualization of site- and individual-specific similarities, samples are grouped by site/microenvironment for each individual. Relative abundances of the most abundant skin taxa for each super-kingdom are shown. b, Taxonomic re-classification of major sites sampled by the Human Microbiome Project. Samples are from the anterior nares and retroauricular crease (skin), tongue dorsum and supragingival plaque (oral), stool, and posterior fornix (vaginal). Relative abundances of the most abundant taxa for each kingdom in the skin, for comparison, are shown.

  10. Extended Data Fig. 5: Strain-level classification based on reference genomes shows sub-species heterogeneity for dominant skin taxa.

    a, Simulations to assess sensitivity of Pathoscope-based mapping to SNPs, non-core regions, or whole genomes. Synthetic communities were created with 6, 12, or 18 genomes per community. Sizes of circles reflect the number of reads sampled from each genome, for example, 50,000, 100,000, or 500,000 reads per genome. 15 random synthetic communities for each genome group were created and mapped to SNPs, non-core regions, or the full genome set. Sensitivity is calculated from the expected versus the observed abundances. b, Full strain-level assignments for samples with relative abundances of the most closely related Propionibacterium acnes strains, by individual. c, Dendrograms of strain similarity. Trees were generated using core SNPs; genomes were aligned with nucmer to identify core regions, and then SNPs within these core regions were identified by calculating all pairwise differences between genomes. Colour bar indicates delineations of subtypes, where phylogenetically more similar genomes are in similar colours; for example, we defined 12 subtypes for P. acnes.

  11. Extended Data Fig. 6: Strain-level classification for Staphylococcus epidermidis.

    a, Full strain-level assignments for samples by microenvironment. b, Description is as in Extended Data Fig. 5c. We defined 14 subtypes for S. epidermidis.

  12. Extended Data Fig. 7: Full version of coreness of different module categories across skin microenvironment.

    A module is defined as core if occurring in >2/3 of samples for that class. Major KEGG module descriptors are shown in the different colours. Height of bars reflects the proportion of samples that a module occurs in.

  13. Extended Data Fig. 8: Correlation analysis of module abundance with species abundance to infer a module’s taxonomic origin.

    Spearman correlation (ρ) was calculated with corresponding P value for taxa with relative abundance >0.5% and modules with greater than 0.05% relative abundance. Coryn., Corynebacterium. a, Unsupervised clustering of correlation coefficients. Species from the same genera clustering together may suggest a shared contribution of a pathway. b, Most significantly correlated taxa; colours represent broad KEGG classes. Adjusted P < 2 × 10^−16.

  14. Extended Data Fig. 9: Antibiotic resistance profiles in the skin.

    Reads were mapped to a short marker database consensus created from the ARDB database, which catalogues publicly available resistance genes. Genes are grouped into broad resistance classes; a resistance category is called present (black; absent = white) if at least one gene from its family is present.

  15. Extended Data Fig. 10: Reference-free analysis of skin metagenome with adaptive iterative assembly, gene catalogue, and metagenomic clusters.

    a, Tracking unclassified reads. Fraction unmapped reads refers to the fraction of total reads passing quality control that do not map to the major super kingdoms Archaea, Bacteria, Eukaryota, and viruses. Samples are ordered by label and are divided by site. b, Assembly, gene-calling, and clustering workflow. c, Assembly efficacy varies significantly by k-mer depending on the site’s unique features of community complexity and sequencing depth, which is most affected by that site’s human DNA admixture. Assembly statistics are shown for samples pooled by individual, which produced higher quality assemblies than pooling by site. Because of large pool size, khmer digital normalization was used before Velvet assembly. % overall alignment rate indicates the total % of reads that map back to that sample’s assembly for each k-mer. % paired concordant indicates the fraction of paired reads (of overall, not of % paired) in which both mates of a pair map back to an assembly; discordant is where one mate of a pair does not map, or maps to a different contig. Contigs are then assessed by the maximum assembly size, the number of bases that are used in the assembly, and the number of contigs above a threshold of 300 bp. d, Effect of khmer digital normalization on individual sample assembly. Digital normalization + Velvet assembly performs similarly to Velvet assembly alone.
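
The Shannon index (Figure 4a, Extended Data Fig. 3a) and the Yue–Clayton theta index (Figure 3e, f) named in the legends above can be sketched as follows. This is a minimal illustration rather than the authors' workflow: the function names are hypothetical, the inputs are assumed to be abundance vectors aligned by taxon, and the natural logarithm is an assumed choice of base.

```python
import math


def shannon_index(abundances):
    """Shannon index H = -sum(p * ln p) over taxa with non-zero abundance."""
    total = sum(abundances)
    proportions = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in proportions)


def yue_clayton_theta(sample_a, sample_b):
    """Yue-Clayton similarity: sum(p*q) / (sum(p^2) + sum(q^2) - sum(p*q)).

    Both inputs list abundances for the same taxa in the same order
    (an assumption of this sketch). A value of 1 means identical
    communities; 0 means no shared taxa.
    """
    total_a, total_b = sum(sample_a), sum(sample_b)
    p = [x / total_a for x in sample_a]
    q = [x / total_b for x in sample_b]
    cross = sum(pi * qi for pi, qi in zip(p, q))
    return cross / (sum(pi * pi for pi in p) + sum(qi * qi for qi in q) - cross)
```

For example, `yue_clayton_theta([5, 5], [10, 0])` compares an even two-taxon community with one dominated by a single taxon and returns 0.5.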
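
The read-filtering step described for Extended Data Fig. 2c (split each read into k-mers and keep the read only while its median k-mer coverage is below a cut-off) can be sketched as below. The sketch assumes exact in-memory counts held in a Python `Counter`, not the khmer software named in Extended Data Fig. 10; the function names and the toy reads in the usage note are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return all overlapping words of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize_by_median(reads, k=20, cutoff=20):
    """Keep a read only if the median count of its k-mers, among reads
    kept so far, is below `cutoff`; then add its k-mers to the table."""
    counts = Counter()
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k: nothing to count
        if median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)
    return kept
```

With five copies of the toy read `"ACGTTGCA"` followed by `"GGGGCCCC"`, `k=4` and `cutoff=3`, three copies of the first read are kept before its k-mers reach the cut-off, and the second read is kept because its k-mers are new.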
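
The rule in Figure 5d, where a lowest common ancestor (LCA) is assigned to a cluster with >50% consensus taxonomy, can be illustrated with a short sketch. Several details here are assumptions, not taken from the legend: each gene's taxonomy is a tuple ordered from kingdom downwards, unannotated genes are empty tuples, and the fraction is computed over all genes in the cluster. The function name is hypothetical.

```python
from collections import Counter


def consensus_lca(lineages, min_fraction=0.5):
    """Return the deepest lineage prefix shared by more than
    `min_fraction` of the genes in a cluster, or () if none qualifies."""
    n_genes = len(lineages)
    best = ()
    depth = 1
    while True:
        # Count lineage prefixes of the current depth among annotated genes.
        prefixes = Counter(
            tuple(lineage[:depth]) for lineage in lineages if len(lineage) >= depth
        )
        if not prefixes:
            break
        prefix, count = prefixes.most_common(1)[0]
        if count / n_genes <= min_fraction:
            break
        best = prefix
        depth += 1
    return best


cluster = [
    ("Bacteria", "Actinobacteria", "Propionibacterium"),
    ("Bacteria", "Actinobacteria", "Propionibacterium"),
    ("Bacteria", "Actinobacteria", "Corynebacterium"),
    ("Bacteria", "Firmicutes", "Staphylococcus"),
    (),  # unannotated gene
]
lca = consensus_lca(cluster)
```

In this toy cluster, three of five genes agree down to the second rank but only two agree at the third, so the consensus stops at `("Bacteria", "Actinobacteria")`.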

Accession codes

Primary accessions

BioProject

References

  1. Grice, E. A. & Segre, J. A. The skin microbiome. Nature Rev. Microbiol. 9, 244–253 (2011)
  2. Grice, E. A. et al. Topographical and temporal diversity of the human skin microbiome. Science 324, 1190–1192 (2009)
  3. Costello, E. K. et al. Bacterial community variation in human body habitats across space and time. Science 326, 1694–1697 (2009)
  4. Findley, K. et al. Topographic diversity of fungal and bacterial communities in human skin. Nature 498, 367–370 (2013)
  5. Peters, B. M. & Noverr, M. C. Candida albicans-Staphylococcus aureus polymicrobial peritonitis modulates host innate immunity. Infect. Immun. 81, 2178–2189 (2013)
  6. Arumugam, M. et al. Enterotypes of the human gut microbiome. Nature 473, 174–180 (2011)
  7. De Vlaminck, I. et al. Temporal response of the human virome to immunosuppression and antiviral therapy. Cell 155, 1178–1187 (2013)
  8. Handley, S. A. et al. Pathogenic simian immunodeficiency virus infection is associated with expansion of the enteric virome. Cell 151, 253–266 (2012)
  9. Qin, J. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55–60 (2012)
  10. Karlsson, F. H. et al. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature 498, 99–103 (2013)
  11. Grice, E. A. et al. A diversity profile of the human skin microbiota. Genome Res. (2008)
  12. The Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214 (2012)
  13. Tettelin, H. et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implications for the microbial ‘pan-genome’. Proc. Natl Acad. Sci. USA 102, 13950–13955 (2005)
  14. von Eiff, C., Becker, K., Machka, K., Stammer, H. & Peters, G. Nasal carriage as a source of Staphylococcus aureus bacteremia. N. Engl. J. Med. 344, 11–16 (2001)
  15. Oh, J. et al. The altered landscape of the human skin microbiome in patients with primary immunodeficiencies. Genome Res. (2013)
  16. Sommer, M. O. A., Dantas, G. & Church, G. M. Functional characterization of the antibiotic resistance reservoir in the human microflora. Science 325, 1128–1131 (2009)
  17. Forsberg, K. J. et al. The shared antibiotic resistome of soil bacteria and human pathogens. Science 337, 1107–1111 (2012)
  18. Kong, H. H. et al. Temporal shifts in the skin microbiome associated with disease flares and treatment in children with atopic dermatitis. Genome Res. 22, 850–859 (2012)
  19. Jumpstart Consortium Human Microbiome Project Data Generation Working Group. Evaluation of 16S rDNA-based community profiling for human microbiome research. PLoS ONE 7, e39315 (2012)
  20. Howe, A. C. et al. Tackling soil diversity with the assembly of large, complex metagenomes. Proc. Natl Acad. Sci. USA 111, 4904–4909 (2014)
  21. Schloss, P. D. et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol. 75, 7537–7541 (2009)
  22. Matsen, F. A., Kodner, R. B. & Armbrust, E. V. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics 11, 538 (2010)
  23. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nature Methods 9, 357–359 (2012)
  24. Francis, O. E. et al. Pathoscope: Species identification and strain attribution with unassembled sequencing data. Genome Res. 23, 1721–1729 (2013)
  25. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010)
  26. Segata, N. et al. Metagenomic microbial community profiling using unique clade-specific marker genes. Nature Methods 9, 811–814 (2012)
  27. Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27–30 (2000)
  28. Abubucker, S. et al. Metabolic reconstruction for metagenomic data and its application to the human microbiome. PLOS Comput. Biol. 8, e1002358 (2012)
  29. Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010)
  30. Liu, B. & Pop, M. ARDB—Antibiotic Resistance Genes Database. Nucleic Acids Res. 37, D443–D447 (2009)
  31. Schloissnig, S. et al. Genomic variation landscape of the human gut microbiome. Nature 493, 45–50 (2013)
  32. Guindon, S. et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59, 307–321 (2010)
  33. Delcher, A. L., Phillippy, A., Carlton, J. & Salzberg, S. L. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 30, 2478–2483 (2002)
  34. Namiki, T., Hachiya, T., Tanaka, H. & Sakakibara, Y. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 40, e155 (2012)
  35. Zhu, W., Lomsadze, A. & Borodovsky, M. Ab initio gene identification in metagenomic sequences. Nucleic Acids Res. 38, e132 (2010)
  36. Stanke, M., Schöffmann, O., Morgenstern, B. & Waack, S. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7, 62 (2006)
  37. van Dongen, S. & Abreu-Goodger, C. in Bacterial Molecular Networks (eds Helden, J., Toussaint, A. & Thieffry, D.) Methods in Molecular Biology Vol. 804, pp. 281–295 http://link.springer.com/protocol/10.1007/978-1-61779-361-5_15 (Springer, 2012)
  38. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300 (1995)
  39. Yue, J. C. & Clayton, M. K. A similarity measure based on species proportions. Comm. Stat. Theory Methods 34, 2123–2131 (2005)
  40. Liaw, A. & Wiener, M. Classification and regression by randomForest. R News 2, 18–22 (2002)

Author information

  1. These authors contributed equally to this work.

    • Heidi H. Kong &
    • Julia A. Segre

Affiliations

  1. Translational and Functional Genomics Branch, National Human Genome Research Institute, NIH, Bethesda, Maryland 20892, USA

    • Julia Oh,
    • Allyson L. Byrd,
    • Clay Deming,
    • Sean Conlan &
    • Julia A. Segre
  2. Dermatology Branch, Center for Cancer Research, National Cancer Institute, NIH, Bethesda, Maryland 20892, USA

    • Heidi H. Kong
  3. NIH Intramural Sequencing Center, National Human Genome Research Institute, Bethesda, Maryland 20852, USA.

    • Betty Barnabas,
    • Robert Blakesley,
    • Gerry Bouffard,
    • Shelise Brooks,
    • Holly Coleman,
    • Mila Dekhtyar,
    • Michael Gregory,
    • Xiaobin Guan,
    • Jyoti Gupta,
    • Joel Han,
    • Shi-ling Ho,
    • Richelle Legaspi,
    • Quino Maduro,
    • Cathy Masiello,
    • Baishali Maskeri,
    • Jenny McDowell,
    • Casandra Montemayor,
    • James Mullikin,
    • Morgan Park,
    • Nancy Riebow,
    • Karen Schandler,
    • Brian Schmidt,
    • Christina Sison,
    • Mal Stantripop,
    • James Thomas,
    • Pamela Thomas,
    • Meg Vemulapalli &
    • Alice Young

Consortia

  1. NISC Comparative Sequencing Program

    • Betty Barnabas,
    • Robert Blakesley,
    • Gerry Bouffard,
    • Shelise Brooks,
    • Holly Coleman,
    • Mila Dekhtyar,
    • Michael Gregory,
    • Xiaobin Guan,
    • Jyoti Gupta,
    • Joel Han,
    • Shi-ling Ho,
    • Richelle Legaspi,
    • Quino Maduro,
    • Cathy Masiello,
    • Baishali Maskeri,
    • Jenny McDowell,
    • Casandra Montemayor,
    • James Mullikin,
    • Morgan Park,
    • Nancy Riebow,
    • Karen Schandler,
    • Brian Schmidt,
    • Christina Sison,
    • Mal Stantripop,
    • James Thomas,
    • Pamela Thomas,
    • Meg Vemulapalli &
    • Alice Young

Contributions

J.O., H.H.K. and J.A.S. designed the study. H.H.K. collected patient samples. C.D. prepared the clinical samples for sequencing, which was carried out by the members of the NIH Intramural Sequencing Center Comparative Sequencing program. J.O., A.L.B. and S.C. analysed sequence data. J.O., H.H.K. and J.A.S. drafted the manuscript. All authors read and approved the final version of the manuscript.

Competing financial interests

The authors declare no competing financial interests.

Corresponding authors

Correspondence to:

Data deposition is with the SRA and all sequences can be accessed under BioProject 46333. Human subject clinical data are deposited with dbGaP phs000266. Analysis workflow is available at https://github.com/julia0h/skinmetagenome.git.

Supplementary information

Excel files

  1. Supplementary Data (17.9 MB)

    This file contains Supplementary Tables 1-18.
