Taxonomic signatures of cause-specific mortality risk in human gut microbiome

The collection of fecal material and developments in sequencing technologies have enabled standardised and non-invasive gut microbiome profiling. Microbiome composition from several large cohorts has been cross-sectionally linked to various lifestyle factors and diseases. In spite of these advances, prospective associations between microbiome composition and health have remained uncharacterised due to the lack of sufficiently large and representative population cohorts with comprehensive follow-up data. Here, we analyse the long-term association between gut microbiome variation and mortality in a well-phenotyped and representative population cohort from Finland (n = 7211). We report robust taxonomic and functional microbiome signatures related to the Enterobacteriaceae family that are associated with mortality risk during a 15-year follow-up. Our results extend previous cross-sectional studies, and help to establish the basis for examining long-term associations between human gut microbiome composition, incident outcomes, and general health status.


Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.
- The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
- A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
- The statistical test(s) used AND whether they are one- or two-sided. Only common tests should be described solely by name; describe more complex techniques in the Methods section.
- A description of all covariates tested
- A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
- A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)
- For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted. Give P values as exact values whenever suitable.
- For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
- For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
- Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated

Our web collection on statistics for biologists contains articles on many of the points above.

Software and code
Policy information about availability of computer code

Data collection
Provide a description of all commercial, open source and custom code used to collect the data in this study, specifying the version used OR state that no software was used.

Data analysis
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A list of figures that have associated raw data
- A description of any restrictions on data availability

The data used in this study are available from the THL Biobank upon submission of a research plan and signing a data transfer agreement (https://thl.fi/en/web/thl-biobank/for-researchers/application-process). The data are not openly available as they contain sensitive information from healthcare registers.

Corresponding author(s): Teemu Niiranen (teemu.niiranen@utu.fi), Leo Lahti (leo.lahti@utu.fi)
Last updated by author(s): Apr 29, 2020

nature research | reporting summary
October 2018

Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.

Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Life sciences study design
All studies must disclose on these points even when the disclosure is negative.

No sample size calculations were performed, as the primary objective of the study was assessment of cardiovascular health in the Finnish population. The sample size was mainly determined by the availability of resources.
Samples from 20 subjects were excluded because the total sample read count was low (< 50 000), yielding n=7211 available for unsupervised analysis. For survival analysis of mortality, a further 156 were excluded, as only 7055 participants had the full covariate information available (BMI, systolic blood pressure, smoking, antihypertensive medication use, diabetes status, use of antineoplastic or immunomodulating agents).
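The two exclusion steps above can be illustrated with a minimal sketch. It is not the authors' pipeline: the field names (`read_count`, `bmi`, and so on), the helper functions, and the toy records are all hypothetical, and only the 50 000-read threshold and the list of covariates come from the text.

```python
MIN_READS = 50_000  # threshold stated in the text

# Hypothetical column names for the covariates listed in the text.
COVARIATES = (
    "bmi",
    "systolic_bp",
    "smoking",
    "antihypertensive_use",
    "diabetes",
    "antineoplastic_or_immunomodulating_use",
)


def passes_read_depth(sample):
    """True when the total read count is not below the threshold."""
    return sample["read_count"] >= MIN_READS


def has_full_covariates(sample):
    """True when every covariate is present (not None)."""
    return all(sample.get(name) is not None for name in COVARIATES)


def apply_exclusions(samples):
    """Return (unsupervised set, survival set) after the two exclusion steps."""
    unsupervised = [s for s in samples if passes_read_depth(s)]
    survival = [s for s in unsupervised if has_full_covariates(s)]
    return unsupervised, survival


def make_sample(read_count, **overrides):
    """Build a toy record with all covariates filled in, then apply overrides."""
    sample = {name: 0 for name in COVARIATES}
    sample["read_count"] = read_count
    sample.update(overrides)
    return sample


toy = [
    make_sample(120_000),
    make_sample(49_999),            # dropped: read count below threshold
    make_sample(80_000, bmi=None),  # kept for unsupervised analysis only
]
unsupervised, survival = apply_exclusions(toy)
print(len(unsupervised), len(survival))  # 2 1
```

With the reported counts, the same two-stage filter corresponds to 7211 samples after the read-depth step and 7211 - 156 = 7055 after requiring complete covariates.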
After contacting several groups with large-scale microbiome data, we were unable to find cohorts with prospective outcome data similar to those in our study. For validation, we split the cohort into two subsamples according to geographic regions (Eastern Finns, n=4979 vs Western Finns, n=2184) with differing genetic backgrounds, lifestyles, and mortality rates. We then assessed the association between our main exposure (PC3) and mortality in both subsamples. As an additional analysis, we identified mortality-associated microbiome features in the Eastern population based on the Random Survival Forest model, and then tested their performance in the Western population.
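The discovery and validation logic described above can be sketched as a regional split followed by scoring a risk prediction in the held-out region. This is an illustration under stated assumptions, not the authors' method: the record fields (`region`, `time`, `event`, `risk`) and the toy values are hypothetical, and Harrell's concordance index is used here only as one common way to score survival predictions, since the text does not name the performance metric. Pairs with tied follow-up times are skipped for simplicity.

```python
def split_by_region(participants):
    """Split records into (east, west) lists using a hypothetical 'region' field."""
    east = [p for p in participants if p["region"] == "east"]
    west = [p for p in participants if p["region"] == "west"]
    return east, west


def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs in which the participant who
    died earlier was also assigned the higher predicted risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i had an event strictly before j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy records; 'risk' stands in for a score from a model fitted on the
# discovery (Eastern) subsample only.
toy = [
    {"region": "east", "time": 4.0, "event": 1, "risk": 0.7},
    {"region": "west", "time": 2.0, "event": 1, "risk": 0.9},
    {"region": "west", "time": 5.0, "event": 0, "risk": 0.2},
    {"region": "west", "time": 3.0, "event": 1, "risk": 0.5},
    {"region": "west", "time": 8.0, "event": 0, "risk": 0.6},
]
east, west = split_by_region(toy)
c_west = concordance_index(
    [p["time"] for p in west],
    [p["event"] for p in west],
    [p["risk"] for p in west],
)
print(round(c_west, 2))  # 0.8
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, so evaluating only on the validation region shows whether features selected in one subpopulation carry over to the other.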
Participants were not randomly allocated to experimental groups. The study was an observational study based on a stratified random sample of the Finnish population aged 25-74 years from several geographical areas. The dates and causes of deaths (the outcome under study, n=729) were obtained from the National Causes-of-Death register. The relevant covariates were adjusted for in our statistical analyses using data collected by a nurse during the baseline examination (BMI, systolic blood pressure), participants' self-reported answers to a questionnaire (smoking, antihypertensive medication use) and national registers when such information was available (diabetes from the Hospital Discharge Register and the Drug Reimbursement Register; use of antineoplastic or immunomodulating agents from the Drug Purchase Register).
The issue of blinding was not relevant, as the study had no experimental group allocation.