Compilation of longitudinal microbiota data and hospitalome from hematopoietic cell transplantation patients

The impact of the gut microbiota in human health is affected by several factors including its composition, drug administrations, therapeutic interventions and underlying diseases. Unfortunately, many human microbiota datasets available publicly were collected to study the impact of single variables, and typically consist of outpatients in cross-sectional studies, have small sample numbers and/or lack metadata to account for confounders. These limitations can complicate reusing the data for questions outside their original focus. Here, we provide comprehensive longitudinal patient dataset that overcomes those limitations: a collection of fecal microbiota compositions (>10,000 microbiota samples from >1,000 patients) and a rich description of the “hospitalome” experienced by the hosts, i.e., their drug exposures and other metadata from patients with cancer, hospitalized to receive allogeneic hematopoietic cell transplantation (allo-HCT) at a large cancer center in the United States. We present five examples of how to apply these data to address clinical and scientific questions on host-associated microbial communities.


Background & Summary
The intestinal microbiota is critical to many aspects of human health 1 , yet most of our knowledge of mechanisms linking microbiota and host phenotypes comes from animal models 2 . Experiments with animals are important, but the specific mechanisms can be hard to translate into human therapies: the gut microbiota can differ between the animal and humans, and so can environmental factors such as the pathogen-free environment, diets, inbreeding, and subject geno-and pheno-types that drive a disease 2 . Instead, associations inferred directly from patients undergoing well-defined perturbations could accelerate the discovery of new mechanisms of microbiota-host interaction to benefit human health 3 . This idea fits the concept of reverse translational research 4 : Microbiota data acquired from well-monitored patients could be reused in other studies beyond the study for which it was originally collected; the longitudinal data could empower causal inference and lead to hypotheses that are more likely to produce microbiota-targeted therapies for patients.
The composition of the gut microbiota measured from feces using 16S rRNA amplicon sequencing can vary widely from person to person and can even change over time within a single person [5][6][7] . This large variability hampers the inference of microbiota-host associations. Furthermore, many microbiota studies in humans are cross-sectional, have low sample numbers or collect data from outpatients that may lack information on www.nature.com/scientificdata www.nature.com/scientificdata/ confounding factors such as the status of the immune system of the human host and drugs such as antibiotics. Limitations such as these can introduce statistical bias in microbiome research 8 . Therefore, we recently compiled an extensive hospitalome of our patients, including a vast collection of medication administrations and blood phenotype data 3 . Time-series data of hospitalized patients that link the past events to the future events in an individual person are valuable and necessary assets to reveal "mathematically causal" relationships 9 .
Here we provide a description of the gut microbiota data that we acquired from patients hospitalized to receive allogeneic hematopoietic cell transplantation (allo-HCT) at Memorial Sloan Kettering (MSK). Over the years we have been depositing the raw sequencing data at the NCBI's short read archive (SRA) in batches called BioProjects. Here we provide links to deposited data for each sample analyzed, which helps the reader obtain a paired-end fastq file for any given sample without having to navigate the multiple BioProjects that our team has submitted over the years. We also provide computer code written in Matlab (The Mathworks Inc., version 2018a), which exemplifies how to analyze and extract insights from the curated data tables that compose this rich longitudinal dataset. We have used these data before in several publications in the past 10 years 3,10-27 , in part or in whole, to design lab experiments with mice that identified mechanisms such as microbiota protection against vancomycin resistant Enterococcus 17 , the impact of dietary lactose on the expansion of Enterococcus in the gut 20 , the mechanisms of resistance to colonization by Clostridioides difficile 19 , and to find microbiome risk factors for Graft-versus-host disease (GVHD)-related mortality 20,21,28 . The data compiled here include >10,000 microbiota samples and clinical metadata from >1,000 patients. We will continue to expand these data as we continue collection, but so far, we include only patients hospitalized for allo-HCT. This choice makes the cohort uniform in crucial ways: The underlying disease is typically a hematologic malignancy such as leukemia or lymphoma, and the patients all have received chemotherapy and in some cases irradiation as conditioning regimen before infusion of hematopoietic cells from a healthy donor to reconstitute the hematopoietic system.
Patients undergoing allo-HCT are carefully monitored during their hospitalizations: HCT is considered the strongest perturbation of the immune system deliberately carried out in humans. Patients receive prophylactic antibiotics before the transplant (at MSK during certain years, a combination of a fluoroquinolone and intravenous vancomycin 29 ), and often they also receive empirical (without evidence of bloodstream infection) or therapeutic (with evidence of bloodstream infection) antibiotics in response to infection symptoms such as a neutropenic fever. Patient treatment regimens and the microbiota dynamics associated with those treatments allowed us to infer the association of individual antibiotic exposures on microbiota composition in vivo 10 , and the time-series of white blood cells collected routinely for these patients allowed us to infer the impact of the gut microbiota in the dynamics of the immune system of its human host 3 . But we believe there are many more questions to be addressed.
We compiled this vast longitudinal dataset of microbiota and associated clinical metadata from allo-HCT patients believing that it will interest investigators outside of our center. Here we organize and explain the dataset to aid those investigators in addressing new questions. Our first goal is to compile and annotate easily accessible links to public repositories where the data may be obtained. Our second goal is to guide readers through quantitative analysis of the data. The analysis examples are: displaying a patient's microbiota timeline and metadata; visualizing the entire dataset using the t-Distributed Stochastic Neighbor Embedding (t-SNE) dimensionality reduction technique 30 with a patient's trajectory overlaid on top; quantifying the association between exposure to antibiotics and gut microbiota composition; and performing survival analysis where changes in microbiota composition predict patient risk of bloodstream infection. This compilation of resources facilitates future data access, analysis, and interpretation. In particular, we guide readers through the sample filtering criteria specific to each example.

Methods
Human participants and clinical metadata. Sample collection from patients and analysis of the biospecimens were approved by the Memorial Sloan Kettering Cancer Center Institutional Review Board. All participants provided signed informed consent for specimen collection. Clinical metadata, including antibiotics, body temperature, timing and identity of isolates of bloodstream infections were obtained from semi-automated parsing of clinical records with extensive data curation. Absolute blood cell counts were obtained from routine clinical laboratory studies 3 . The data were stripped of all patient identifiable fields such as medical record numbers. The PatientID is a non-identifiable patient number that can be used to link clinical metadata to microbiota sample data. Note that a minority of patients acquired before we systematized our naming convention are named with "patient_witth_sample_####".
All dates of sample and clinical metadata collection were removed and set to artificial time points. To avoid publication of the calendar dates of events associated with patients, we made the event dates of any patient to be relative to a patient-specific, deidentified reference date (these deidentified dates are provided in columns "Timepoint", "TimepointOfTransplant", "StartTimepoint" and "StopTimepoint"; see below). The secret reference dates will not be disclosed. We also provided columns that conveniently represent the event dates relative to the date of nearest HCT of any patient (see below for columns "DayRelativeToNearestHCT", "StartDayRelativeToNearestHCT", "StopDayRelativeToNearestHCT").
For each sample, duplicate 50-μl PCRs were performed, each containing 50 ng of purified DNA, 0.2 mM deoxynucleotide triphosphates, 1.5 mM MgCl2, 2.5 U Platinum Taq DNA polymerase, 2.5 μl of 10 × PCR buffer, and 0.5 μM of each primer designed to amplify the V4-V5: 563 F (5′-nnnnnnnn-NNNNNNNNNNNN-AYTGGGYDTAAAGNG-3′) and 926 R (5′-nnnnnnnn-NNNNNNNNNNNN-CCGTCAATTYHTTTRAGT-3′). A unique 12-base Golay barcode (Ns) precedes the primers for sample identification 31 , and one to eight additional nucleotides were placed in front of the barcode to offset the sequencing of the primers. Cycling conditions were 94 °C for 3 min, followed by 27 cycles of 94 °C for 50 s, 51 °C for 30 s, and 72 °C for 1 min. For the final elongation step, 72 °C for 5 min was used. Replicate PCRs were pooled, and amplicons were purified using the QIAquick PCR Purification Kit (Qiagen). PCR products were quantified and pooled at equimolar amounts before Illumina barcodes and adaptors were ligated, using the Illumina TruSeq Sample Preparation protocol. The completed library was sequenced on an Illumina MiSeq platform following the Illumina recommended procedures with a paired-end 250 × 250 bp kit.
Amplicon sequence variants (ASVs) were identified from 16S paired-end sequencing using the Divisive Amplicon Denoising Algorithm (DADA2) pipeline including filtering and trimming of the reads 32 . Reads were trimmed to remove the first 180 bp or the first point with a quality score Q < 2, and reads containing ambiguous nucleotides (N) or if two or more errors were expected based on the quality of the trimmed read were removed. Taxonomy was assigned to ASVs using a 8-mer based classifier trained by IDTaxa 33 using the SILVA database 34 .
Quantification of microbiota density and detection of the vanA gene. qPCR was performed on DNA extracted from the samples using DyNAmo SYBR Green qPCR kit (Finnzymes) and 0.2 μM of the universal bacterial primer 8 F (5′-AGAGTTTGATCCTGGCTCAG) and the broad-range bacterial primer 338 R (5′-TGCTGCCTCCCGTAGGAGT-3′). Standard curves were prepared by serial dilution of the PCR blunt vector (Invitrogen) containing 1 copy of the 16 s rRNA gene. Cycling conditions were 95 °C for 10 minutes followed by 40 cycles of 95 °C for 30 seconds, 52 °C for 30 seconds, and 72 °C for 1 minute. Detection of the vanA gene was conducted via PCR performed using specific primers for the vanA gene: forward, 5′-AATCGGCAAGACAATATGAC; reverse, 5′-ACCTCGCCAACAACTAACGC.

Data Records
The following data have been compiled as comma-separated value (csv) files in Figshare 35 . Folders "samples" and "taxonomy" contain a single file each at the moment. We expect to expand the contents of these folders as we compile more data (e.g., taxonomy of fungal ITS sequences, which is a recent interest of our team 36 . 1) and the t-SNE plot (Fig. 2). • ColorOrder: auxiliary integers used to group and sort the orders of ASVs in stacked barplot (Fig. 1) technical Validation Displaying a patient timeline. We will display an individual patient's microbiota timeline with associated clinical metadata as our first example analysis. We chose one patient (PatientID = 1511) and plotted the metadata by displaying the dates of sample acquisition relative to the HCT date (day −5 to day +21, Fig. 1). Therefore, the data filtering step here was straightforward: we excluded any sample that did not meet the criteria of PatientID = 1511 and −5≤DayRelativeToNearestHCT≤ + 21.
The top panel shows the maximum body temperature recorded for each day as a black dot, with red dots representing fevers (Temperature >37 °C or 100.4 °F). The second panel shows the anti-infectives given to the patient, with horizontal bars indicating daily administration of antibiotics, antivirals or antifungals. The third panel shows the count of neutrophils for the patient, with red dots representing neutropenia (<500 cell/μL). Finally, the bottom plot shows the gut microbiota composition analyzed by 16S rRNA amplicon sequencing in the 13 microbiota samples that fit the filtering criteria. Note that the values plotted are relative abundances. Relative abundances are www.nature.com/scientificdata www.nature.com/scientificdata/ obtained by dividing the number of read counts for that ASV by the number of total read counts in that sample (that total value may also be called the depth of that sample). Note that in the inference example below (Fig. 4) we will compute absolute abundances by multiplying the relative abundances by the qPCR16S column in tblqpcr.csv for that sample (the number of 16S copies per gram of stool sample measured by qPCR).
We can see how patient 1511 had a significant change in their microbiota composition after intravenous administration of piperacillin/tazobactam at day + 3 (indicated by a red dashed line). This combination of a beta-lactam (piperacillin) and a beta-lactamase inhibitor (tazobactam) is typically empirically administered at MSK to eradicate bacterial infections and, as such, was administered here in response to the fever occurring on that same day. We observe the expected drastic changes in the microbiota composition immediately following piperacillin/tazobactam administration 10 , with some bacteria dropping in their relative abundance (e.g., Erysipelatoclostridium) and other bacteria increasing (e.g., Lactobacillus and Streptococcus). The neutrophil counts show prolonged neutropenia (red), and engraftment on day 19 post-HCT.
Visualizing the entire dataset of microbiota compositions. The large number of samples in our dataset provides a comprehensive overview of the microbiota compositional states experienced by the patients during their hospitalization. Here we used t-SNE 30 to obtain a two-dimensional representation of all microbiota compositions which amount to 12,546 data points (Fig. 2). The t-SNE plot collapses high-dimensional microbiotas of HCT-receiving patients into several distinct clusters including high-diversity samples (located closer to the center of the plot) and lower diversity samples dominated by different bacteria. Samples dominated by Enterococcus (represented in green) are the most common type of low-diversity compositional state observed in the patients, and Enterococcus domination has been previously associated with higher risk of bloodstream infections by vancomycin-resistant Enterococcus 13 , with higher risk of graft-vs-host disease related death and a higher risk of overall mortality 20 .
We can overlay the trajectory of an individual patient's microbiota composition on the t-SNE plot. This type of trajectories were used before to trace the success or failure of fecal microbiota transplants 18   www.nature.com/scientificdata www.nature.com/scientificdata/ we overlaid the trajectory of PatientID = 1511 between day −5 and day + 21 (white arrows in Fig. 2 connect the consecutive microbiota samples shown in Fig. 1).
Inferring the impact of antibiotics on the microbiota using regularized regression. All patients in our dataset received antibiotics, but they received different antibiotics at different times and in different combinations. In previous studies we have used this variability to infer the impact of antibiotics on the different bacteria present in the gut microbiota by correlating microbiota changes with antibiotic exposures 3,10 .
In this example we use a similar approach to infer the impact of antibiotics on the 20 most abundant microbes (ASVs) in the dataset. We selected for analysis only medications that belong to antibacterial classes, and we excluded atovaquone that is used for prophylaxis of pneumocystis and toxoplasma. As in a previous study from our team 10 , here we made a distinction between the administration route (e.g. oral or intravenous). Figure 3 shows the fraction of all patients who have ever received a specific antibiotic orally or intravenously (the two antibiotic administration routes accounts for 97% of all cases). The most frequently used antibiotics were sulfamethoxazole trimethoprim and vancomycin for oral and intravenous administrations respectively. Intravenous vancomycin was mainly used prophylactically for almost all patients; however, it was also empirically administered orally to kill infectious bacteria (e.g., Clostridioides difficile) in the gut because intravenous vancomycin may not reach adequate concentrations in the gastrointestinal tract. Quinolones (ciprofloxacin and levofloxacin), metronidazole, and azithromycin were given both orally and intravenously at similar frequencies among patients. Piperacillin/ tazobactam was the second most commonly administered antibiotic by intravenous infusion.
For simplicity, we assume antibiotics independently affect the exponential growth rate of microbes and the magnitude of their effects can be inferred by the following linear regression: Most abundant taxon in a sample www.nature.com/scientificdata www.nature.com/scientificdata/ where N k is the absolute abundance of a microbial taxon k obtained by multiplying its relative abundance by the qPCR value, g k is the intercept representing the maximum exponential growth rate in the absence of antibiotics, ε k,j is the coefficient corresponding to antibiotic j representing the susceptibility of microbe k, μ j (t) is a binary variable that represents the presence/absence of the antibiotic j at time t, and M a is the total number of antibiotics considered in the analysis. For any consecutive microbiota sample pair p observed between t p,i (initial sample) and t p,f (final sample), Eq. (1) can be transformed into an integral form , represents the time interval between the two samples and represents the total exposure time to antibiotic j within the period. We discarded samples pairs when either or both of N k (t p,i ) and N k (t p,f ) is 0.
By taking all pairs of consecutive samples into account (p = 1, ···, M p ), we formulated an independent linear regression problem for each microbe i  www.nature.com/scientificdata www.nature.com/scientificdata/ where ⋅ F 2 is the Frobenius norm. We chose λ to minimize the sum of squares error on unseen (test) data using 3-fold cross-validation (with the Matlab option of 10 Monte-Carlo repetitions) and the optimal λ was then applied to the entire dataset for estimating the values of growth and susceptibility parameters.
We limited ourselves to sample pairs that are at most 3 days apart (the minimum interval that includes >50% data) and both have absolute abundances determined by qPCR data. We also focused on the 20 most abundant ASVs that have non-zero abundances among samples-from a total of 289 patients-fulfilling the minima criteria for inclusion in the regression. We included antibiotics administrated orally and intravenously but separated our analysis for the two administration routes. We also included only anti-bacterial drugs and grouped these drugs based on their drug category (Fig. 3). We separated metronidazole and aztreonam from the miscellaneous class as independent groups and removed other drugs in that class. For most anti-bacterial antibiotic classes, it is typical that one or two antibiotics were administered much more often than the rest drugs in the same class (Fig. 3).
The effect sizes obtained using this model show that the route of administration influences the effects of antibiotics on microbiota compositions (Fig. 4): intravenous drugs tend to be more effective to reduce microbial abundances in the gut relative to oral drugs. Piperacillin/tazobactam (penicillins) and vancomycin (glycopeptide antibiotics)-the most commonly administered antibiotics by intravenous infusion-have the strongest inhibitory effects on many commensal bacteria such as Blautia, Ruminococcus, Erysipelatoclostridium. The timeline of microbial composition shift in Fig. 1 illustrated the inferred negative effect of piperacillin/tazobactam on ASV 6 (Erysipelatoclostridium) at day 3. However, the expansion of Lactobacillus upon administration of piperacillin/ tazobactam-suggesting a positive antibiotic effect-was incorrectly inferred. This potential caveat due to the simplicity of our model is discussed below with details.

intestinal domination increases a patient's risk of bloodstream infection. Patients undergo-
ing allo-HCT are at high risk to bloodstream infections, especially during neutropenia (Fig. 5). We previously reported patient risk to bloodstream infection by Enterococcus after the gut microbiota became dominated by Enterococcus, where domination was defined as a relative abundance >30% 13 . A similar analysis was later conducted to determine the risk that a patient would develop a bloodstream infection by various gram-negative bacteria after the gut microbiota became dominated by a bacteria of the same genus, with domination again defined as >30% 23 .
Here we run that same type of survival analysis: We restricted our analysis to the period from day −15 to day +35 and we determined the risk of a bloodstream infection with Enterococcus (Fig. 6a) and Escherichia (Fig. 6b). We used the Cox proportional hazards model with domination as a time-dependent covariate, which starts at value 0 and changes to 1 once the relative abundance of the genus in the stool increases above a given threshold. www.nature.com/scientificdata www.nature.com/scientificdata/ We first conducted analysis assuming a 30% domination value. Recapitulating our previous results, intestinal domination by Enterococcus had a hazard ratio of 3.8 for blood-stream infection by Enterococcus (P < 0.05, with a [2.6-5.5] 95% confidence interval) and intestinal domination by Escherichia had a hazard ratio of 8.1 for bloodstream infection by Escherichia (P < 0.05 with a [4.4-15.0] 95% confidence interval).
We then repeated the same analysis for dominations defined at 10%, 1% and 0.1% (Fig. 6c,d). The analyses reveal that Enterococcus abundances as small as 1% increase the risk of blood-stream infection by Enterococcus  www.nature.com/scientificdata www.nature.com/scientificdata/ significantly (P < 0.05). The case for Escherichia was more sensitive: Escherichia abundances as small as 0.1% increased the risk of bloodstream infection by Escherichia significantly (P < 0.05). Such analyses, all conducted using the Cox proportional hazards model, can be extended to find other factors that increase the risk of bloodstream infections and other clinical complications of allo-HCT.
Discussions and potential caveats. We present repositories for a longitudinal microbiota dataset obtained from efforts at MSK from the past decade of monitoring cancer patients hospitalized to receive allo-HCT. These microbiota data, combined with the curated clinical metadata presented here, can serve as powerful hypothesis generators for microbiome studies. For example, we illustrate the use of these data with five analyses and we provide links to code repositories where these examples can be followed by the interested user. Despite the comprehensiveness of our data collection, not all factors (e.g., diet) that influence gut microbiota composition have been included and it is not yet available to analyze their effects using our datasets. We expect to release more data types in the future.
We showed an example of inference where we quantified the impact of antibiotics on the 20 most abundant microbes (ASVs). That example (Fig. 4) illustrates how to pose a supervised question to address an important microbiota problem: how different antibiotics might have an impact on the different bacteria that makeup the microbiota. We had addressed a related question before using straightforward statistical tests for a dataset of 94 patients, where we determined whether an antibiotic could lead to intestinal domination 13 . We have also used more sophisticated Bayesian methods to infer the impacts of antibiotics on the microbiota using a dataset with 18 of these patients 10 . In the present manuscript, we presented a similar approach and used a simple model of exponential bacterial growth combined with penalized least squares regression to solve the same problem for the most abundant taxa, but now for a dataset with a much larger number of patients. Despite the simpler linear model, the results captured the negative impact of piperacillin/tazobactam on many commensal bacteria, confirming our previous studies 3, 10 . We here also visualized Lactobacillus expansion and intestinal domination, illustrated in Fig. 1, but this was not captured by our model. The most likely reason that our simple model of exponential growth could not capture this is that after an expansion of Lactobacillus following the administration of piperacillin/tazobactam, its abundance remains relatively unchanged over multiple consecutive days. Since the antibiotics continued to be administered during the time, the inference will combine all the data and conclude that-on average-there is no measurable impact of piperacillin/tazobactam on Lactobacillus. Therefore, the model as developed here is only sensitive to transient expansions of bacteria but unable to capture bacterial expansions followed by sustained dominations. One possible way to resolve this issue in future iterations of the analysis is to include more realistic details in a more complicated model, such as an ecological carrying capacity and microbial interactions, as done before 3,10,37-39 .
Finally, some notes of caution on generalizing these findings: The patient cohort presented here consists entirely of patients undergoing allo-HCT at MSK, which represent states of the human immune system and microbiota compositions far from that of healthy people. The microbiota changes observed during a patient's treatment occur in parallel with a high burden of antibiotic exposure, dietary perturbation, and gastrointestinal inflammation induced by the conditioning regimen. The combination of these profound microbiota perturbations rarely occurs for healthy individuals. The clinical complications associated with the microbiota changes observed in these patients may also be specific. For example, we found that Enterococcus or Escherichia expansions increase the risk of bloodstream infections by those bacteria when detected in the microbiota at abundances as low as 1% (Fig. 6). This risk may be specific to transplant recipients damaged by conditioning regimen with injury to the intestinal barrier that could facilitate the translocation of pathogens from the gut into the blood 40 . Still, allo-HCT patients provide a unique opportunity to study microbiota dynamics in such extreme situations. There are many other opportunities for analysis that we did not discuss here such as the impact of the microbiota on the counts of immune cells in circulation 3 . The findings made from these data may be applicable to other HCT patients and perhaps other patients undergoing similar perturbations.

additional information
Correspondence and requests for materials should be addressed to J.S. or J.B.X.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver http://creativecommons.org/publicdomain/zero/1.0/ applies to the metadata files associated with this article.