A resource of lipidomics and metabolomics data from individuals with undiagnosed diseases

Every year individuals experience symptoms that remain undiagnosed by healthcare providers. In the United States, a rare disease is defined as a condition that affects fewer than 200,000 individuals. However, there are an estimated 7000 rare diseases, and an estimated 25–30 million Americans in total (7.6–9.2% of the population as of 2018) are affected by such disorders. The NIH Common Fund Undiagnosed Diseases Network (UDN) seeks to provide diagnoses for individuals with undiagnosed disease. Mass spectrometry-based metabolomics and lipidomics analyses could advance the collective understanding of individual symptoms and advance diagnoses for individuals with heretofore undiagnosed disease. Here, we report the mass spectrometry-based metabolomics and lipidomics analyses of blood plasma, urine, and cerebrospinal fluid from 148 patients within the UDN and their families, as well as from a reference population of over 100 individuals with no known metabolic diseases. The raw and processed data are available to the research community so that they might be useful in the diagnoses of current or future patients suffering from undiagnosed disorders.
Measurement(s): Metabolomics • Lipidomics
Technology Type(s): gas chromatography-mass spectrometry • Ultra High-performance Liquid Chromatography/Tandem Mass Spectrometry
Factor Type(s): age group • sex
Sample Characteristic - Organism: Homo sapiens
Sample Characteristic - Environment: blood plasma material • urine material • cerebrospinal fluid material
Sample Characteristic - Location: United States of America
Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.13656581


Background & Summary
Metabolites and lipids can be responsive to both genetic and environmental influences. Variations may occur due to host genes, disease states, lifestyle, diet, medications and the interaction with the gut microbiome 1 . Many rare diseases have genetic origins, but their symptoms can also be impacted by non-inherited causes such as infections, cancers, and other acquired conditions. Metabolomics and lipidomics analyses have been helpful in identifying inborn errors of metabolism, and in characterizing acquired metabolic conditions such as diabetes and metabolic syndrome 2,3 . These conditions are typically associated with a small number of metabolites and/or lipids that are significant outliers, and easily identified as abnormal.
In contrast, the metabolic changes in rare and undiagnosed diseases may be more subtle, consisting of complex patterns of minor changes of a large number of analytes rather than a few significant outliers. Due to the rare nature of these disorders, the number of individuals with a given phenotype is usually limited to one or just a few, precluding the use of the balanced study designs typically used in metabolomics. For these reasons the use of metabolomics and lipidomics analyses in the evaluation of rare and undiagnosed diseases presents many unique challenges.
The NIH Common Fund's Undiagnosed Diseases Network (UDN) was established to accelerate the diagnosis and clinical management of rare or previously unrecognized diseases, and to advance research in disease mechanisms 4 . The UDN is composed of multiple clinical sites around the United States, and multiple research cores including DNA sequencing (whole exome and whole genome), model organisms (e.g., drosophila and zebrafish) and metabolomics 4,5 . As the Metabolomics Core for Phase I of the UDN, our role was to provide comprehensive untargeted measurements to identify qualitative and quantitative changes of metabolites (metabolomics) and lipids (lipidomics) in biofluids from probands (i.e. individuals with an undiagnosed disease accepted into the UDN) to assist in the evaluation and/or identification of the causes of rare and undiagnosed diseases. Here, we describe in detail the raw and processed metabolomics and lipidomics data from analyses of UDN patient samples and make the data available to the research community so that it might be useful in the diagnoses of current or future patients suffering from undiagnosed disorders. Our previous publication (Webb-Robertson et al. 6 ) described the detailed statistical approach used for processing this same underlying data set, and so we refer readers to that work for more details on the statistical analyses employed.

Methods
Study design. Identifying metabolite and lipid outliers in individual probands by untargeted metabolomics required a normal (reference) population for comparison. A reference dataset against which metabolomics data from UDN probands and their relatives could be compared was generated by metabolomics analysis of plasma, urine, and CSF from individuals with no known metabolic disease (Fig. 1). Approval for the study of the individuals in the UDN was provided by the National Institutes of Health under protocol number 15-HG-0130. The UDN is registered at ClinicalTrials.gov under identifier NCT02450851.
UDN probands suffer from undiagnosed diseases and thus are typically represented as a sample size of one; therefore, understanding normal variation within a proband's condition is not possible. To address this issue, we performed power analyses of historical plasma and urine data from the Pacific Northwest National Laboratory (PNNL), assuming an uneven study design (e.g. n = 1 for probands and n = ≥(10-150) for healthy controls) 6 . This analysis determined that data from 80-120 healthy individuals would be required to perform a well-powered statistical analysis of the data from a UDN proband. This reference dataset is used to understand normal metabolome variation in a population of similar demographics to the UDN population, which is essential for evaluating the metabolome and lipidome data from individual UDN probands and characterizing the pathophysiology and etiology of their undiagnosed disease.
Reference population. The composition of the reference population (approximately 50% children (<18 years of age) and approximately 50% female) was selected to represent the demographics of the participants enrolled in the Undiagnosed Diseases Program, an NIH intramural program upon which the UDN is based 7 . Biofluids for the reference population included samples collected from the Oregon Clinical & Translational Research Institute Biolibrary (adult plasma and urine), the Oregon Health & Science University Layton Aging and Alzheimer's Research Center (adult CSF), the Vanderbilt University Metabolic Screening Laboratory (paediatric plasma), the Mayo Clinic Biochemical Genetics Laboratory (paediatric and young adult urine and CSF), and BioIVT (adult CSF) 6 (Table 1; figshare 8 ('Demographic information for reference population')). The individuals composing the reference dataset also consented to sample collection under Institutional Review Boards (IRB) at the respective institutions.

Fig. 1
Overview of the study design. Biofluid samples were collected from probands at the UDN clinical sites and then extracted for metabolomics (urine, plasma, CSF) and lipidomics (plasma and CSF) analyses using chromatography coupled to mass spectrometry (GC-MS for metabolomics and LC-MS/MS for lipidomics). Data were pre-processed, including data quality checks, normalized, and compared against data from the reference population of healthy individuals. Metabolomics and lipidomics results in the form of Z-score, log2 fold change and p-value per metabolite and lipid of the proband (and associated family members, if applicable) were reported back to the respective UDN Clinical Site for diagnostic assistance.
For the paediatric and young adult CSF, due to the limited volumes available (100 µl), samples were pooled to reach the required volume of 200 µl for metabolomics analysis. Each CSF paediatric and adolescent reference sample is thus composed of two individuals of the same sex and similar age (e.g., 2 years old combined with 3 months old, and 14 years old combined with 16 years old).
Biofluid collection for UDN participants and sample management. Biofluids were collected from UDN probands by the UDN clinical site at which the individual was evaluated. Written consent was provided by all UDN participants and/or legal guardians prior to sample collection, under IRB approval. For each sample, the collection time, fasting state and duration, symptoms, diet supplements, and medications were documented (figshare 8 ('Listing of metabolomics and lipidomics raw data files')). To assist in determining potential genetic and environmental influences on the metabolomics findings, samples were also collected from unaffected family members when possible. Metabolomics and lipidomics analyses were conducted on 281 UDN participants, including 148 probands and 133 family members (101 unaffected, 25 affected, 7 unknown) (figshare 8 ('Listing of metabolomics and lipidomics raw data files')). This comprised 540 biofluid samples for analysis (295 plasma, 239 urine, and 6 CSF) (Table 2). Combining the reference population and the UDN participants, mass spectrometry analyses were conducted on 2781 biofluid samples. A list of UDN probands with available diagnoses is provided (figshare 8 ('UDN probands with available diagnoses')).
Blood samples for plasma were collected in purple top EDTA Vacutainer ® tubes. The blood was centrifuged at 10 000 × g for 10 minutes at 4 °C. Three 50 µl aliquots of plasma were transferred into 0.5 mL Sarstedt Biosphere ® SC Micro Tubes. Samples were flash frozen in liquid nitrogen or quick frozen in dry ice/ethanol prior to storage in either a −80 °C or liquid nitrogen freezer with appropriate labels (ID, sample type, and collection date).
Urine samples were requested to be the first morning void and were collected in a polypropylene container. The urine was centrifuged at 1000 × g for 5 minutes at 4 °C to remove any cells and particulates. Three 100 µl aliquots were transferred into 0.5 mL Sarstedt Biosphere ® SC Micro Tubes and flash frozen in liquid nitrogen or quick frozen in dry ice/ethanol prior to storage in either a −80 °C or liquid nitrogen freezer with appropriate labels (ID, sample type, and collection date).
CSF was collected by lumbar puncture in the L3/L4 or the L4/L5 inter-space. If the samples were not blood contaminated, the sample tubes were placed on ice (or dry ice if available), and then transferred to a −80 °C freezer. If the samples were blood contaminated, the samples were centrifuged immediately (prior to freezing) and the clear CSF transferred to new tubes. Three 200 µl aliquots were transferred into 0.5 mL Sarstedt Biosphere ® SC Micro Tubes and flash frozen in liquid nitrogen or quick frozen in dry ice/ethanol prior to storage in a −80 °C freezer with appropriate labels (ID, sample type, and collection date).
All biofluid samples were shipped to the Pacific Northwest National Laboratory on dry ice and stored in −70 °C freezers until sample processing for mass spectrometry (MS) analysis.
Quality control samples. The NIST SRM 1950 was used as a plasma QC 9,10 . The NIST QC is pooled from 100 healthy individuals between 40 and 50 years old, with an equal number of men and women and a race distribution representative of the US population. The NIST QC is a commercially available reference material (certified until year 2023) and was chosen due to the multi-year nature of this study. For urine and CSF, as no commercially available reference materials were identified, pools were generated from the reference population for each respective biofluid and used as QCs.
Sample batches. Sample batches were formed based on the number of analyses that could be performed in approximately one day (~33 analyses). Randomized run orders were generated based on sex, age, ethnicity (if provided), family association, and clinical site (if samples from more than 1 clinical site were available at the time of batching) (see Technical Validation section) prior to extraction, sample preparation, chemical derivatization of metabolites, and instrument analysis runs.
The instrument run order included a batch structure that enabled data normalization via Quality Control (QC)-based Robust Locally Estimated Scatterplot Smoothing (LOESS) Signal Correction (QC-RLSC) 11 to specifically account for batches of samples that were not analysed back to back but dispersed over a longer timeframe 6 ( Table 3). The Pilot batches were the initial batches to be analysed and were used to confirm the normalization approach. Both Pilot and Project batches were processed using the same methodologies. This normalization method required a batch structure with specific placement of QC samples. For GC-MS analyses, the batch began with 2 blanks, 1 fatty acid methyl ester (FAME), 1 blank, 3 QCs, samples with evenly dispersed single QCs, and ending with 2 QCs. LC-MS/MS batches were similar except there was no FAME and a blank was run after the first 3 QCs, the middle QC, and at the very end to assess carryover 6 .
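The GC-MS batch layout described above can be sketched as a small function that builds an injection list. This is only an illustration in Python, not the authors' tooling: the function name and the `qc_every` spacing parameter are assumptions, since the text states only that single QCs were evenly dispersed among the samples.

```python
def gcms_run_order(samples, qc_every=8):
    """Return an injection list for one GC-MS batch.

    `qc_every` (number of samples between interspersed QCs) is a
    hypothetical parameter; the source only says the single QCs were
    evenly dispersed.
    """
    if qc_every < 1:
        raise ValueError("qc_every must be at least 1")
    # Opening block: 2 blanks, 1 FAME, 1 blank, 3 QCs.
    order = ["blank", "blank", "FAME", "blank", "QC", "QC", "QC"]
    for position, sample in enumerate(samples, start=1):
        order.append(sample)
        # Single QC after every `qc_every` samples, except after the
        # final sample, where the two closing QCs follow anyway.
        if position % qc_every == 0 and position < len(samples):
            order.append("QC")
    # Closing block: 2 QCs.
    order.extend(["QC", "QC"])
    return order


if __name__ == "__main__":
    batch = gcms_run_order([f"S{i:02d}" for i in range(1, 21)], qc_every=5)
    print(len(batch), batch)
```

An LC-MS/MS variant would drop the FAME injection and add blanks after the first three QCs, after the middle QC, and at the end, as described in the text.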
Extraction of metabolites from urine. For urine, 100 μL was used for metabolite extraction, as previously described 14 . Samples were transferred to MμlTI SafeSeal Sorenson microcentrifuge tubes to which 50 μL of GC-MS internal standards (malonic acid-d4, fructose 13C6, L-tryptophan-d5, lysine-d4, alanine-d7, stearic acid-d35, benzoic acid-d5, octanoic acid-d15 at a final concentration of 1 μg/μL each) and 100 μL of a 1 mg/mL solution of urease prepared in water were added. The samples were incubated for 30 minutes at 37 °C with mild shaking to deplete urea. Metabolites were then extracted with concomitant protein precipitation by addition of 1 mL of cold (−20 °C) methanol. Samples were vortexed for 30 seconds and precipitated proteins were isolated by centrifugation. The supernatants were transferred to glass autosampler vials and then dried in vacuo. Metabolite extracts were stored dry at −20 °C until chemical derivatization (see below).
Chemical derivatization of metabolites. Polar metabolites were chemically derivatized prior to metabolomics analysis. Two post-extraction standards (pentadecanoic acid-d3 and 3-hydroxymyristic acid-d5 at 1 μg/μL final concentration) were added to monitor instrument performance. Chemical derivatization of metabolites was previously detailed 14 . To protect carbonyl groups and reduce the number of tautomeric isomers, 20 μL of methoxyamine in pyridine (30 mg/mL) was added to each sample, followed by vortexing for 30 seconds and incubation at 37 °C with generous shaking for 90 minutes. To derivatize hydroxyl and amine groups to trimethylsilylated (TMS) forms, 80 μL of N-methyl-N-(trimethylsilyl)trifluoroacetamide (MSTFA) with 1% trimethylchlorosilane (TMCS) was added to each vial, followed by vortexing for 10 seconds and incubation at 37 °C with shaking for 30 minutes. The samples were allowed to cool to room temperature and were analysed the same day.
GC-MS analysis. An Agilent GC 7890A coupled with a single quadrupole MSD 5975C was used to analyze chemically derivatized metabolites. GC-MS analysis was previously detailed 14 . Briefly, 1 μL of each sample was injected onto a HP-5MS column (30 m × 0.25 mm × 0.25 μm; Agilent Technologies, Inc). The injection port temperature was held at 250 °C throughout the analysis. The GC oven was held at 60 °C for 1 minute after injection then increased to 325 °C by 10 °C/min, followed by a 5-minute hold at 325 °C.
Metabolite identification and data processing. Metabolite identifications and data processing were conducted as previously detailed 14 . GC-MS raw data files were processed using the Metabolite Detector software, version 2.0.6 beta 16 . Retention indices (RI) of detected metabolites were calculated based on the analysis of the FAMEs mixture, followed by their chromatographic alignment across all analyses after deconvolution. Metabolites were identified by matching experimental spectra to an augmented version of the Agilent Fiehn Metabolomics Retention Time Locked (RTL) Library 17 , containing spectra and validated retention indices. All metabolite identifications were manually validated. The NIST 08 GC-MS library was also used to cross validate the spectral matching scores obtained using the Agilent library and to provide identifications for metabolites that were initially unidentified. The three most abundant fragment ions in the spectra of each identified metabolite were automatically determined by Metabolite Detector, and their summed abundances were integrated across the GC elution profile. A matrix of identified metabolites, unidentified metabolite features, and their corresponding abundances for each sample in the batch was exported for statistics.
Processing the data from the analyses of the reference population resulted in the identification of 81 plasma polar metabolites (across 16 super classes and 27 classes as categorized in the Human Metabolome Database 18,19 ), 116 urine metabolites (across 17 super classes and 28 classes), and 82 CSF metabolites (across 14 super classes and 26 classes) (Table 4).
Lipid identification and data processing. LC-MS/MS lipidomics data were analyzed using LIQUID (Lipid Informed Quantitation and Identification) 15 . Analysis parameters included an initial precursor mass error tolerance of 20 ppm (i.e. ±10 ppm), and fragment mass error tolerances of 20 ppm (±10 ppm) and 500 ppm (±250 ppm) for HCD and CID MS/MS events, respectively. Confident identifications were selected by manually evaluating the MS/MS spectra for diagnostic and corresponding acyl chain fragments of the identified lipid. In addition, the precursor isotopic profile, extracted ion chromatogram, and mass measurement error along with the elution time were evaluated. For certain lipids, multiple LC peaks having nearly identical MS/MS spectra were observed, suggesting the presence of lipid stereoisomers. In these cases, the stereochemistry of the lipid isomers could not be completely determined based on the LC-MS/MS data alone, and so these isomers are annotated with "_A", "_B" or "_C" at the end of the lipid name. Typically, the mass measurement error of confidently identified lipids was within ± 2.5 ppm. Given the time-consuming nature of manual validation of each identified lipid, a library of confident lipid identifications was generated from the reference dataset and select UDN participants (3 NIST QCs, 6 pooled plasma of reference population, 2 reference individuals, and 3 UDN participants).
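For readers less familiar with ppm tolerances, the following Python sketch shows how a mass measurement error in ppm is computed and compared against a symmetric window such as the ±10 ppm precursor tolerance quoted above. The function names and the m/z values are illustrative assumptions; this does not reproduce LIQUID's internal matching logic.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Signed mass measurement error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz, theoretical_mz, half_window_ppm=10.0):
    """True if the observed m/z lies within +/- half_window_ppm.

    A "20 ppm" tolerance in the text corresponds to half_window_ppm=10,
    and the 500 ppm CID fragment tolerance to half_window_ppm=250.
    """
    return abs(ppm_error(observed_mz, theoretical_mz)) <= half_window_ppm


if __name__ == "__main__":
    # Illustrative values only, not taken from the dataset.
    print(ppm_error(500.001, 500.0))
    print(within_tolerance(500.001, 500.0))
    print(within_tolerance(500.01, 500.0))
    print(within_tolerance(500.01, 500.0, half_window_ppm=250.0))
```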
All LC-MS/MS data were aligned and gap-filled to this target database for feature identification using the identified lipid name, observed m/z, and the retention time using MZmine 2 20 (see figshare 8 ('Parameters used for MZmine2 processing of lipidomics data')). Data from each ionization type were aligned and gap-filled separately. Aligned features were manually verified and peak apex intensity values were exported for statistical analysis. All subsequent batches were aligned to this library of confident lipid identifications.
To correct for batch retention time (RT) shifts for alignment to the reference library, an in-house tool to correct for linear RT shifts was used. The instrument files were converted into .mzXML files using MSConvert 21 . Each file was associated with a target list containing the name, RT, and m/z of the internal standards within a batch and was imported into MZmine. As the internal standards alone did not elute across the entire gradient, two lipids that were present in all samples in positive mode (carnitine(10:1) and CE(18:1)) and one in negative mode (HexCer(d18:1/24:0)) were included in the target list as they eluted near the start and end of the gradient. The peak alignment of each target was manually validated, and corrected if needed, and the RT of each target lipid was exported. These targets acted as anchor points for the RT correction. Using the RT anchors for each target, all instrument files within a batch were shifted and aligned to the reference and new .mzXML files were generated for subsequent alignment in MZmine. For all batches aligned to the reference list, lipid identifications were randomly selected (approximately 30 lipids) and verified using LIQUID to ensure that identification in the reference and sample batches matched.
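The in-house RT correction tool is not described in further detail, so the following Python sketch shows only one plausible way to apply a linear correction from anchor points: an ordinary least-squares fit of reference RT against observed RT, followed by applying the fitted line to any RT in the file. The function names and the anchor values are hypothetical.

```python
def fit_linear_rt_shift(observed_rts, reference_rts):
    """Least-squares fit of reference_rt = slope * observed_rt + intercept.

    Each pair is the RT of one anchor (internal standard or anchor
    lipid) in the file being corrected and in the reference library.
    """
    n = len(observed_rts)
    if n < 2 or n != len(reference_rts):
        raise ValueError("need at least two matched anchor points")
    mean_x = sum(observed_rts) / n
    mean_y = sum(reference_rts) / n
    sxx = sum((x - mean_x) ** 2 for x in observed_rts)
    if sxx == 0:
        raise ValueError("anchor RTs must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(observed_rts, reference_rts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def correct_rt(rt, slope, intercept):
    """Map an observed RT onto the reference RT scale."""
    return slope * rt + intercept


if __name__ == "__main__":
    # Hypothetical anchors (minutes): early, middle and late eluters.
    observed = [2.5, 12.5, 24.5]
    reference = [2.0, 12.0, 24.0]
    slope, intercept = fit_linear_rt_shift(observed, reference)
    print(slope, intercept, correct_rt(10.5, slope, intercept))
```

Anchors near the start and end of the gradient, as chosen in the text, keep the fitted line from being extrapolated far beyond the anchor range.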
Processing the data from the analyses of the reference population resulted in the identification of 462 plasma lipids across 6 lipid categories and 23 lipid subclasses (as categorized by LipidMaps) 22-24 , and 208 CSF lipids across 6 lipid categories and 17 lipid subclasses (Table 5).
Statistical analysis. We have previously described in detail the statistical approach used for processing the data 6 , and briefly summarize this below. To facilitate the identification of potentially disease-associated analyte profiles of UDN participants, a reference population of individuals with no known metabolic diseases was established as described above. Batches of samples from UDN participants were analysed and compared to this reference population as outlined in Webb-Robertson et al. 6 . Briefly, quality control (QC) processing of the reference dataset includes log2 transformation and the removal of any identified or unidentified features not present in at least 10% of the samples. Samples with missing or low abundance values and an uncorrelated pattern of expression by Pearson correlation and rMd-PAV 25 were assessed to determine whether the seemingly poor behaviour was most likely due to biological or to technical/sample preparation issues. If biological issues appeared to be the cause, the sample was retained in the current batch for further analysis; if technical issues appeared to be the cause, the sample was omitted from further analyses. QC processing for the participant samples included the same steps as for the reference samples; however, participant samples required stronger evidence before removal than reference samples.
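The first QC steps above (log2 transformation and removal of features observed in fewer than 10% of samples) can be illustrated with a short Python sketch. The authors performed these steps in R with pmartR; the function names here, and the choice to treat zero or absent abundances as missing, are assumptions made for illustration.

```python
import math


def log2_transform(abundances):
    """Log2-transform a {feature: [abundance per sample]} table.

    Missing (None) or non-positive abundances are kept as None; treating
    zeros as missing is an assumption of this sketch.
    """
    return {
        feature: [math.log2(v) if v is not None and v > 0 else None
                  for v in values]
        for feature, values in abundances.items()
    }


def filter_by_presence(table, min_fraction=0.10):
    """Keep features observed in at least `min_fraction` of samples."""
    kept = {}
    for feature, values in table.items():
        observed = sum(v is not None for v in values)
        if values and observed / len(values) >= min_fraction:
            kept[feature] = values
    return kept


if __name__ == "__main__":
    raw = {
        "alanine": [1024.0, 2048.0] + [None] * 8,  # seen in 2 of 10 samples
        "unknown_017": [None] * 10,                # never observed
    }
    print(filter_by_presence(log2_transform(raw)))
```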
Normalization of the reference data and the participant data was performed in two steps 6 . First, QC-RLSC accounted for batch effects, and was performed on a per-batch basis 11 . This required identical QC samples to be run in every batch of samples (for reference samples and UDN samples alike), as described above. QC-RLSC was implemented using the parameter values described previously 11 , namely a missingness threshold requiring the observation of a molecule in at least half of the QC samples, filtering of molecules with RSD above 30 percent, and possible polynomial degrees of first and second order. Second, to account for differences in the amount of sample material analysed by GC-MS or LC-MS, QC-RLSC was followed by global median centering of each sample, where each log2 biomolecule value within a sample was normalized via subtraction of the corresponding sample median (also on the log2 scale).
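Two of the simpler pieces of this normalization, the RSD filter applied to QC samples and the global median centering of each sample, can be sketched in Python as follows. The LOESS-based signal correction itself is not reproduced here. The function names are hypothetical, and computing RSD from the sample standard deviation is an assumption of this sketch.

```python
import statistics


def rsd_percent(qc_values):
    """Relative standard deviation (%) of one molecule across QC runs."""
    return statistics.stdev(qc_values) / statistics.mean(qc_values) * 100.0


def passes_rsd_filter(qc_values, max_rsd_percent=30.0):
    """True if the molecule's QC RSD does not exceed the threshold."""
    return rsd_percent(qc_values) <= max_rsd_percent


def median_center(sample_log2):
    """Subtract the sample median from every log2 value in one sample.

    `sample_log2` maps molecule name to log2 abundance; missing values
    (None) are ignored for the median and left as None.
    """
    observed = [v for v in sample_log2.values() if v is not None]
    sample_median = statistics.median(observed)
    return {name: (v - sample_median if v is not None else None)
            for name, v in sample_log2.items()}


if __name__ == "__main__":
    print(rsd_percent([90.0, 100.0, 110.0]))
    print(median_center({"a": 10.0, "b": 12.0, "c": 15.0, "d": None}))
```

Because the values are on the log2 scale, subtracting the median corresponds to dividing raw abundances by a per-sample scaling factor.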

To identify unique features in the analyte profiles of participants, results were compared to those from the reference dataset 6 . A univariate approach was applied that compared the feature values of the participants to the mean and standard deviation of the feature values in the reference dataset using z-scores 26 . An absolute value z-score threshold was used to obtain a list of metabolites and/or lipids with outlying z-scores that may have potential diagnostic significance. Additionally, for a given participant and biomolecule, log2 fold changes relative to the reference data were computed as the difference between the participant's log2 value and the median log2 value of the reference population.
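The z-score and log2 fold change defined above amount to a few lines of arithmetic, sketched here in Python for one participant against a reference population. The function names, the use of the sample standard deviation, and the threshold value of 2.0 are assumptions; the text does not state the threshold that was applied.

```python
import statistics


def z_score(value, reference_values):
    """(participant value - reference mean) / reference SD, on log2 data."""
    mean = statistics.mean(reference_values)
    sd = statistics.stdev(reference_values)
    return (value - mean) / sd


def log2_fold_change(value, reference_values):
    """Participant log2 value minus the reference median log2 value."""
    return value - statistics.median(reference_values)


def flag_outliers(participant, reference, threshold=2.0):
    """Return {molecule: z} for molecules with |z| >= threshold.

    `threshold` is a placeholder value, not the one used in the study.
    """
    flagged = {}
    for molecule, value in participant.items():
        z = z_score(value, reference[molecule])
        if abs(z) >= threshold:
            flagged[molecule] = z
    return flagged


if __name__ == "__main__":
    ref = {"x": [8.0, 9.0, 10.0, 11.0, 12.0],
           "y": [8.0, 9.0, 10.0, 11.0, 12.0]}
    proband = {"x": 14.0, "y": 10.5}
    print(flag_outliers(proband, ref))
    print(log2_fold_change(proband["x"], ref["x"]))
```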

Data Records
The raw LC-MS and GC-MS data files in .raw and .D format, and the converted files in .mzML and .CDF format, respectively, were deposited and are publicly available at the MassIVE repository (MSV000084717 27 , MSV000085506 28 , MSV000085508 29 ). The normalized values for all identified lipids and metabolites for the UDN individuals and reference population are also available in MassIVE. The evidence supporting the molecular identifications (e.g. fragment ion m/z, retention times) is provided (figshare 8 ('The evidence supporting the molecular identifications (e.g. fragment ion m/z, retention times)')). The deposited data also contain the post-processed data including the log2 fold change, Z-score, and p-value for each lipid and metabolite per UDN individual in .csv format. For the lipid results, the identifications made in positive and negative ionization mode were consolidated into a single file. Family member data are included in the associated proband files. In addition, for the UDN probands that have been diagnosed, the diagnosis name and relevant gene information are provided (figshare 8 ('UDN probands with available diagnoses')).
The data deposited to MassIVE contains up to three directories: peak, quant, and raw. Each biofluid data repository also contains automatically generated subdirectories prefixed with "ccms". Users of the data should obtain data from the peak, quant, and raw directories listed above and detailed below:

Technical Validation
To ensure unbiased data production, randomization orders were created and followed for sample extraction, GC-MS derivatization, and MS run orders. Family units, meaning probands and their relatives, were analysed within the same batch. Batch sizes were limited to the number of samples that could be analysed by GC-MS in approximately one day due to the stability of the chemically derivatized metabolites. Approximately 33 samples composed a batch. Samples were randomized based on sex, age, ethnicity (if provided), family association, and clinical site (if samples from more than 1 clinical site were available at the time of batching). Randomization orders were created as sufficient samples accumulated to make up a new batch over the 2.5 years of the study. To monitor data quality, the QC samples (across all molecules) were evaluated with prior batches collected to verify removal of batch effects via normalization. In addition, on a batch-by-batch basis, data quality was monitored by visual inspection of the log2 internal standard values across all samples within a batch.
To evaluate the consistency of the data collection process the coefficient of variation (CV) was utilized. The CV is defined as the standard deviation divided by the mean, and lower values signify lower variability. Using data from the reference population QC samples per platform and biofluid, the median CV along with the first and third quartile are shown in Table 6. The median CV of the lipid negative mode CSF is the greatest, possibly due to the lower number of samples available and lipids identified.
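A minimal Python sketch of this CV summary is given below: the CV is computed per molecule across QC injections on raw (untransformed) values, then summarized by minimum, quartiles and maximum. The function names are hypothetical, and the use of the sample standard deviation and of the "inclusive" quartile method are assumptions; the authors' R implementation may differ in both.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation divided by the mean, on raw abundances."""
    return statistics.stdev(values) / statistics.mean(values)


def summarize_cv(qc_table):
    """Summarize per-molecule CVs for one platform and biofluid.

    `qc_table` maps molecule name to its raw abundances across the QC
    injections (at least two molecules, two injections each).
    """
    cvs = sorted(coefficient_of_variation(v) for v in qc_table.values())
    q1, median, q3 = statistics.quantiles(cvs, n=4, method="inclusive")
    return {"min": cvs[0], "q1": q1, "median": median,
            "q3": q3, "max": cvs[-1]}


if __name__ == "__main__":
    # Illustrative abundances only, not taken from the dataset.
    qc = {
        "lipid_1": [90.0, 100.0, 110.0],
        "lipid_2": [80.0, 100.0, 120.0],
        "lipid_3": [70.0, 100.0, 130.0],
    }
    print(summarize_cv(qc))
```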
To evaluate the reproducibility of the results, we assessed the lipidomics results from one UDN proband from whom we had 9 samples collected over a period of 9 months. The samples were analysed in 3 different batches, separated by up to 1 year (Fig. 2). The proband's mother had 3 samples analysed in two batches at time intervals coinciding with the proband's samples. As shown in Fig. 2, the Z-score patterns of both the proband's and the mother's samples remain consistent between batches across the one-year timespan between the collection and analysis of the first and last set of samples (October 2016 to October 2017).

Code availability
Statistical processing and analyses were performed in R version 3.4.0. Quality control and median normalization were performed using the R package pmartR version 0.9.0, freely available on GitHub (https://github.com/pmartR/pmartR) 30 . Default parameter values for pmartR function calls were used. QC-RLSC and the calculation of log2 fold changes and z-scores were carried out using in-house R functions, which are available on GitHub (https://github.com/pmartR/qcrlsc).

Table 6. The coefficient of variation (CV) per platform and biofluid for the reference population QC samples calculated from raw values. Min = minimum CV, 1st Qu. = first quartile, 3rd Qu. = third quartile, Max. = maximum CV, POS = positive mode ionization, NEG = negative mode ionization.