Targeted metabolomics and medication classification data from participants in the ADNI1 cohort

Alzheimer’s disease (AD) is the most common neurodegenerative disease, presenting major health and economic challenges that continue to grow. Mechanisms of disease are poorly understood, but significant data point to metabolic defects that might contribute to disease pathogenesis. The Alzheimer’s Disease Metabolomics Consortium (ADMC), in partnership with the Alzheimer’s Disease Neuroimaging Initiative (ADNI), is creating a comprehensive biochemical database for AD. Using targeted and non-targeted metabolomics and lipidomics platforms, we are mapping metabolic pathway and network failures across the trajectory of disease. In this report we present quantitative metabolomics data generated on serum from 199 control, 356 mild cognitive impairment and 175 AD subjects enrolled in ADNI1 using the AbsoluteIDQ-p180 platform, along with the pipeline for data preprocessing and medication classification for confound correction. The dataset presented here is the first of eight metabolomics datasets being generated for broad biochemical investigation of the AD metabolome. We expect that these collective metabolomics datasets will provide valuable resources for researchers to identify novel molecular mechanisms contributing to AD pathogenesis and disease phenotypes.


Background & Summary
Alzheimer's disease is a degenerative brain disorder and the most common cause of dementia, presenting as the most common neurodegenerative disease in the United States 1,2. It is characterized by a decline in memory, language, problem-solving and other cognitive skills that affects a person's ability to perform everyday activities 3. Data suggest that pathophysiological changes associated with AD begin decades before the emergence of clinical symptoms 4,5. AD is becoming an increasing health burden in the United States and globally due to population aging 6. The disease is defined by the presence of tau neurofibrillary tangles and Aβ plaques, but coincident pathologies like Lewy body disease, vascular pathology and TDP-43 (transactive response DNA-binding protein 43) deposits are commonly found in AD patients. Current symptomatic therapeutic treatments have modest effects and do not modify the disease course. Researchers hope to develop therapies targeting specific genetic, molecular, and cellular mechanisms so that the actual underlying cause of the disease can be slowed or prevented, but currently our understanding of disease mechanisms remains limited. While the majority of AD clinical trials to date have focused on Aβ treatments, other therapeutic approaches are necessary. Understanding the biochemical trajectory of disease and the metabolic changes related to Aβ and tau pathology and cognitive decline is essential to advance our understanding of AD etiology as well as for developing novel approaches for drug development.
Recent advances in analytical chemistry led to the emergence of a new field called metabolomics. Metabolomics allows simultaneous measurement of hundreds to thousands of metabolites for mapping perturbations in interconnected pathways and in metabolic networks, enabling a systems approach to the study of AD 7,8. An emerging body of evidence supports the potential of metabolomics to provide added information for the prediction of AD and has identified a number of potentially important biochemical pathways in AD [9][10][11][12][13][14][15]. Though promising, in many cases the cohorts studied to this point are quite small with little replication, and therefore do not allow the effects of confounds such as medications to be addressed. Larger and more comprehensive cohorts are needed to provide the statistical power to detect and develop robust predictive metabolomics models, while also accounting for confounding effects related to medication intake, gender and aging. The Alzheimer's Disease Neuroimaging Initiative (ADNI) unites researchers around the globe to define the progression of Alzheimer's disease. ADNI researchers collect, validate and utilize data such as MRI and PET images, genetics, cognitive tests, CSF and blood biomarkers from thousands of subjects as predictors for the disease 16,17. The Alzheimer's Disease Metabolomics Consortium (ADMC) has the goal of building a comprehensive metabolomics database for AD in partnership with ADNI. This will be a national resource for the Alzheimer's community that enables interrogation of global metabolic changes within a pathway and network context and where metabolomics data can be used to complement and inform genomics and imaging data. Eight targeted and non-targeted metabolomics platforms, each with its own strengths and limitations, are being used to profile thousands of ADNI subjects.
For this dataset, we used a targeted, widely utilized and cross-validated metabolomics platform, AbsoluteIDQ®-p180 (Biocrates AG), to profile baseline serum samples from the ADNI1 cohort, where vast data exist on each patient including cognitive decline and imaging changes over many years, information on CSF markers, genetics and other omics data. The goal of this data generation is to aid in the discovery of metabolic failures correlated with disease and progression and of biomarkers for a range of important physiological processes in AD. We describe a foundation for automated curation of metabolomics data, including R scripts for removing analytes with poor precision or with significant missing values, and samples that have missing clinical data or are metabolic outliers. We address approaches within this cohort for dealing with confounds that impact metabolomics data and findings, including medications, that are broadly applicable to pharmacometabolomic investigations [18][19][20].

Methods
Alzheimer's disease neuroimaging initiative (ADNI) cohort
Clinical and demographic data used for this study were obtained through the ADNI data repository (http://adni.loni.ucla.edu/). Written informed consent was obtained for all participants and prior Institutional Review Board approval was obtained at each participating institution. All demographic information, neuropsychological and clinical assessment data, and diagnostic information used in this study are available from the ADNI clinical data repository (http://adni.loni.ucla.edu/). Information about ADNI can be found in Petersen et al. 16 and at http://www.adni-info.org/ 16. Key clinical and demographic variables for the ADNI1 participants as of May 2016 are summarized in Table 1 and available through Synapse (see Table 2). Note that ADNI data collection is ongoing, so variables on LONI may have been updated since that subset was downloaded. We have included this snapshot in order to enable analytic reproducibility despite a dynamic source of truth.
www.nature.com/sdata/ SCIENTIFIC DATA | 4:170140 | DOI: 10.1038/sdata.2017.140

Serum collection and sample management
Morning fasting blood samples from the baseline visit were included in the study (all but 69 were fasting). Samples were collected in two bar-coded 10 ml red-top plastic Vacutainer blood tubes, blood was allowed to clot for 30 min followed by a 15 min centrifugation at 3,000 rpm (1,500 rcf) as described in the ADNI standard operating procedures (www.adni-info.org). Then the serum was transferred into a bar-coded 13 ml polypropylene transfer tube and capped and allowed to freeze in dry ice. Samples were shipped overnight to the ADNI biomarker core laboratory at the University of Pennsylvania Medical Center. Samples were thawed once in the core facility and aliquoted to 0.5 ml samples, then subsequently aliquoted once more for individual laboratory analyses. A 20 μl sample aliquot was delivered to the Duke Proteomics and Metabolomics Shared Resource for analysis with the p180 platform.
Metabolomics analysis and QC using the AbsoluteIDQ p180 kit
Sample preparation. Samples were prepared and analyzed in the Duke Proteomics and Metabolomics Shared Resource using the AbsoluteIDQ p180 kit (Biocrates Life Sciences AG, Innsbruck, Austria) in accordance with the user manual. In brief, after the addition of 10 μl of the supplied internal standard solution to each well on a filterspot of the 96-well extraction plate, 10 μl of each serum sample, quality control (QC) sample, blank, zero sample, or calibration standard was added to the appropriate well (Fig. 1). The plate was then dried under a gentle stream of nitrogen. The samples were derivatized with phenyl isothiocyanate (PITC) for the amino acids and biogenic amines, and dried again. Sample extract elution was performed with 5 mM ammonium acetate in methanol. Sample extracts were diluted with either 40% methanol in water for the UPLC-MS/MS analysis (15:1) or kit running solvent (Biocrates Life Sciences AG) for flow injection analysis (FIA)-MS/MS (20:1).
Quality control samples. The analysis of the samples using the AbsoluteIDQ p180 kit was performed using four specific sets of quality controls. First, low/mid/high level QC samples provided by Biocrates Life Sciences AG were prepared and analyzed on each plate as recommended by the manufacturer. These QC samples were used for a technical validation of each kit plate. Second, to allow appropriate inter-plate abundance scaling based specifically on this cohort of samples, we generated a Study Pool QC (SPQC) by combining approximately 10 μl from the first 76 samples for analysis. This sample was frozen in aliquots of 25 μl, then prepared and analyzed twice on each plate. Third, there were 20 blinded analytical duplicates obtained from the same serum draw, scattered throughout the study in a manner blinded to the investigators until data were sent to the ADNI informatics core for unblinding. Fourth, the commonly used reference materials NIST SRM-1950 plasma (n = 3 per plate) and GoldenWest serum pool (n = 1 per plate) were also analyzed on each plate to allow cross-comparison against other sample cohorts in the future.
Figure 1a shows the preparation layout for the 96-well plates as utilized in this study. In total, eleven plates were prepared in order to analyze 831 serum samples. The blank, zero sample, calibration standards, and Low/Mid/High QC samples provided with the kit were arranged as recommended by Biocrates. In order to improve the ability to compare results with other metabolomics studies and reduce plate-to-plate batch effects, six additional wells were used for the additional QC samples as described above: two wells for the study pool QC (SPQC), one well for the GoldenWest Pooled Serum Standard 21, and three wells for the NIST SRM-1950 plasma. The remaining wells were used for cohort samples. The analysis order of each plate is summarized in Fig. 1b, and was arranged to maximize quantitative accuracy and precision within a plate and limit the potential for batch effects.
The analysis order included running the standard curve twice, once at the beginning and once at the end of the samples (LC-MS/MS only). For both LC-MS/MS and FIA-MS/MS analysis, the Biocrates QCs and GoldenWest Serum QC were prepared once but injected in technical triplicate: once before the samples, once in the middle (after 38 samples) and once at the end of the sample set. The SPQC samples (n = 2) were each analyzed once, with one analysis before and one after all samples on the plate. The NIST SRM-1950 plasma samples (n = 3) were also analyzed once each, at the beginning, middle, and end of the cohort samples. Bracketing the samples with the standard curves and nesting the analytical samples between the QCs offers the best chance of observing any system drift and assuring optimal instrument performance across the sample set.
Quantitative UPLC-MS/MS and FIA-MS/MS analysis. Mass spectrometry analysis was performed based on Standard Operating Procedures (SOP #8114) provided by Biocrates for the AbsoluteIDQ p180 kit. Chromatographic separation of amino acids and biogenic amines was performed using an ACQUITY UPLC System (Waters Corporation) with an ACQUITY 2.1 mm × 50 mm 1.7 μm BEH C18 column fitted with an ACQUITY BEH C18 1.7 μm VanGuard guard column. These analytes were quantified using a calibration curve of the ratio of analyte to internal standard versus standard concentration, fitted using a linear regression with 1/x weighting. All amino acids and biogenic amines utilize either a deuterated or a 13C stable-isotope labeled internal standard of the exact analyte or of a closely-eluting compound of similar class. Acylcarnitines, sphingolipids, and glycerophospholipids were analyzed by flow injection analysis tandem mass spectrometry (FIA-MS/MS) and quantified by internal standard calibration; eight separate internal standards are used to quantify the various acylcarnitines, while a single internal standard is used for each of the other lipid classes. Thus, FIA-MS/MS analytes are reported as semi-quantitative values except where a stable-isotope labeled internal standard of that exact analyte was used. Samples for both UPLC and FIA were analyzed using a Xevo TQ-S mass spectrometer (Waters Corporation) with positive electrospray ionization, operating in the Multiple Reaction Monitoring (MRM) mode. MRM transitions (compound-specific precursor to product ion transitions) for each analyte and internal standard were collected over a scheduled retention time window using tune files and acquisition methods provided in the AbsoluteIDQ p180 kit. The UPLC data were imported into TargetLynx (Waters Corporation) for peak integration, calibration and concentration calculations. The UPLC data from TargetLynx and the FIA data were analyzed using Biocrates' MetIDQ v5.4.8 software.
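As an illustration of the calibration step, the following minimal sketch fits a straight line to calibrator data by weighted least squares with 1/x weights and back-calculates a sample concentration. It assumes a simple model of the form ratio = slope × concentration + intercept; the function names and the calibrator values are hypothetical and are not taken from TargetLynx or MetIDQ.

```python
def fit_weighted_line(conc, ratio):
    """Fit ratio = slope * conc + intercept by least squares with 1/x weights."""
    weights = [1.0 / c for c in conc]  # 1/x weighting; requires conc > 0
    total_w = sum(weights)
    # Weighted means of x (concentration) and y (analyte / internal standard ratio).
    x_bar = sum(w * x for w, x in zip(weights, conc)) / total_w
    y_bar = sum(w * y for w, y in zip(weights, ratio)) / total_w
    s_xx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, conc))
    s_xy = sum(w * (x - x_bar) * (y - y_bar)
               for w, x, y in zip(weights, conc, ratio))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def back_calculate(sample_ratio, slope, intercept):
    """Convert a measured analyte/internal-standard ratio into a concentration."""
    return (sample_ratio - intercept) / slope


# Hypothetical calibrators: concentrations and peak-area ratios.
calibrator_conc = [1.0, 5.0, 10.0, 50.0, 100.0]
calibrator_ratio = [0.03, 0.11, 0.21, 1.01, 2.01]
slope, intercept = fit_weighted_line(calibrator_conc, calibrator_ratio)
print(back_calculate(0.41, slope, intercept))
```

The 1/x weights give low-concentration calibrators more influence on the fit than they would have in an unweighted regression, which is the usual motivation for this weighting when the calibration range spans several orders of magnitude.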
The kit data are reported in detail in the Supplementary Information on LONI, along with a color-coded key denoting samples that were below the limit of detection (<LOD), below the lowest calibration standard (<LLOQ), or quantified based on a ratio to a class-based internal standard (semi-quantitative). The data generated for the study samples,

Data processing
Statistical preprocessing was performed using the open-source statistical software R v3.2.4 (www.r-project.org), with scripts available for download at http://dx.doi.org/10.7303/syn7354353. The processing included the steps briefly described herein, which are depicted graphically as a flowchart in Fig. 2.
In the first step we excluded four samples due to erroneous inclusion in the cohort (thawed during shipment), and the values of each analyte were scaled across the different plates using the Study Pool QC (SPQC) duplicates analyzed twice on each plate. Given SPQC duplicates, the correction factor for each analyte in a specific plate was obtained by dividing its global average by its average within the plate to adjust for the batch effects, yielding the Intermediate Data Level 2.
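The plate scaling described above can be sketched as follows for a single analyte. This is an illustrative reimplementation and not the published R script; it assumes the global average is taken over all SPQC measurements pooled across plates, and the plate labels and values are hypothetical.

```python
from statistics import mean


def spqc_correction_factors(spqc_by_plate):
    """Per-plate factor = global SPQC average / within-plate SPQC average."""
    all_values = [v for values in spqc_by_plate.values() for v in values]
    global_avg = mean(all_values)
    return {plate: global_avg / mean(values)
            for plate, values in spqc_by_plate.items()}


def scale_samples(samples_by_plate, factors):
    """Multiply every sample value on a plate by that plate's factor."""
    return {plate: [v * factors[plate] for v in values]
            for plate, values in samples_by_plate.items()}


# Hypothetical SPQC duplicates and cohort samples for one analyte on two plates.
spqc = {"plate01": [10.0, 10.0], "plate02": [20.0, 20.0]}
samples = {"plate01": [8.0, 9.0], "plate02": [16.0, 18.0]}
factors = spqc_correction_factors(spqc)
print(scale_samples(samples, factors))
```

In this toy example the second plate reads twice as high as the first, so after scaling the corresponding samples on the two plates agree.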
The second set of steps includes filtering of analytes based on quality metrics. We applied two criteria to filter individual metabolites as part of our data quality control (QC) evaluation process based on the 20 blinded ADNI duplicates: 1) a coefficient of variation (CV) <20% across plates, and 2) an intraclass correlation coefficient (ICC) >0.65. The ICC compared the two measurements for each of the blinded duplicates. Additionally, analytes with >40% of measurements below the lower limit of detection (<LOD) were excluded from the analysis. Combined, these three steps allow only the most robust analytes from the panel through to the Level 3 data matrix, and reduced the total number of analytes.
The steps between Level 3 and Level 4 in the pipeline (Fig. 2) perform missing value replacement and allow for exclusion of samples due to other missing data that may vary from study to study. Remaining values reported as '<LOD' were imputed using the LOD/2 value for each specific analyte. Also, 73 samples were determined to be pre-analytical outliers for one or more of the following reasons, and were flagged and removed from the dataset. These included a total of 69 samples identified as nonfasting, 2 samples lacking corresponding body mass index (BMI) values, and 1 for which no baseline medication record was reported. After these steps, the Intermediate Data-Level 4 contained n = 754 samples (734 subjects) and n = 138 analytes.
The last steps in the statistical pretreatment pipeline serve to combine replicate measurements to give one value per biological sample, filter out any statistical outlier subjects, and perform log-transformation if necessary. To obtain a single value for the 20 subjects with blinded duplicates, we calculated the average of the duplicates for each subject and reported this single value.
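The analyte filters and the LOD/2 imputation can be sketched as below. The text does not specify how the CV or the ICC was computed, so this sketch makes two labeled assumptions: the CV is the mean within-pair CV of the blinded duplicates, and the ICC is the one-way random-effects form for two measurements per subject. Values below the LOD are represented as None, and all names and numbers are hypothetical.

```python
from statistics import mean, stdev


def duplicate_cv_percent(pairs):
    """Assumed CV definition: mean within-pair CV (%) over duplicate pairs."""
    return 100.0 * mean(stdev(pair) / mean(pair) for pair in pairs)


def icc_oneway(pairs):
    """Assumed ICC form: one-way random effects, k = 2 measurements each."""
    n, k = len(pairs), 2
    grand = mean(v for pair in pairs for v in pair)
    ms_between = k * sum((mean(pair) - grand) ** 2 for pair in pairs) / (n - 1)
    ms_within = sum((v - mean(pair)) ** 2
                    for pair in pairs for v in pair) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def passes_qc(pairs, values, max_cv=20.0, min_icc=0.65, max_below_lod=0.40):
    """Keep an analyte only if CV < 20%, ICC > 0.65 and <= 40% are below LOD."""
    fraction_below = sum(v is None for v in values) / len(values)
    return (duplicate_cv_percent(pairs) < max_cv
            and icc_oneway(pairs) > min_icc
            and fraction_below <= max_below_lod)


def impute_below_lod(values, lod):
    """Replace below-LOD entries (None) with LOD/2 for this analyte."""
    return [lod / 2 if v is None else v for v in values]


# Hypothetical duplicate pairs and cohort values for one analyte.
pairs = [(10.0, 10.2), (20.0, 19.8), (30.0, 30.3), (40.0, 39.6)]
values = [1.0, None, 2.0, 3.0, 4.0]
if passes_qc(pairs, values):
    print(impute_below_lod(values, lod=0.5))
```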
We checked for the presence of outlier subjects by performing principal components analysis, and evaluating the subject distance from the centroid in the K-dimensional space based on principal components that explained >90% cumulative variance. Subjects located more than 7 s.d. from the mean were flagged as outliers. This procedure identified two additional samples that were excluded from the final data matrix. Finally, log2 transformation was performed for those analytes with a D'Agostino test P-value <0.05 and skewness >2. The final preprocessed data matrix (Final Data Matrix; Fig. 2, Level 5) contained data for 732 subjects and 138 analytes.
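The outlier rule can be sketched as follows, under stated assumptions. The sketch takes as input principal component scores that have already been computed (one row per subject, one column for each of the K retained components), and it interprets "more than 7 s.d. from the mean" as a Euclidean distance from the centroid that exceeds the mean distance by more than 7 standard deviations. The text does not spell out this detail, so the interpretation, the function name and the data are illustrative only.

```python
from math import sqrt
from statistics import mean, stdev


def flag_outliers(scores, n_sd=7.0):
    """Return indices of subjects whose centroid distance exceeds mean + n_sd * SD."""
    k = len(scores[0])
    # Centroid of all subjects in the K-dimensional score space.
    centroid = [mean(row[j] for row in scores) for j in range(k)]
    distances = [sqrt(sum((row[j] - centroid[j]) ** 2 for j in range(k)))
                 for row in scores]
    cutoff = mean(distances) + n_sd * stdev(distances)
    return [i for i, d in enumerate(distances) if d > cutoff]


# Hypothetical scores: 100 subjects on a small grid plus one extreme subject.
scores = [[(i % 10) * 0.1, (i // 10) * 0.1] for i in range(100)]
scores.append([1000.0, 1000.0])
print(flag_outliers(scores))
```

A cutoff this strict can only be exceeded in reasonably large cohorts, because a single extreme subject also inflates the standard deviation it is compared against.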
It should be noted that the statistical curation process described above is very stringent, leaving behind only the most robust analytes but in the process potentially excluding some good measurements. One weakness in our pipeline is that by filtering on intraclass correlation coefficient (ICC) it is possible that we filtered out some robust measurements which simply had very narrow biological measurement range over the blinded samples, precluding the observation of a correlational trend. Examples of this potentially include histamine (17.2% missing values, 5.4% CV, 0.09 ICC) and methionine (0% missing values, 8.1% CV, 0.65 ICC), which almost certainly would have been left in the dataset using many commonly-used and less stringent filtering criteria. To address this shortcoming, future studies are being designed with at least three measurements of each blinded replicate sample and triplicate preps of the SPQC on each plate instead of the NIST SRM-1950, to allow more robust filtering based on imprecision across a wider dynamic range. Metabolites excluded from the analysis suggested here may also be recovered when reanalyzing data in the context of comparative data from other cohorts.

Collection and curation of medication data
Many classes of medications have been shown to affect metabolism and change levels of certain metabolites [18][19][20]24 . It is thus necessary to take drug information into account as a potential confounder for metabolomics analysis. In order to convert free text medication information into computable drug data, we applied a pipeline described previously 25 . In brief, as shown in Fig. 3, we employed the National Library of Medicine's (NLM) RxNorm API (application programming interface) to match drug names extracted from patient medication information containing lexical variations and misspellings to standardized drug concept identifiers. Corresponding concept identifiers are returned along with confidence scores. Low scoring terms were reviewed manually and adjusted as appropriate.
We mapped all versions of a drug, whether brand name or generic, to its respective ingredients, then mapped those ingredients to corresponding drug classes. A subset of drug categories was selected from 3 standardized drug classification systems (NDF-RT, ATC, and MeSH) based on input from experts in Alzheimer's disease and metabolomics (see Table 3 (available online only)). The criteria for inclusion were that a class of drugs was known to impact metabolomics pathways and/or was likely to be taken by a cognitively impaired population. In this way, each patient was assigned a Boolean flag for whether or not he or she was taking any drug in each respective class. These binary variables can then be used to address potential confounding in subsequent association analyses. The code for this pipeline, including R scripts and API configuration files, is available in Synapse: http://dx.doi.org/10.7303/syn7477310. The final table showing which ADNI1 participants were taking which classes of drugs at their first visit is available at http://dx.doi.org/10.7303/syn7440367.1.
Important note. The method described here is the first in a series of iterative approaches to tackle the complex challenge of medications as confounding variables in metabolomic profiling. Medication terminology and software tools continue to evolve. We recommend that those performing future analyses related to medication effects revisit this site and the ADNI data repository for updated curation of medication data.
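The class assignment described in this section can be sketched as below. In the actual pipeline the drug-to-ingredient and ingredient-to-class mappings come from RxNorm and the selected classification systems; here they are replaced by small hard-coded lookup tables, and the patient identifiers, class labels and function name are hypothetical.

```python
def class_flags(patient_drugs, drug_to_ingredients, ingredient_to_classes,
                selected_classes):
    """Return {patient: {drug_class: bool}} for the selected drug classes."""
    flags = {}
    for patient, drugs in patient_drugs.items():
        classes_taken = set()
        for drug in drugs:
            # Brand and generic names both resolve to ingredients first.
            for ingredient in drug_to_ingredients.get(drug, ()):
                classes_taken.update(ingredient_to_classes.get(ingredient, ()))
        flags[patient] = {c: (c in classes_taken) for c in selected_classes}
    return flags


# Hypothetical, hand-written lookup tables standing in for the curated mappings.
drug_to_ingredients = {
    "lipitor": ["atorvastatin"],
    "atorvastatin": ["atorvastatin"],
    "aricept": ["donepezil"],
}
ingredient_to_classes = {
    "atorvastatin": ["statins"],
    "donepezil": ["cholinesterase inhibitors"],
}
patient_drugs = {"P001": ["lipitor"], "P002": ["aricept", "atorvastatin"], "P003": []}
selected = ["statins", "cholinesterase inhibitors"]
print(class_flags(patient_drugs, drug_to_ingredients, ingredient_to_classes, selected))
```

Each patient ends up with one Boolean per selected class, which is the shape needed to include medication use as a covariate in an association model.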

Data Records
The primary access site for this dataset is through Sage Bionetworks' Synapse platform (Data Citation 1). ADNI's data use agreement prohibits redistribution of ADNI data outside of LONI, so actual data files are hosted by the University of Southern California's Laboratory of Neuroimaging (LONI). The scripts used for data processing and medication mapping, however, reside in the Synapse platform. Core data files along with associated metadata files, scripts, and Supplementary Files are listed in Table 2. Note that ADNI requires registration to access the data. Researchers may apply for data access at https://ida.loni.usc.edu/collaboration/access/appLicense.jsp.
R scripts for data processing can be found at http://dx.doi.org/10.7303/syn7354353. Input files (Fig. 2, Level 0) are found in (Data Citation 2) and (Data Citation 3) for FIA and UPLC respectively. Processed
Also included in Supplementary Information on LONI (http://adni.loni.usc.edu/data-samples/access-data/) are the original Excel format exports from the MetIDQ software (Biocrates, Inc). These files include information on calibration ranges, limits of quantification and detection for the assays, and QC sample measurements. They use the original blinded identifiers.
Similarly, the code for the medication mapping pipeline and a link to the current medications file on LONI are available through (Data Citation 7). The medications file must be downloaded and placed in the same directory as the medication mapping R scripts in order to reproduce the medication mapping workflow. The output table showing which patients were taking which classes of drugs is available at (Data Citation 8). Note that, as described in the readme file that accompanies these scripts, manual intervention was used at three points in the pipeline: running RxMix using two different configuration files, and expert review of medication name mapping to accept, reject, or correct results for low-scoring matches.

Technical Validation
The AbsoluteIDQ® p180 kit has been fully validated according to the European Medicines Agency Guideline on bioanalytical method validation, and this kit has been utilized in over 200 peer-reviewed publications including a number in dementia and AD 9,13,26. A recent ring trial showed that inter-laboratory precision was <20% for 82% of the analytes measured with the kit, and 83% of the analytes were accurate within <20% 27. Additionally, each analyzed kit plate includes an automated technical validation to approve the validity of the run and to provide verification of the actual performance of the applied quantitative procedure including instrumental analysis. Interplate technical validation of each analyzed kit plate was performed using MetIDQ software based on results obtained and defined acceptance criteria for blank, zero samples, calibration standards and curves, low/medium/high level QC samples and measured signal intensity of internal standards over the plate. Technical validation for the Xevo TQ-S was performed according to the following criteria. For the blank samples, the signal intensity for all metabolites and internal standards had to be smaller than a defined minimum value. Signal intensities obtained for the zero sample were used for the calculation of the plate-specific limit of detection (LOD) for FIA-MS/MS analysis. The LOD value was defined as the concentration that corresponded to three times the level of the blank sample. A specific standard measurement (calibrator) was considered valid when the calculated concentration was within ±30% of the target concentration. For a specific analyte, a minimum of 75% of all calibration standards had to be valid. Biocrates-provided QC samples were a human plasma pool spiked with analytes at known concentrations at three different levels. The valid measured concentration range was set for each analyte separately and was within ±45% of a target concentration.
For a specific analyte a minimum of 67% of all QCs had to be valid as well as a minimum of 50% of all QCs of a certain level had to be valid. Additionally signal intensity for internal standards had to be within valid minimum and maximum intensity value defined by kit manufacturer. All measured plates fulfilled the above described criteria hence confirming the quality and accuracy of the obtained quantitative metabolomics data according to manufacturer recommendations.
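The acceptance criteria above can be expressed as a short check for one analyte on one plate. This is a sketch of the stated rules only and not the MetIDQ implementation; the function names, the data layout and the example concentrations are hypothetical, and the LOD helper assumes the blank (zero sample) level is already expressed as a concentration.

```python
def within_tolerance(measured, target, tol):
    """True if measured lies within +/- tol (as a fraction) of target."""
    return abs(measured - target) <= tol * target


def calibrators_valid(measured, targets, tol=0.30, min_fraction=0.75):
    """At least 75% of calibrators must fall within +/-30% of target."""
    valid = sum(within_tolerance(m, t, tol) for m, t in zip(measured, targets))
    return valid / len(targets) >= min_fraction


def qcs_valid(qc_by_level, targets, tol=0.45, min_overall=0.67, min_per_level=0.50):
    """At least 67% of all QCs and 50% of the QCs at each level must be valid."""
    total = valid_total = 0
    for level, measurements in qc_by_level.items():
        valid = sum(within_tolerance(m, targets[level], tol) for m in measurements)
        if valid / len(measurements) < min_per_level:
            return False
        total += len(measurements)
        valid_total += valid
    return valid_total / total >= min_overall


def lod_from_blank(blank_level, factor=3.0):
    """LOD defined as three times the level of the blank (zero) sample."""
    return factor * blank_level


# Hypothetical QC injections (three per level) and their target concentrations.
qc = {"low": [10.0, 11.0, 20.0], "mid": [50.0, 52.0, 49.0], "high": [100.0, 95.0, 110.0]}
qc_targets = {"low": 10.0, "mid": 50.0, "high": 100.0}
print(qcs_valid(qc, qc_targets), lod_from_blank(0.5))
```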
To ensure the quality and reproducibility of the quality control and analysis performed prior to data release, independent analysts completed the computational workflow, and achieved reproducibility out to three significant digits across all calculations.

Usage Notes
Details on how to apply for data access and usage rules can be found at the ADNI website: http://adni.loni.usc.edu/data-samples/access-data/. In brief, users agree to keep the data secure and not to attempt to re-identify research participants. Users also agree to acknowledge ADNI and the ADMC in any derivative publications as follows:
1. On the by-line of the manuscript, after the named authors, include the phrase 'for the Alzheimer's Disease Neuroimaging Initiative*' with the asterisk referring to the following statement and list of names: *Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf
2. On the by-line of the manuscript, after the named authors, include the phrase 'for the Alzheimer's Disease Metabolomics Consortium**' with the double asterisk referring to the following statement and list of names: **Data used in preparation of this article were generated by the Alzheimer's Disease Metabolomics Consortium (ADMC). As such, the investigators within the ADMC provided data but did not participate in analysis or writing of this report. A complete listing of ADMC investigators can be found at: https://sites.duke.edu/adnimetab/admc-team-directory/
3. The results published here are in whole or in part based on data obtained from the AMPAD Knowledge Portal accessed at doi: 10
Access to scripts and other files described herein that are available through the AMPAD Knowledge Portal (https://www.synapse.org/#!Synapse:syn2580853/wiki/409840), hosted on the Sage Bionetworks Synapse informatics data sharing platform, requires adherence to the terms of use described at http://docs.synapse.org/articles/governance.html. Users are required to sign an oath (http://docs.synapse.org/assets/other/oath.html) stating they will not re-identify participants, redistribute the data, or use the data for advertising, and that they will keep data secure, protect privacy, support open access, report any breaches, credit participants, and follow privacy laws. Because they are managed by different entities, users must register for separate user accounts for LONI (where data are stored) and for the AMPAD Knowledge Portal on Synapse (https://www.synapse.org/#!Synapse:syn2580853/wiki/409840; repository for scripts and additional information).