Circulating amino acids and amino acid-related metabolites and risk of breast cancer among predominantly premenopausal women

Known modifiable risk factors account for a small fraction of premenopausal breast cancers. We investigated associations between pre-diagnostic circulating amino acid and amino acid-related metabolites (N = 207) and risk of breast cancer among predominantly premenopausal women of the Nurses’ Health Study II using conditional logistic regression (1057 cases, 1057 controls) and multivariable analyses evaluating all metabolites jointly. Eleven metabolites were associated with breast cancer risk (q-value < 0.2). Seven metabolites remained associated after adjustment for established risk factors (p-value < 0.05) and were selected by at least one multivariable modeling approach: higher levels of 2-aminohippuric acid, kynurenic acid, piperine (all three with q-value < 0.2), DMGV and phenylacetylglutamine were associated with lower breast cancer risk (e.g., piperine: ORadjusted (95%CI) = 0.84 (0.77–0.92)) while higher levels of creatine and C40:7 phosphatidylethanolamine (PE) plasmalogen were associated with increased breast cancer risk (e.g., C40:7 PE plasmalogen: ORadjusted (95%CI) = 1.11 (1.01–1.22)). Five amino acids and amino acid-related metabolites (2-aminohippuric acid, DMGV, kynurenic acid, phenylacetylglutamine, and piperine) were inversely associated, while one amino acid and a phospholipid (creatine and C40:7 PE plasmalogen) were positively associated with breast cancer risk among predominately premenopausal women, independent of established breast cancer risk factors.


INTRODUCTION
Breast cancer is the most common malignancy among women in the United States, with more than 250,000 cases diagnosed each year 1 . Known modifiable risk factors are estimated to account for only around one-third of postmenopausal breast cancers [2][3][4] , and an even smaller fraction of premenopausal cancers 2,5 . Thus, new strategies are needed for the identification of modifiable risk factors, especially for premenopausal breast cancers.
Metabolites are small molecules that are produced and consumed by cellular metabolism. The study of the complete collection of metabolites, called metabolomics, provides a direct signature of cellular activity in the body and has emerged as a powerful tool for the diagnosis, characterization, and prediction of disease. Metabolomic methods have uncovered biomarkers for a wide variety of cancers including colorectal, gastric, pancreatic, liver, ovarian, breast, urinary, esophageal, and lung 6 . In breast cancer, metabolomics has proven useful for tumor biology characterization, predicting treatment response, anticipating recurrence, and estimating prognosis 7 .
In this study, we assessed the association of over 200 prospectively measured circulating amino acid and amino acidrelated metabolites with risk of breast cancer among the predominantly premenopausal women (1057 cases and 1057 matched controls) of the Nurses' Health Study II (NHSII).

RESULTS
Study population 1057 cases and 1057 matched controls were included in this study (Table 1). Women were an average 53 years old and predominantly premenopausal (80%) at the time of blood collection. At diagnosis, 42% of the women were premenopausal and 46% were postmenopausal. The mean time between blood collection and diagnosis was 8 years (SD = 4.4), ranging from 10 months to 17.4 years. (1st quartile: 4.25 years; 3rd quartile: 11.6 years).

Multivariable models of the joint association of all metabolites
The inverse association of piperine with risk of breast cancer met the threshold for statistical significance (p-value < 0.05 and q-value < 0.20) in all three simple (without adjustment for risk factors) multivariable models, Lasso, Elastic Net, and Random Forests (Tables 2 and 3). In addition, DMGV and N2,N2-dimethylguanosine were detected in the Random Forests model with minimal adjustment, satisfying a q-value threshold of 0.05. Higher levels of C40:7 PE plasmalogen and creatine were associated with increased breast cancer risk in CLR Lasso models (q-value < 0.20). These associations remained significant after further adjustment for risk factors (nominal p < 0.05) with the exception of N2,N2dimethylguanosine in Random Forests (p = 0.05) and piperine in Elastic Net (nominal p-value < 0.05, q-value > 0.2).
When all eleven identified metabolites were assessed together in an adjusted lasso CLR model, ten metabolites remained independently associated with risk of breast cancer. The direction of association in the lasso CLR model was consistent with the previous CLR and multivariable models except for N2,N2dimethylguanosine who's coefficient was estimated to be equal to zero, reflecting its high correlation with kynurenic acid and 2aminohippuric acid (Fig. 2).

DISCUSSION
We conducted a large-scale study of 207 circulating amino acid and amino acid-related metabolites, and risk of breast cancer in a nested case-control study (1057 cases and 1057 matched controls) within NHSII, a cohort of predominantly premenopausal women. Higher levels of six metabolites, 2-aminohippuric acid, DMGV, kynurenic acid, N2, N2-dimethylguanosine, phenylacetyl glutamine, and piperine, were associated with lower breast cancer risk while higher levels of asparagine, creatine, and three lipids, C20:1 LPC, C34:3 PC plasmalogen, C40:7 PE plasmalogen, were associated with increased breast cancer risk. Inverse associations between 2-aminohippuric acid, DMGV, kynurenic acid, phenylacetyl glutamine and piperine, and the positive associations with creatine and C40:7 PE plasmalogen remained statistically significant after adjusting for established risk factors and were selected by at least one multivariable modeling approach (Lasso, Elastic Net, Random Forests). Notably, associations between 2aminohippuric acid, piperine, and kynurenic acid remained significant even after multiple testing correction. None of the metabolites showed effect modification by BMI, except DMGV. None of the metabolites showed heterogeneity by ER status, except asparagine. Piperine is a polyphenol responsible for the pungency of black and long pepper and exhibits a wide range of properties: antidiabetic, anti-inflammatory, immunomodulatory, reduction of insulin resistance, and enhanced drug bioavailability [24][25][26] . Piperine also inhibits tumorigenesis, tumor angiogenesis, cancer cell proliferation, cancer cell migration and invasion, and enhances apoptosis and autophagy 27 . Experimental and cell line studies identified anti-breast cancer specific mechanisms of action, including decreased matrix metalloproteinase 9 (MMP-9) and MMP-13 expression, induced apoptosis through activation of caspase-3 and inhibition of human epidermal growth factor receptor 2 (HER2) gene expression 28 . Synergetic effects of piperine and chemotherapy drugs (paclitaxel, doxorubicin), hormone therapy drugs (tamoxifen), radiotherapy, TRAIL-based and nanodelivery-based therapy drugs (paclitaxel, rapamycine) were observed 28 . Notably, piperine inhibited growth and motility 29 , and enhanced efficacy of TRAIL-based therapy 30 in triple-negative breast cancer cells, the most aggressive breast cancer subtype. In a previous study of the associations of diet-related metabolites and breast cancer risk within the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer screening trial (n = 1242, 621 cases), piperine was found to be modestly correlated with liquor consumption (correlation = 0.16) and similar to our study, inversely associated with breast cancer risk (OR comparing 90th versus 10th percentile = 0.74 (0.56-0.99, p = 0.045)), after adjusting for BMI and other potential confounders 23 . Similarly, piperine was associated with lower risk of breast cancer (OR per 1 SD increase = 0.94 (0.89-0.99), p = 0.01) in the SU.VI.MAX cohort (n = 400, 200 cases) 20 . Notably, PLCO women were postmenopausal while the SU.VI.MAX cohort included premenopausal and postmenopausal women at the time of blood collection suggesting that the association between piperine and breast cancer may be independent of menopausal status. Odds ratios and 95% confidence intervals (CI) per 1 SD increase for metabolites significantly associated with risk of overall breast cancer (p-value < 0.05 and q-value < 0.2) in Nurses' Health Study II, among premenopausal women only, by BMI category (<25, ≥25)***, and by ER status (ER positive, ER negative)****. Creatine, although not selected by the conditional logistic regression (p-value < 0.05, q-value > 0.2), is shown here for completeness, as is was selected by the multivariable models. *Simple model: adjusts for matching factors including menopause status at blood draw, time of blood draw, date/season of blood draw, luteal day at blood draw, fasting status at blood draw, menopausal status at diagnosis and race. **Adjusted model: in addition to matching factors, this model adjusts for BMI at age 18, weight change between age 18 and time of blood draw, age at menarche, parity and age at first birth, family history of breast cancer, personal history of benign breast disease, physical activity, alcohol consumption, exogenous hormone use, and breast feeding history. ***All p-interaction with BMI category were >0.07 except for DMGV p-interaction = 0.04 (simple and adjusted model). ****All p-heterogeneity by ER status were >0.13 except for asparagine p-heterogeneity = 0.02/0.03 (simple/adjusted model). Table 2. All metabolites that met q-value < 0.2 (marked as √ in the table) in at least one primary model analysis that adjusts for matching factors (*marks p-value < 0.05 a ).

Metabolites
Logistic  Dimethylguanadino valeric acid (DMGV), an organic keto acid, is the product of transamination of asymmetric dimethylarginine (ADMA), which inhibits nitric oxide signaling that is crucial to endothelial function-excess ADMA is associated with increased risk of cardiovascular disease 31,32 . Plasma DMGV is positively associated with incident coronary artery disease, cardiovascular mortality, nonalcoholic fatty acid liver disease, and type II diabetes 33,34 . Circulating DMGV is directly correlated with resistance to the metabolic benefits of exercise 35 . Consumption of vegetables and red wine are associated with lower circulating DMGV, while sugar-sweetened beverage consumption is associated with higher circulating DMGV 34,36 . DMGV was associated with lower breast cancer risk in our study. The association between DMGV and breast cancer risk has not been previously assessed in prospective cohort studies. However, this metabolite was correlated with liver fat in the offspring cohort of the Framingham Heart Study (β = 0.02, 95% CI: 0.018-0.022, p < 10 −23 ) 33 . In addition, in the same study, baseline DMGV levels were associated with higher risk of type 2 diabetes, with replication of this association in the Malmo Diet and Cancer study and the Jackson Heart Study 33 . Large prospective studies are required to validate the association between DMGV and breast cancer risk. If replicated, experimental studies will be needed to understand the complex relationship between DMGV, diet, physical activity, CVD, type II diabetes, and breast cancer. Of note is that this analysis was in predominantly premenopausal women, among whom adiposity is also inversely associated with risk of breast cancer for reasons that are still not fully understood 37 .
Although the measurement platform used here was optimized to measure amino acids and related metabolites, not lipids, our analysis included a small number of lipids and identified a few significant associations. Plasmalogens are a subclass of phospholipids (components of the cell membrane and involved in cell signaling and cell cycle regulation 38 ) that constitute 15-20% of all phospholipids in cell membranes 39 . Plasmalogens have head groups that are usually either phosphatidylcholine (PC plasmalogens) or ethanolamine (PE plasmalogens), and are characterized by an ether bond to an alkenyl group in the sn-1 position while the sn-2 position is usually occupied by polyunsaturated fatty acids 39 . Certain cancers exhibit altered plasmalogen levels: circulating plasmalogens are depressed in pancreatic cancer patients 40 and increased in gastric carcinoma patients 41 compared to healthy controls. Our study identified two plasmalogens, C34:3 PC plasmalogen and C40:7 PE plasmalogen, associated with increased risk of breast cancer. The fatty acid component of specific lipids may also reflect dietary or metabolic processes; notably C40:7 PE plasmalogen is highly unsaturated but the position of the double bonds cannot be determined in this metabolomics assay. Among 74 women with breast cancer, the levels of the majority of measured phospholipids (LPC, LPE, PC, and PE) were higher in tumor tissue when compared to normal breast tissue samples 38 . Similar trends were observed in another study comparing tumor to normal tissue in 257 participants with breast cancer, and these lipids were correlated with cancer progression and patient survival 42 . Contrary to our findings, in the European Prospective Investigation into Cancer (EPIC) cohort, PC plasmalogens, including C34:3 PC plasmalogen, were inversely associated with risk of breast cancer 18,19 . However, neither study stratified their results by menopausal status, thus making a direct comparison difficult. Additional studies are needed to evaluate how plasmalogens are associated with risk of breast cancer and if this relationship is modulated by menopausal status.
LPCs are derived from phosphatidylcholines after hydrolysis of one of the fatty acid groups. In the liver, LPCs upregulate genes involved in cholesterol biosynthesis, while circulating LPCs activate many inflammatory and oxidative stress signaling pathways, and are associated with inflammatory diseases such as atherosclerosis and multiple sclerosis 43 . Circulating LPCs have mixed associations with certain cancers. For instance, circulating LPCs are elevated in ovarian cancer patients but depressed in leukemia patients relative to healthy controls 44 ; LPCs showed inverse associations with risk of endometrioid and clear cell ovarian tumors, with stronger inverse associations among premenopausal women, in NHS and NHSII 15,16 . While most LPCs measured in EPIC were inversely associated with risk, one of our top hits was LPC C20:1 that was positively associated with risk 18 . In an earlier study nested within the EPIC cohort of 774 participants including 362 breast cancer cases, C18:0 LPC was identified as inversely associated with breast cancer risk, after adjusting for potential risk factors 19 , though analyses were not stratified by menopausal status.
Our study identified three amino acid derivatives associated with breast cancer risk. High levels of phenylacetylglutamine were associated with decreased breast cancer risk while high levels of asparagine and creatine were associated with increased risk. Phenylacetylglutamine is formed from phenylacetate and glutamine and is found as a normal constituent of human urine 45 . Phenylacetylglutamine is a host microbiome cometabolite associated with bacterial phenylalanine metabolism [46][47][48][49] . Clostridium difficile, F. prausnitzii, Bifidobacterium, Subdoligranulum, and Lactobacillus are all positively associated with hippuric acid 48,50 , while Bifidobacterium is positively associated with phenylacetylglutamine and microbes of the Christensellaceae, Ruminococcaceae, and Lachnospiracaea families are negatively associated with phenylacetylglutamine 47,51 . While not directly linked to breast cancer risk, high serum levels of phenylacetylglutamine is a potential early marker of kidney dysfunction in chronic kidney disease 51 . Glutamine, a precursor to phenylacetylglutamine, has been associated with breast cancer risk in a nested case-control study within the French SU.VI.MAX cohort (n = 211 cases). High levels of glutamine were associated with increased risk (OR per SD increase =1.33, 95% CI: 1.07-1.66) and this association persisted among the subgroup of premenopausal women (p for interaction = 0.003) 52 . Notably, elevated expression of asparagine synthetase correlates with lung metastasis in breast cancer patients 53 . Furthermore, in vitro breast cancer studies show asparagine bioavailability promotes invasion and metastasis and that this effect is prevented by limiting asparagine bioavailability 53 . In a nested case-control study within the EPIC cohort (n = 1624 cases), asparagine was inversely associated with breast cancer risk (OR =  18 . However, the EPIC study participants were overwhelmingly (> 70%) postmenopausal at the time of blood collection, in contrast to our population with 80% premenopausal women. Differences in the menopausal status may partially explain the observed opposite directions of association between asparagine and breast cancer risk.
Creatine is obtained from meat consumption and synthesized endogenously from arginine, glycine, and methionine. Most creatine is found in skeletal muscle, and a significant amount is also found in the brain. Omnivores obtain roughly 50% of their daily creatine from meat and 50% is biosynthesized, while vegetarians biosynthesize most of their creatine 54 and have significantly lower muscular creatine levels than meat eaters 55 . Creatine is broken down to creatinine in a first-order reaction, the rate of which decreases with age and decreased muscle mass 56 . Creatine/creatinine metabolism plays an important role in energy metabolism in skeletal muscle tissue, and thus disturbances in this pathway are associated with many muscle diseases, whether as a cause or consequence 57 . To the best of our knowledge, no previous work has reported a link between creatine and breast cancer risk. However, the association we observed in this analysis is consistent with the positive association between red meat and risk of breast cancer among premenopausal women in the NHSII cohort 58 .
Kynurenic acid and 2-aminohippuric acid are benzenoids inversely associated with breast cancer risk in our study. 2amminohippuric acid is a glycine conjugate of anthranilic acid and can be synthesized in the liver 59 , but little is known about its biological function. Both kynurenic acid and 2-aminohippuric acid are part of the kynurenine branch of the tryptophan pathway, an essential amino acid and a precursor to many biologically active metabolites 60,61 . Studies of the function of kynurenic acid suggest pleiotropic roles in disease. On the one hand, kynurenic acid has been shown to have anti-inflammatory and anti-ulcerative properties in animal models, as well as antioxidative properties in vitro in human cells 62 . On the other hand, circulating kynurenic acid was associated with increased risk of insulin resistance 63 . Kynurenic acid acts as both an anti-inflammatory and immunosuppressive factor which in turn allows tumor proliferation 64 . Tryptophan metabolism plays an important role in tumor progression and malignancy with several cancers expressing tryptophan-degrading enzymes such as IDO1 65,66 . However, the direct role of kynurenic acid remains unclear, with both proliferative and antiproliferative effects on human glioblastoma cells, an antiproliferative effect on human colon adenocarcinoma cells, and decreased DNA synthesis and inhibited migration in both cell types 62 . Neither kynurenic acid nor 2-aminohippuric acid have been previously associated with breast cancer risk.
N2,N2-dimethylguanosine (DMGU) is a purine nucleoside and a primary degradation product of transport RNA. Elevated circulating DMGU levels may indicate cellular stress and is associated with several diseases, including pulmonary arterial hypertension 67 , solid tumors 68 , incident type II diabetes 69 , and all-cause mortality 70 . Elevated levels of N2,N2-dimethylguanosine were found in patients with acute leukemia and breast cancer 71 . Elevated circulating levels of this metabolite are associated with lower risk of breast cancer in our study.
Importantly, there are several studies suggesting lifestyle and dietary changes may impact circulating levels of the here identified metabolites; however, the relationships are complex and warrant further study. For example, vegetables intake was positively associated with piperine (manuscript in preparation) and inversely associated with DMGV 34 but both metabolites are inversely associated with risk. The Mediterranean diet was positively associated with C40:7 PE plasmalogen 72 though this metabolite was positively associated with risk. Further, physical activity is inversely associated with creatine and positively associated with asparagine 73 though we observed positive associations with breast cancer risk for both metabolites. This work supports the fact that these metabolites are potentially modifiable, however the complex relationships with breast cancer risk preclude clear recommendations and more investigation is required.
Our study has several strengths and limitations. Notably, we conducted a prospective analysis of amino acid and amino acidrelated metabolomics and risk of breast cancer among a large number of predominantly premenopausal women. We had detailed information on sample collection characteristics and risk factors, which we included in our statistical approaches. Although metabolomics was measured at only one point in time, the identified metabolites are reasonably stable over time 74 (ICCs or correlation over 1-2 years ≥0.75 for nine metabolites; no data are available for 2-aminohippuric acid and C20:1 LPC). Our cohort consisted of registered nurses, a group that are not representative of the general population (e.g., social economic status), however there is no evidence suggesting that breast carcinogenesis is different in this group of women. Additionally, our cohort also included predominantly Caucasian women. However, while the prevalence of risk factors often differs across population subgroups, many breast cancer risk factors have been documented to operate similarly across ethnic groups, as would be expected from a common underlying biology [75][76][77][78][79][80][81][82][83][84][85][86][87][88] . Nonetheless, it is crucial that future studies conduct similar analyses in racially and ethnically diverse cohorts. While we had reasonable power in most of our analyses, we had limited power among ER negative tumors. Lastly, the uniqueness of our data measured among predominantly premenopausal women make replication studies challenging. We conducted this analysis in a hypothesis generating framework and hope that other cohorts will follow and analyze metabolomics data stratifying by menopausal status.
In summary, we identified eleven metabolites associated with risk of breast cancer among premenopausal women. Seven metabolites remained associated after adjustment for established risk factors and were selected by at least one multivariable modeling approach: higher levels of 2-aminohippuric acid, kynurenic acid, piperine, DMGV and phenylacetylglutamine were associated with lower breast cancer risk while higher levels of creatine and C40:7 PE plasmalogen were associated with increased breast cancer risk. Additional prospective cohort studies are needed to assess these associations considering menopausal status. If these findings are validated, experimental studies are warranted to understand the underlying biological mechanisms driving changes in metabolite levels.

Study population
In 1989, 116,429 female registered nurses aged 25-42 y returned a mailed questionnaire and were enrolled in the NHSII. Participants have been followed biennially since 1989 with questionnaires collecting information on reproductive history, lifestyle factors, diet, medication use, and new disease diagnoses.
In 1996-1999, 29,611 NHSII participants aged 32-54 y contributed blood samples, as previously described 89 . Of these, 18,521 women who had not used oral contraceptives, been pregnant or breastfed in the previous 6 months provided samples timed within the menstrual cycle, targeting the early follicular (days 3-5 of the cycle) and mid-luteal (7-9 days prior to expected start of next cycle) phases. The remaining women donated a single untimed sample. Follicular plasma was separated and frozen by the participants and returned with the luteal sample; samples were collected and shipped overnight to our laboratory where we processed and archived aliquots of white blood cell, red blood cell, and plasma in liquid nitrogen freezers (≤−130°C). Follow-up in the blood subcohort is high (96% in 2011).
The study protocol was approved by the institutional review boards of the Brigham and Women's Hospital and Harvard T.H. Chan School of Public O.A. Zeleznik et al.
Health, and those of participating registries as required. The return of the self-administered questionnaire and blood sample was considered to imply consent.

Case and control selection
Cases of breast cancer were identified after blood collection among women who had no reported cancer (other than nonmelanoma skin). Thousand and fifty-seven cases (invasive cases n = 780) were diagnosed between 1999 and 2011. Breast cancer cases were reported by the participant, which were confirmed by medical record reviews (n = 1015) or verbally by the nurse (n = 42). Given the high confirmation rate by medical record for breast cancer in this cohort (99%), all cases are included in this analysis.
One control was matched per case by the following factors: age (+/− 2 y), menopausal status and postmenopausal hormone therapy (HT) use at blood collection and diagnosis (premenopausal, postmenopausal and not taking HT, postmenopausal and taking HT, and unknown), and month (+/− 1 month), time of day (+/− 2 h), fasting status at blood collection (≤8h after a meal or unknown; >8h), race/ethnicity (African-American, Asian, Hispanic, Caucasian, other) and luteal day (+/− 1 day; timed samples only).

Covariate information
Data on breast cancer risk factors, including anthropometric measures, reproductive history, and lifestyle factors, were collected from questionnaires administered biennially and at the time of blood collections. Case characteristics, including invasive vs. in situ, histologic grade, estrogen and progesterone receptor (ER, PR), were extracted from pathology reports. As previously described 90 , immunohistochemical results for ER and PR, read manually by a study pathologist, were included for cases with available tumor tissue included in tissue microarrays.

Laboratory assay
Plasma metabolites were profiled at the Broad Institute of MIT and Harvard (Cambridge, MA) using a liquid chromatography tandem mass spectrometry (LC-MS) method designed to measure polar metabolites such as amino acids, amino acids derivatives, dipeptides, and other cationic metabolites as described previously 33,74,91,92 . Pooled plasma reference samples were included every 20 samples and results were standardized using the ratio of the value of the sample to the value of the nearest pooled reference multiplied by the median of all reference values for the metabolite. Samples were run together, with matched case-control pairs (as sets) distributed randomly within the batch, and the order of the case and controls within each pair randomly assigned. Therefore, the case and its control were always directly adjacent to each other in the analytic run, thereby limiting variability in platform performance across matched case-control pairs. In addition, 238 quality control (QC) samples, to which the laboratory was blinded, were also profiled. These were randomly distributed among the participants' samples.
Hydrophilic interaction liquid chromatography (HILIC) analyses of water soluble metabolites in the positive ionization mode were conducted using an LC-MS system comprised of a Shimadzu Nexera X2 U-HPLC (Shimadzu Corp.; Marlborough, MA) coupled to a Q Exactive mass spectrometer (Thermo Fisher Scientific; Waltham, MA). Metabolites were extracted from plasma (10 µL) using 90 µL of acetonitrile/methanol/formic acid (74.9:24.9:0.2 v/v/v) containing stable isotope-labeled internal standards (valine-d8, Sigma-Aldrich; St. Louis, MO; and phenylalanine-d8, Cambridge Isotope Laboratories; Andover, MA). The samples were centrifuged (10 min, 9000 × g, 4°C), and the supernatants were injected directly onto a 150 × 2 mm, 3 µm Atlantis HILIC column (Waters; Milford, MA). The column was eluted isocratically at a flow rate of 250 µL/min with 5% mobile phase A (10 mM ammonium formate and 0.1% formic acid in water) for 0.5 min followed by a linear gradient to 40% mobile phase B (acetonitrile with 0.1% formic acid) over 10 min. MS analyses were carried out using electrospray ionization in the positive ion mode using full scan analysis over 70-800 m/z at 70,000 resolution and 3 Hz data acquisition rate. Other MS settings were: sheath gas 40, sweep gas 2, spray voltage 3.5 kV, capillary temperature 350°C, S-lens RF 40, heater temperature 300°C, microscans 1, automatic gain control target 1e6, and maximum ion time 250 ms. Metabolite identities were confirmed using authentic reference standards or reference samples.
In total, 259 known metabolites were measured in this study. Metabolites not passing our previously conducted processing delay pilot study 74 were excluded from this analysis (N = 33). All metabolites (N = 226) included here exhibited good reproducibility within person over 1-2 years 74

Statistical analysis
Metabolite levels were natural logarithm transformed and standardized prior to statistical analysis. Missing values were imputed by one half the lowest observed value per metabolite, for metabolites with <10% missing values (N = 1). Metabolites with >10% missing values (N = 19) were excluded from the main analysis and evaluated in an exploratory analysis. Association of metabolite levels with breast cancer risk was assessed in metabolite-by-metabolite models and in multivariable analyses that included all metabolites simultaneously.
The association of individual metabolites with breast cancer risk was assessed in conditional logistic regression models (1057 cases and 1057 controls). In a simple model, each metabolite was included without adjustment for other factors. In an adjusted model, the following additional factors were included: BMI at age 18, weight change from age 18 to time of blood draw, age at menarche, parity and age at first birth, family history of breast cancer, diagnosis of benign breast disease, physical activity, alcohol consumption, exogenous hormone use and breastfeeding history. Odds ratios (OR) and 95% confidence intervals (95% CI) were estimated for a one-unit (one standard deviation) increase in the logtransformed and standardized metabolites levels.
We performed analyses restricting to premenopausal women at blood collection (838 cases and 838 controls), and analyses stratified by BMI (<25 kg/m 2  . In a sensitivity analysis, we observed similar results between conditional logistic regression and unconditional logistic regression adjusting for the matching factors. Thus, stratified analyses were conducted using unconditional logistic regression, additionally adjusting for the matching factors. To test for effect modifications by BMI and heterogeneity by ER status, we included cross-product terms of the metabolite with potential effect modifier in conditional logistic models and report the p-value for that interaction. As ER status represents a case characteristic, we assigned each control the ER status of its matched case. In an exploratory analysis we assessed the association with risk of breast cancer for the 19 metabolites with >10% missing values. We included the continuous metabolite level as well as a presence/absence indicator in the fully adjusted conditional logistic regression model and performed a likelihood-ratio test (full model compared to a model excluding both the metabolite and the presence/absence indicator) to estimate the significance level of the association.
Multivariable analyses evaluating the joint association of the 207 metabolites with breast cancer risk were based on (1) conditional logistic regression with lasso penalty ("Lasso"), (2) conditional logistic regression with an elastic net penalty ("Elastic Net"), and (3) random forests. In the Lasso and Elastic Net analyses, a minimally adjusted model included only the set of 207 metabolites, whereas a fully adjusted model further adjusted for the risk factors noted above. In each analysis, the optimal values of the regularization parameter(s) were estimated as that which minimizes the average deviance in the left-out partitions, in a 10-fold cross validation procedure. A p-value for each metabolite was obtained from a permutation test in which the case/control labels were permuted within each matched stratum. A p-value for each metabolite was calculated as the proportion of permutations (out of 250) in which the magnitude of the coefficient under label permutation was at least as large as the regression coefficient in the observed dataset. Analyses were carried out using the R library clogitL1 93 .
Random forests analyses included a minimally adjusted model that included the 207 metabolites and matching factors. A fully adjusted model also included the additional risk factors noted above. In all analyses, the parameter mtry corresponding to the number of variables randomly sampled as candidates at each split was set to the square root of the total number of covariates in the model. Each classifier was an aggregate of 5000 trees. A p-value for each metabolite was obtained from a permutation test as described above in which the case/control labels were randomly O.A. Zeleznik et al.
permuted 100 times. Analyses were carried out using the R library random forests 94 .
To adjust for multiple testing in the conditional logistic regression and the multivariable models we estimated the positive FDR based on the qvalue procedure 95 . Metabolites that satisfied a p-value less than 0.05 and corresponding q-value less than 0.20 in the minimally adjusted model were discussed as primary findings.
Criterion for statistical significance: metabolites that met a p-value < 0.05 and q-value < 0.20 in at least one of the four models (Conditional Logistic Regression, Lasso, Elastic Net, and Random Forests) with minimal adjustment were considered as statistically significant. An exception was made for metabolites that did not meet a p-value threshold < 0.05 in the conditional logistic regression: four metabolites that met the threshold for statistical significance in Lasso only but had high raw p-values (>0.3) in the conditional logistic regression were excluded (C12:1 carnitine, C22:5 LPC, C46:2 TAG, glycine).
A metabolite score was estimated for each participant as a linear combination of all metabolites that met the threshold for statistical significance. The coefficients associated with each metabolite were estimated in a conditional logistic regression model with the Lasso penalty that included all metabolites simultaneously and with full adjustment for all potential confounders.
Analyses were caried out using R 96 version 3.6.3 and SAS version 9.4. All statistical tests were two-sided.

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

DATA AVAILABILITY
The data generated and analyzed during this study are described in the following data record: https://doi.org/10.6084/m9.figshare.14374349 97 . The demographics and questionnaire data together with the plasma metabolomics profiles are not publicly available for the following reason: data contain information that could compromise research participant privacy. Requests to access these data should be made via http:// www.nurseshealthstudy.org/researchers. Analysis results for all metabolites are openly available as part of this figshare metadata record in the files titled "supplementary_tables 1-4.csv".