Our blood transports many chemicals besides oxygen and carbon dioxide. Some of these molecules provide useful indicators of the state of our health. Indeed, measuring such biomarkers is a common feature of clinical blood tests. Other molecules present, such as hormones and drugs, directly affect health by modulating processes such as metabolism and immune responses. Writing in Nature, Bar et al.1 shed light on the factors that affect the recipe for human blood’s chemical brew.
The origin of most blood-borne molecules, and why they vary in concentration between individuals, is unknown. The list of possible regulators is long: for any given molecule, diet, drugs, medical conditions and history, genetic variants and gut microorganisms might all have a role. Furthermore, these factors can interact, as is the case for trimethylamine oxide. This molecule, which promotes the artery-narrowing disease atherosclerosis, is generated as a result of the metabolism, by both microbes and their host, of certain dietary compounds that are abundant in red meat2. For molecules such as this, which directly affect health, understanding their metabolic regulation might help to yield new clinical treatments.
Bar et al. describe their efforts to tackle the question of what factors govern the molecules present in blood. This work requires not only measurement of the many variables potentially involved, but also the use of analytical methods that can capture complexity — such as the interactions between variables — while still ensuring that valid predictions can be made for individuals outside the study population.
The authors began by characterizing blood samples from a group of 491 healthy individuals in great detail. They quantified the molecules in serum — the liquid component of blood that remains after the proteins needed for clotting have been removed. The study participants provided detailed health information, and answered questionnaires about diet and lifestyle. They also gave stool samples, which were used for DNA sequencing, to determine the genetic signatures of the gut microbes present (also known as the microbiome).
As the authors acknowledge, this is a small study group by the standards of genome-wide association studies, which seek to find connections between genes and disease. Bar et al. are also not the first to link serum molecules to genetic variation or the microbiome3,4. However, the authors’ analysis of this group of individuals is unique in the number of data types that were systematically collected to investigate serum composition.
Next, Bar et al. used a machine-learning approach to link factors such as human genetics and microbiome information to the molecules in the blood. By carrying out many analyses omitting different data subsets, the authors found that diet, the microbiome and clinical variables such as prescription-drug use and blood pressure had the most associations with serum molecules. Although the authors found some genetic associations, confirming 46 previously reported gene–metabolite links, they concluded that the association effects for genetic factors were smaller than were those for diet, clinical variables and the microbiome. These various data types are not exactly comparable, but the authors’ estimates of the genetic effects are in line with results from previous work, providing support for their conclusion that diet and the microbiome have larger and more pervasive influences on serum composition than do genetic factors.
Diet and the microbiome could predict the data for some molecules with similar levels of accuracy, as would be expected, given that diet can affect microbiome composition. But Bar and colleagues showed that these data types provide non-overlapping information, too. For example, dietary information uniquely predicted particular metabolites associated with the consumption of citrus fruit, whereas the presence of a type of microbe belonging to the Lachnospiraceae family strongly predicted the presence of indoxyl sulfate — a bacterial breakdown product of the amino acid tryptophan, previously linked to diseases of the kidney and vasculature5.
To make predictions about the concentrations of molecules present in blood samples, Bar et al. used a machine-learning method called gradient-boosted decision trees, which can capture complex interactions. A decision tree learns simple ‘if–then’ rules to make predictions (Fig. 1). Gradient boosting layers many such trees, training each new tree specifically to reduce the prediction errors left by the trees that came before it.
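The layering idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a squared-error objective, uses one-rule trees ('stumps') and runs on synthetic data. The two feature columns (a citrus-intake score and a microbe abundance) and all function names are hypothetical, chosen only for illustration.

```python
import random

def fit_stump(X, residuals):
    """Find the one 'if feature <= threshold' rule that best fits the residuals.
    Assumes at least one feature takes two or more distinct values."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between neighbouring values
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, row):
    j, t, left_mean, right_mean = stump
    return left_mean if row[j] <= t else right_mean

def fit_boosted(X, y, n_trees=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current errors."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, row)
                 for p, row in zip(preds, X)]
    return base, stumps, learning_rate

def predict_boosted(model, row):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)

def mean_squared_error(model, X, y):
    return sum((predict_boosted(model, row) - yi) ** 2
               for row, yi in zip(X, y)) / len(y)

# Synthetic data (hypothetical): columns are [citrus_score, microbe_abundance].
rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(60)]
y = [2.0 * (citrus > 0.5) + 1.0 * (microbe > 0.3) for citrus, microbe in X]

baseline_model = fit_boosted(X, y, n_trees=0)  # predicts the mean only
model = fit_boosted(X, y, n_trees=50)
baseline_mse = mean_squared_error(baseline_model, X, y)
boosted_mse = mean_squared_error(model, X, y)
print(baseline_mse, boosted_mse)
```

The errors here are measured on the training data, so the comparison only shows that each added tree chips away at what the earlier trees got wrong. It says nothing about performance on new individuals, which is why held-out evaluation and independent cohorts matter.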
Bar and colleagues interpreted these models using an approach called feature-attribution analysis. This yields specific hypotheses about how individual factors, such as microbes, foods and genetic variants, influence a particular prediction (here, the molecular composition of blood). More-complex models can be prone to ‘overfitting’, that is, making erroneous predictions that are based on noise or irrelevant details. The authors therefore fitted and evaluated their models conservatively, but, even more importantly, they confirmed many of their predicted microbe-to-metabolite links in two large, independent study groups. Finally, Bar et al. tested one set of their predictions in a smaller study, identifying molecules (cytosine and betaine) associated with the consumption of wholewheat bread, and then showing that individuals randomly assigned to eat the bread had the expected changes in these metabolites.
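To make the idea of feature attribution concrete, here is a small sketch. The text above does not name the specific attribution method, so this uses exact Shapley values against a single reference individual, one widely used choice, purely as an assumption. The toy model, the feature names and the numbers are all hypothetical.

```python
from itertools import combinations
from math import factorial

# Hypothetical feature names, for illustration only.
FEATURES = ["citrus_servings", "microbe_abundance", "age"]

def toy_model(x):
    """Hypothetical predictor of a metabolite level; it ignores age entirely."""
    citrus, microbe = x["citrus_servings"], x["microbe_abundance"]
    return 1.5 * citrus + 4.0 * microbe + 2.0 * citrus * microbe

def shapley_values(model, x, baseline):
    """Exact Shapley attribution of model(x) - model(baseline) to each feature.
    Features outside a coalition are set to their baseline value."""
    n = len(FEATURES)

    def value(coalition):
        mixed = {f: (x[f] if f in coalition else baseline[f]) for f in FEATURES}
        return model(mixed)

    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        total = 0.0
        for k in range(len(others) + 1):
            # Weight of coalitions of size k in the Shapley average.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                coalition = set(subset)
                total += weight * (value(coalition | {f}) - value(coalition))
        phi[f] = total
    return phi

person = {"citrus_servings": 2.0, "microbe_abundance": 0.5, "age": 40.0}
baseline = {"citrus_servings": 1.0, "microbe_abundance": 0.1, "age": 50.0}
attributions = shapley_values(toy_model, person, baseline)
print(attributions)
```

Two properties make this kind of output usable as a hypothesis generator: the attributions add up to the difference between this person's prediction and the reference prediction, and a feature the model never uses receives an attribution of zero. Enumerating every coalition is feasible only for a handful of features; practical tools for tree models rely on faster algorithms.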
This study is comprehensive, but plenty of room remains for future exploration. The authors used the well-validated and standardized Metabolon platform to measure serum metabolites, but no such metabolomic analysis method can cover the full range of blood-borne compounds. Certain types of molecule, such as blood lipids, might therefore be under-sampled compared with others. This might explain why the authors mostly detected metabolite associations with only one of the two most abundant lineages of gut bacteria6,7. Metabolomics can detect molecules whose identity is unknown beyond their molecular weight, and, indeed, the authors report several associations with such unknown metabolites. Although these might point to previously unknown aspects of biology (interestingly, for example, one such association was linked to the age of the participant), without metabolite identification, only limited conclusions can be drawn.
The authors’ microbiome data provide DNA information for all the genomes present in stool extracts. However, Bar et al. distil these data down to the level of abundances of bacterial species, excluding non-bacteria such as yeasts or protozoan organisms. Limiting analyses to the species level also obscures the fact that strains of the same bacterial species can differ in gene content. For example, the metabolism of the drug digoxin in vivo by the bacterium Eggerthella lenta requires a gene that is present in only certain strains of E. lenta8. Finally, the authors were unable to link serum metabolites to the specific bacterial enzymes responsible for their generation, which would have helped to connect the observed associations to the underlying molecular mechanisms.
These limitations should not detract from the most useful aspect of this paper. By making the full data set available to the research community, Bar and colleagues could help enable the development of future computational methods, potentially resolving some of these limitations, or even providing ways to answer new questions. Their data are likely to be a rich and valuable resource for scientists interested in the mechanisms by which diet, the microbiome and genetics affect our biochemistry and physiology.