Proteins are effector molecules that mediate the functions of genes1,2 and modulate comorbidities3,4,5,6,7,8,9,10, behaviors and drug treatments11. They represent an enormous potential resource for personalized, systemic and data-driven diagnosis, prevention, monitoring and treatment. However, the concept of using plasma proteins for individualized health assessment across many health conditions simultaneously has not been tested. Here, we show that plasma protein expression patterns strongly encode for multiple different health states, future disease risks and lifestyle behaviors. We developed and validated protein-phenotype models for 11 different health indicators: liver fat, kidney filtration, percentage body fat, visceral fat mass, lean body mass, cardiopulmonary fitness, physical activity, alcohol consumption, cigarette smoking, diabetes risk and primary cardiovascular event risk. The analyses were prospectively planned, documented and executed at scale on archived samples and clinical data, with a total of ~85 million protein measurements in 16,894 participants. Our proof-of-concept study demonstrates that protein expression patterns reliably encode for many different health issues, and that large-scale protein scanning12,13,14,15,16 coupled with machine learning is viable for the development and future simultaneous delivery of multiple measures of health. We anticipate that, with further validation and the addition of more protein-phenotype models, this approach could enable a single-source, individualized so-called liquid health check.
As populations worldwide are increasingly affected by multimorbidity and avoidable chronic health conditions, the need to prevent illness is increasing17. In response, healthcare providers have instituted preventative medicine programs. For example, the UK National Health Service has implemented a triple prevention strategy18 with initiatives such as Health Check19, Healthier You20 and the National Diabetes Prevention Programme20. The advantages of such approaches are that they are inexpensive, cost effective and scalable20. However, the tools key to making them useful could be improved beyond taking medical history, a limited number of laboratory tests and group participation in health coaching. While the low-cost tests and assessments of lifestyle are prognostic on a population level, long-term adherence is difficult to sustain21 and a process that is not individualized cannot be optimal for everyone.
Applications of big data and systems medicine have been suggested to provide additional information to transform healthcare22,23, but these claims depend on the degree to which the information sought is encoded within the data source and whether it can be easily extracted. There is some evidence for reduced healthcare utilization associated with information-rich physiologic health measurements24, but scalability is limited by the high cost of generating these data. This study evaluates whether protein scanning can fill the gap between contemporary demands for practicality and low cost and the future promise of the impact of personalized, systemic and data-driven medicine.
Proteins regulate biological processes and can integrate the effects of genes with those of the environment, age, comorbidities, behaviors and drugs2. There are about 19,000 human genes coding for approximately 30,000 proteins25. Of these, up to 2,200 proteins enter the bloodstream by purposeful secretion to orchestrate biological processes in health or in disease, including hormones, cytokines, chemokines, adipokines and growth factors26. Other proteins enter plasma through leakage from cell damage and cell death. Both secreted and leakage proteins can inform health status and disease risk. We therefore hypothesized that protein scanning could deliver comprehensive individualized health assessments, but with single-source convenience and greater usability in typical medical practice. While this approach using modified aptamers has become established for discovering and understanding gene–protein interactions1, drug pharmacology11, biological control systems2, biomarkers in individual diseases and risks3,4,5,6,7,8, aging9 and obesity10, it has not been evaluated previously as a potentially holistic, quantitative health assessment for simultaneous evaluation of multiple health issues.
In this proof-of-concept study based on five observational cohorts in 16,894 participants, we evaluated the ability of the scanning of ~5,000 proteins in each plasma sample to simultaneously capture the individualized imprints of current health status, the impact of modifiable behaviors and incident risk of cardiometabolic diseases (diabetes, coronary heart disease, stroke or heart failure).
Models were developed for 11 of 13 predefined health measures; their performance metrics are shown in Table 1 and graphically in Fig. 1. Where a comparator existed (cardiovascular (CV) risk and incident diabetes risk), success was defined as performance of a validated model at least equivalent to that of the best available comparator, measured by C-statistic and/or net reclassification index (NRI27,28) versus the reference American College of Cardiology (ACC)/American Heart Association (AHA) risk score. Where there was no comparator, success was a high degree of correlation with a truth standard: Spearman correlation coefficients >0.6 (that is, r2 > 0.36) or, for binary measures, an area under the receiver operating characteristic curve (AUC) > 0.7.
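The no-comparator success criteria can be illustrated with a short sketch. This is not the study's analysis code: the function names are hypothetical, and the AUC is computed with the standard pairwise (Mann-Whitney) formulation on toy scores.

```python
def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def meets_criterion(metric, value):
    """Apply the thresholds quoted in the text (Spearman > 0.6, AUC > 0.7)."""
    thresholds = {"spearman": 0.6, "auc": 0.7}
    return value > thresholds[metric]


# Toy example (hypothetical scores, not study data).
toy_auc = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
toy_pass = meets_criterion("auc", toy_auc)
```

Both thresholds are strict inequalities, so a Spearman coefficient of exactly 0.6 would not count as a success under this reading.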
For current health states, protein-phenotype model performance metrics in the validation datasets are as follows: predicting presence/absence of liver fat by ultrasound: AUC = 0.83 for proteins, AUC = 0.64 for the best clinical model using age, sex, alcohol, statins and pre-diabetes status; predicting kidney function, estimated glomerular filtration rate (eGFR) above/below 60 ml min–1 1.73 m–2: AUC = 0.94; predicting percentage body fat (%) by DEXA: r2 = 0.92 for proteins and 0.74 for the best clinical model using sex, height and weight; predicting visceral fat (kg) by DEXA: r2 = 0.70; predicting lean body mass (kg) by DEXA: r2 = 0.82 for proteins and 0.74 for the best clinical model using age, sex and height; predicting cardiopulmonary fitness (VO2 max ml min–1 kg–1): r2 = 0.71.
For modifiable behaviors, model validation performance metrics are as follows: predicting average daily physical activity energy expenditure (kJ kg–1 d–1) from individually calibrated heart rate and movement sensing: r2 = 0.38; predicting alcohol consumption on self-reported questionnaires above or below UK guidelines of 14 units per week, separate models for men and women: AUC = 0.86 for women and 0.82 for men; predicting current cigarette smoking on self-reported questionnaires: AUC = 0.82.
For future cardiometabolic risks, model validation performance metrics are as follows: predicting incident diabetes in pre-diabetics within 10 years: accuracy 67% versus 61% for the best oral glucose tolerance model trained in the same participants using combined fasting and peak glucose levels; predicting primary CV events (myocardial infarction, stroke, hospitalization for heart failure or CV death) within 5 years: C-statistic of 0.66 and NRI = +0.21 versus the reference 2013 ACC/AHA atherosclerotic CV (ASCVD) risk score, which had a C-statistic of 0.65.
There were two unsuccessful model attempts: we found no significant proteins that predicted future body weight 5 years after blood sampling when evaluated in the incident diabetes subset of Whitehall II; and preliminary model correlations within the Fenland study predicting macronutrient intake by questionnaire (dietary fat, carbohydrate and protein intake) had r2 values of only ~0.1 each.
Overall, each successful model incorporated between 13 and 375 protein measurements, with a total of 891 unique human proteins incorporated across all models. The top three proteins with the largest mathematical contribution to each model, along with their biological relevance to the phenotype, are shown in Table 2, and complete protein lists for all the models can be found in Supplementary Table 1. The proportionate degree of protein overlap across phenotype models is shown in Table 3. Overall, the degree to which proteins in one model were represented in another was modest, with a mean of 12% shared. The most frequently selected individual protein was leptin, which was important for percentage total body fat, visceral fat, physical activity and cardiorespiratory fitness. Within the 110 possible cross-model comparisons in Table 3, only 12 had >25% overlap in proteins shared across models. The highest combined overlap was between visceral fat and liver fat (38% of visceral fat proteins were represented in the liver fat model and (coincidentally) 38% of the liver fat proteins were represented in the visceral fat model). Of the 96 proteins in the model for visceral fat, 29, 29 and 38% were shared with incident diabetes, lean body mass and liver fat, respectively. Of the 115 proteins in the protein-phenotype model for lean body mass, 29, 26 and 26% were shared with the visceral fat, physical activity and VO2 max models, respectively.
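The overlap percentages are directional: the share of model A's proteins found in model B need not equal the share of model B's proteins found in model A, which is why 11 models yield 110 comparisons. A minimal sketch of this calculation follows, using hypothetical protein sets rather than the lists in Supplementary Table 1; `directed_overlap` is an illustrative name.

```python
from itertools import permutations


def directed_overlap(models):
    """Return {(a, b): percentage of model a's proteins also present in model b}."""
    return {
        (a, b): 100.0 * len(models[a] & models[b]) / len(models[a])
        for a, b in permutations(models, 2)
    }


# Toy protein sets (hypothetical identifiers, not the study's protein lists).
toy_models = {
    "model_a": {"P1", "P2", "P3", "P4"},
    "model_b": {"P1", "P2", "P5", "P6", "P7"},
    "model_c": {"P8", "P9"},
}
overlap = directed_overlap(toy_models)

# Ordered pairs among 11 models: 11 * 10 = 110, the count quoted in the text.
n_comparisons = len(list(permutations(range(11), 2)))
```

In the toy data, 50% of model_a's proteins appear in model_b, while only 40% of model_b's appear in model_a, because the two models differ in size.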
In this large proteomic study representing a set of prospectively defined analyses of retrospective, archived samples and data from five well-characterized cohorts, approximately 5,000 proteins were measured in nearly 17,000 participants, resulting in ~85 million individual protein measurements. The results were analyzed rigorously by predefined statistical plans that relied on several state-of-the-art supervised machine learning approaches.
The intent of this proof-of-concept study was to evaluate the potential of protein scanning in becoming a sole information source capable of characterizing multiple elements of an individual’s current health state, modifiable behaviors and future cardiometabolic health risks from a single blood sample. Capturing health information in each of these domains would be a prerequisite for an idea of a future so-called liquid health check.
The objectives were largely fulfilled. Patterns of scanned plasma proteins were validated for six current health states, three behaviors and two key future disease risks. The validation of these protein-phenotype models, each consisting of 13–375 protein measurements involving a total of 891 human proteins, provides proof of concept for a scalable, individualized and holistic proteomic health assessment that might be delivered from plasma proteins alone.
The models we developed predicted results from some of the best clinical or physiological measures relevant to preventative health29,30,31,32,33,34. Acquiring the same information using standard techniques would require physician examination, laboratory testing, exercise stress testing and imaging assessments, with up to nine different patient appointments and potentially thousands of pounds in costs per patient, as shown in Supplementary Table 2. While some of the models demonstrated high performance (for example, the r2 of 0.92 for percentage body fat), others had only modest prognostic power (for example, the C-statistic of 0.66–0.69 for CV events); however, this was still modestly better than traditional risk factors and could also add value in overcoming the incomplete utilization of risk calculation in primary care.
An important feature of our study is the use of a sole information source (that is, a single blood draw) for protein-phenotype models. This was a key objective of our health check proof of concept, and therefore we did not include demographic or known risk factors in the models unless absolutely necessary to achieve the desired performance. This approach enabled the machine learning algorithms to include proteins that represented the biology of clinical and demographic factors where useful. For the same reason, we also did not test whether the models could be further enhanced by the addition of other features (history, physical signs, laboratory tests or genetic information). It is possible that such multi-source models could improve the models’ absolute performance, although the inclusion of additional features has potential implications for increasing costs and loss of convenience.
Another nonconforming feature of this study is its separation from biological analysis. We did not use any biological plausibility or causality information from the literature for feature selection, because most proteins scanned have never been measured at scale and because some of the proteins in our models are leakage proteins that might inform cell injury rather than biological causality. A full biological analysis of proteins in the models is ongoing; however, this is made complex by the algorithms’ biases for correlated features and their selection of proteins for normalizing adjustments not related to the target physiology. Nevertheless, as a simplified alternative, we present the biological functions of the top three proteins that make the greatest mathematical contribution to each of the 11 successful models in Table 2. All proteins included in the 11 successful models can be perused in Supplementary Table 1, and all proteins measured in Supplementary Table 3. The degree of sharing of proteins across phenotype models shown in Table 3 was modest, averaging 12% (range 0–38%). The individual proteins’ functions and the sharing of proteins between models were largely physiologically plausible. The individual protein with highest impact in multiple models was the appetite and metabolism regulator, leptin, which was included in percentage body fat, visceral fat, physical activity and cardiopulmonary fitness models. The highest overlap across models was the coincident inclusion of proteins in the models of liver fat, visceral fat and incident diabetes.
One limitation of our study is the nature of the truth standards we used for model training. In some cases, other good techniques exist (for example, liver biopsy or magnetic resonance imaging as alternatives to ultrasound for detection of liver fat) but in all cases the chosen reference measures we used have widespread use in medicine. In other cases, self-reported measures such as alcohol and smoking are subject to individuals’ truthfulness, in which case we depended on the careful evaluations made across the cohort studies that can now be applied to individuals.
Another limitation of our study is that the populations’ characteristics may limit the potential generalizability of the results; in particular, a Caucasian bias in some of our cohorts will demand calibration testing in different populations. Similarly, there is a bias in model development thus far towards metabolic health that limits claims of comprehensiveness. An obvious omission here is cancer, to which earlier versions of the SomaScan modified-aptamer assay have been applied35,36, but these cancer findings have not yet been translated to the current, more advanced, platform. Finally, the greatest potential value for such assessments is likely to come from their sensitivity to longitudinal change in health status or risks; future studies will have to investigate this question.
In conclusion, this proof-of-concept study shows that scanned protein expression patterns encode for several markedly different types of health information. It is thus conceivable that, with further validation and the potential for expansion of the number of tests, a comprehensive, holistic health evaluation using a battery of protein models derived from a single blood sample could be performed. The next step is to test the applicability of the protein models that we have derived and validated in observational cohorts under research conditions in real-world healthcare systems.
We prespecified 13 distinct measures of current health, modifiable behaviors and incident disease risks that are recognized by health experts as useful and/or commonly used for preventative health29,30,31,32,33. These have been well characterized in at least one of five independent cohort studies as the truth standards for deriving and validating proteomic model predictions: the UK Whitehall II and Fenland, the Norwegian HUNT3 and the US Covance and HERITAGE Family studies. EDTA plasma samples had been collected from all these studies and the samples were centrifuged and frozen typically 2–10 h after collection, a timeframe that is representative of how blood is handled in typical medical practice. Aliquots of these samples were assayed on the proteomic platform without further processing after transport and thawing.
The study designs and sample selections were from whole cohorts or case-cohort fractions throughout, intended to reduce selection and spectrum biases37,38. The multi-cohort study approach was needed as no single cohort has all the specified clinical measures or outcomes. Protein model outputs were deliberately simplified with primary care practitioners and patients in mind as the key target users. The flowchart for the proteomic program, including the source of the samples, data, model training and replication, is shown in Extended Data Fig. 1. Extended Data Fig. 2 shows details of the five parent cohort studies, and Extended Data Figs. 3–6 the participant characteristics for each model endpoint. The Nature Research Reporting Summary for this study is available as part of the online publication.
The modified aptamer binding reagents12, and SomaScan assay13 and its performance characteristics15,16, have previously been described. The annotated menu for all ~5,000 modified-aptamer binding reagents is shown in Supplementary Table 3.
The SomaScan assay begins with a mix of thousands of slow off-rate modified aptamers (SOMAmer reagents) in each well of a 96-well plate. These are labeled with a 5′ fluorophore, photocleavable linker and biotin and immobilized on streptavidin-coated beads through biotin–streptavidin interaction. A diluted plasma sample from one participant is added to each well.
Cognate and nonspecific SOMAmer–protein complexes form on the beads. After washing away unbound proteins, captured proteins are labeled with biotin. SOMAmer–protein complexes are released from the beads by photocleavage of the linker with ultraviolet light and incubated in a buffer containing an unlabeled polyanionic competitor. The competitor competes away nonspecific binding of ‘incorrect’ proteins, which dissociate rapidly from SOMAmer reagents owing to the fast off-rate of such interactions, whereas the cognate (intended) SOMAmer–protein interaction has a much slower off-rate (this is part of the original reagent selection process). This differential in kinetics, coupled with polyanionic competition, represents a second element of specificity (the first being the high affinity, enhanced by modifications to the aptamers), analogous to the effect of adding a second antibody in a conventional immunoassay.
SOMAmer–protein complexes are recaptured on a second set of streptavidin-coated beads through biotin-labeled proteins, followed by additional washing steps that facilitate further removal of nonspecifically bound SOMAmer reagents. SOMAmer reagents are then released from the complex in a denaturing buffer.
For readout, SOMAmer reagents are hybridized to complementary sequences on a DNA microarray chip and quantified by fluorescence. Fluorescence intensity in the SomaScan assay for each reagent is related to the relative availability of the three-dimensional shape-charge epitope on each protein (the binding site of the SOMAmer reagent) in the original sample. This availability reflects each protein’s abundance (concentration), the shape of the protein itself (which may be affected by a genetic variant or by modification) or the presence of a circulating competitor (physiologic or a therapeutic antibody).
Median intra- and interassay coefficients of variation are ~5%16 and assay sensitivity is comparable to that of typical immunoassays, with a median lower limit of detection in the femtomolar range.
Specificity of the modified aptamer reagents has been established in several ways. The binding affinity of 1,612 reagents has been tested against structurally related proteins as described by the manufacturer, in the succeeding paragraphs in this section. Because many proteins share structural and functional features, it is possible that the structural epitope to which a reagent binds is present on proteins other than the one initially used to select the reagent. Indeed, we have observed that a minority of reagents are able to bind with some degree of affinity to highly similar proteins, presumably through such a shared structural epitope, although not always with the same high affinity. Because the assay is performed in a complex biological sample containing thousands of different proteins, experimentally determining which reagents may also target other proteins to some degree can be extremely valuable in interpreting biomarker discovery data.
We first analyzed publicly available databases of known human protein sequences using sequence alignment tools (for example, BLAST) to identify those ‘relevant relative’ proteins that share significant homology with proteins used to select the modified aptamer reagents. Proteins with significant homology to the target protein (that is, proteins with >40% amino acid sequence identity with the target protein) were tested experimentally if available in the inventory or commercially available as full-length proteins from reliable vendors.
Available related proteins were analyzed with affinity-capture experiments similar to immunoprecipitation protocols. Modified aptamer reagents were immobilized on streptavidin-coated beads and then incubated with either the target protein or the identified related protein. The reagent–protein complexes were then washed, and the proteins labeled with a fluorophore. The complexes were then eluted and the recovery of bound versus input protein was analyzed by SDS–PAGE and fluorescent imaging. When any reagent binding to proteins other than the SELEX target was observed, we performed solution-affinity measurements to determine whether the reagent has similar or different affinities for the target protein and related protein. If the solution dissociation constant (Kd) was within tenfold of that for the SELEX target, the reagent was reported to bind the SELEX target and other proteins with ‘similar affinity’. If the measured affinity differed by greater than tenfold, we reported that the reagent binds to the protein(s) other than the SELEX target with ‘at least tenfold weaker affinity’. Although this is a broad statement regarding specific affinity, we do not report exact Kd values because of the high variability observed in both the quality and reported concentrations of commercially obtained purified proteins.
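The tenfold reporting rule can be written as a small classifier. The following is a sketch under stated assumptions: the function name and the Kd values are hypothetical, and the case in which a related protein binds more than tenfold more tightly than the SELEX target is not described in the text, so it is flagged separately rather than assigned a label.

```python
def classify_affinity(kd_target, kd_related, fold=10.0):
    """Label a related protein's affinity relative to the SELEX target.

    Both Kd values must be in the same units; a larger Kd means weaker binding.
    """
    if kd_target <= 0 or kd_related <= 0:
        raise ValueError("Kd values must be positive")
    ratio = kd_related / kd_target  # > 1: related protein binds more weakly
    if ratio > fold:
        return "at least tenfold weaker affinity"
    if ratio < 1.0 / fold:
        # Not covered by the reporting rule described in the text.
        return "not covered by reporting rule"
    return "similar affinity"


# Hypothetical Kd values in nM, for illustration only.
labels = [
    classify_affinity(2.0, 2.0),   # same Kd for target and related protein
    classify_affinity(1.0, 25.0),  # related protein binds 25-fold more weakly
]
```

Treating a ratio of exactly tenfold as "similar affinity" is one reading of "within tenfold"; the source does not specify how the boundary was handled.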
For 73% of cases in which proteins related to the SELEX target were available for testing, we observed binding of the reagent to the specific SELEX target and not to any of the related proteins. For example, a reagent selected to bind the protein tissue inhibitor of metalloproteinase-1 (TIMP-1) was also tested against the related proteins TIMP-2 (60% identical), TIMP-3 (31% identical) and TIMP-4 (40% identical). When this same TIMP-1 SOMAmer reagent was used in affinity enrichment from human plasma, four unique peptides corresponding to endogenous TIMP-1 were identified by liquid chromatography–tandem mass spectrometry in the enriched sample, and no peptides corresponding to any other member of the TIMP protein family were identified. Additionally, no peptides corresponding to TIMP-1 were identified in any other plasma pulldown samples performed using 142 different SOMAmer reagents, including a TIMP-2-specific reagent. In another representative example of highly specific binding, a reagent specific for matrix metalloproteinase-10 (MMP-10) does not bind MMP-12 (61% identical), MMP-13 (57% identical), MMP-3 (80% identical), MMP-1 (61% identical) or MMP-8 (50% identical).
Whenever we observed any binding to proteins other than the SELEX target (27% of the reagents tested) in initial pulldown tests, we followed up with measurements of solution affinity. We typically measure the association of radiolabeled reagent with protein and then capture the complex using a protein-affinity chromatography medium. Saturation-binding curves are then generated by titrating increasing amounts of protein in the presence of a constant, limiting amount of reagent. The Kd is determined to be the protein concentration at which half-maximal binding is observed. In one typical example, initial pulldown tests indicated that one reagent binds not only to its original SELEX target (pyrophosphatase 1 (PPA1)), but also to the related protein PPA2, which shares 68% amino acid sequence identity. However, solution-affinity measurements determined that the affinity was greater than tenfold stronger for PPA1 than for PPA2.
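Reading the Kd off a saturation-binding curve as the half-maximal concentration can be sketched as follows. The one-site binding model, the twofold titration series and the function names are assumptions made for illustration; the source does not state how the curves were fitted. The sketch also assumes the titration reaches saturation, so that the largest observed signal approximates maximal binding.

```python
import math


def fraction_bound(protein_nm, kd_nm):
    """One-site binding with limiting reagent: f = [P] / (Kd + [P])."""
    return protein_nm / (kd_nm + protein_nm)


def estimate_kd(concs, signal):
    """Interpolate, on a log-concentration axis, where signal crosses half its maximum."""
    half = max(signal) / 2.0
    points = list(zip(concs, signal))
    for (c0, s0), (c1, s1) in zip(points, points[1:]):
        if s0 <= half <= s1:
            if s1 == s0:
                return c0
            t = (half - s0) / (s1 - s0)
            return math.exp(math.log(c0) + t * (math.log(c1) - math.log(c0)))
    raise ValueError("titration does not bracket half-maximal binding")


# Simulated twofold titration (nM) for a hypothetical reagent with Kd = 2 nM.
concs = [0.0625 * 2 ** i for i in range(16)]
signal = [fraction_bound(c, 2.0) for c in concs]
kd_estimate = estimate_kd(concs, signal)
```

With real measurements a nonlinear fit of the full curve would normally be preferred over interpolation, since it uses every titration point and is less sensitive to noise near the midpoint.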
We observed that 13% of the reagents tested bound to members of a protein family with similar affinities. As previously noted, this recognition most often occurs when proteins share extensive sequence identity. Presumably, the structural epitope to which the reagent was selected is highly conserved and biochemically indistinguishable by solution equilibrium-binding affinities. In fact, for ~6% of the reagents tested (that is, almost half of the 13%), the related targets were products of the same gene with a common epitope (for example, splice variants such as vascular endothelial growth factor 121 and 165 isoforms) or shared subunits in a multi-subunit complex (for example, cyclin-dependent kinase 1/cyclin B1 complex, in which the reagent binds to the cyclin B1 subunit). The remaining ~7% appear to bind to epitopes shared amongst highly related families of proteins. For example, a reagent that binds to its SELEX target calcium/calmodulin-dependent protein kinase II delta (CAMK2D) also binds the closely related proteins CAMK2A (91% identical) and CAMK2B (87% identical). Solution-affinity comparisons determined that this reagent has a similar binding affinity, of ~2 nM, for all three proteins. As expected, the amino acid sequence identity tended to be greater for those pairs that exhibited cross-reactivity: the mean identity was 48% for pairs that exhibited no cross-reactivity (no positive pulldown results), 62% for pairs with greater than tenfold lower affinity but positive pulldown results, and 70% for pairs with similar affinity.
In summary, we have tested binding to related proteins for 1,612 modified reagents to date. We were unable to detect binding to any related proteins for 73% of those tested. When binding to related proteins was detected, about half of these reagents exhibited binding to at least one related protein with similar affinity while the other half bound to related proteins, but with at least tenfold weaker affinity. Specific target enrichment by pulldowns from human plasma has been confirmed for 123 of the SOMAmer reagents.
In orthogonal tests of specificity, the effect of cis genetic variants on protein expression in the assay has been published for 552 (ref. 1) and 1,046 (ref. 2) variants, and orthogonal validation by mass spectrometry has been performed for ~1,000 reagents2.
In addition to mitigations arising from reagent specificity and affinity, the impact of nonspecific binding is further reduced through a kinetic challenge during the assay. During a series of wash steps, excess unlabeled polyanion is added (aptamers are also polyanions); this outcompetes modified aptamers that are bound nonspecifically and with low affinity to highly abundant plasma proteins, and capitalizes on the slow off-rates (dissociation rates) of aptamers from their intended targets.
Derivation and validation of protein-phenotype models
Models of current health state
Liver fat (predicting liver ultrasound result of no fat or excess fat (excess defined as the combined mild/moderate/severe grades of fat))
Within the Fenland study, 10,077 participants underwent liver ultrasound; 75% had no fat and 25% had mild, moderate or severe fat. An elastic net model was derived, refined and validated in 70, 15 and 15% of the entire sample set, respectively.
Kidney filtration (predicting normal or impaired eGFR (≥ or <60 ml min−1 1.73 m−2))
Within the 2,515 HUNT3 participants in the CV events program, 87% had eGFR ≥60 ml min–1 1.73 m–2 and 13% <60 ml min–1 1.73 m–2 using the creatinine-based CKD–EPI equation39. An elastic net model was derived and refined on 80 and 20% of these participants, respectively. Validation was performed using Covance, an independent sample set with 1,029 participants, of whom 93 and 7% had eGFR of ≥ or <60 ml min–1 1.73 m–2, respectively.
Body composition (predicting dual-energy X-ray absorptiometry (DEXA) components)
Within the Fenland study, 11,471 participants had DEXA scans to assess percentage body fat, lean body mass (kg) and visceral fat (kg), although the last of these was not measurable in 20 subjects. An elastic net linear regression model with continuous output on the same scales as the original measurements was derived, refined and validated on 70, 15 and 15%, respectively, of the total population.
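The 70/15/15% derivation/refinement/validation partition can be sketched as below. The source does not say how participants were allocated to each fraction, so random allocation with a fixed seed is an assumption, and `split_ids` is a hypothetical helper rather than the study's code.

```python
import random


def split_ids(ids, fractions=(0.70, 0.15, 0.15), seed=0):
    """Shuffle participant IDs and cut them into derivation/refinement/validation sets."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    ids = list(ids)
    random.Random(seed).shuffle(ids)  # assumed: random allocation
    n_derive = round(fractions[0] * len(ids))
    n_refine = round(fractions[1] * len(ids))
    derive = ids[:n_derive]
    refine = ids[n_derive:n_derive + n_refine]
    validate = ids[n_derive + n_refine:]  # remainder, so no participant is dropped
    return derive, refine, validate


# 11,471 is the number of Fenland participants with DEXA scans quoted in the text.
derive, refine, validate = split_ids(range(11471))
```

Assigning the remainder to the validation set guarantees that the three sets are disjoint and together cover every participant, whatever the rounding.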
Cardiopulmonary fitness (predicting maximal oxygen uptake on a treadmill (VO2 max), ml kg−1 min−1)
Within the HERITAGE Family study, 648 participants completed maximal exercise tests and had blood samples and measures of VO2 max at baseline and after a 20-week exercise regimen. An unpaired cross-over sampling method (with 50% of samples from participants at baseline and 50% from participants post-exercise) was used to avoid correlation from pairs and to increase the observed range of fitness values in the dataset. An elastic net linear regression model was derived, refined and validated on 80, 10 and 10% of participants, respectively.
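One way to realize the unpaired sampling described above is to let each participant contribute exactly one sample, taken at baseline for half of the participants and post-exercise for the other half. The sketch below assumes random assignment to the two halves, which the source does not specify; `unpaired_crossover` is a hypothetical name.

```python
import random


def unpaired_crossover(participant_ids, seed=0):
    """Assign each participant a single time point so that no paired samples remain."""
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)  # assumed: random assignment to time point
    half = len(ids) // 2
    return {
        pid: ("baseline" if i < half else "post_exercise")
        for i, pid in enumerate(ids)
    }


# 648 is the number of HERITAGE participants quoted in the text.
assignment = unpaired_crossover(range(648))
n_baseline = sum(tp == "baseline" for tp in assignment.values())
n_post = sum(tp == "post_exercise" for tp in assignment.values())
```

Because each participant appears once, the two measurements from the same person never enter the dataset together, which is the within-pair correlation the design is meant to avoid.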
Modifiable behavioral factors
Alcohol consumption (predicting self-reported consumption above or below UK guidelines (14 units/week for men and women))
Within the Fenland study there were 4,851 women, of whom 11% reported consumption above UK guidelines, and 4,803 men, of whom 31% reported consumption above guidelines. Elastic net regression models were derived, refined and validated using the same 70/15/15% sample distribution; separate models were created for men and women to account for residual error differences associated with participants’ sex.
Physical activity (predicting average daily physical activity energy expenditure estimated from combined heart rate and movement sensing for 1 week (kJ kg−1 d−1 or kcal d−1))
This was calculated for the 11,695 participants within the Fenland study with this measure available, using the same 70/15/15% fractions for derivation/refinement/validation as for body composition. An elastic net linear regression model was validated with a kcal d–1 output.
Current cigarette smoking (predicting self-reported questionnaire results)
Of the 1,025 Covance participants 15% self-reported as current smokers and 85% former or never smokers. An elastic net regression model was derived and validated in 80 and 20% of the participants, respectively.
Future cardiometabolic health risks
Incident diabetes (predicting future diagnosis in people with pre-diabetes)
There were 413 participants within the Whitehall II study at baseline who had pre-diabetic fasting glucose (5.5–6.9 mmol l–1) or elevated 2-h glucose (7.8–11.0 mmol l–1) during an oral glucose tolerance test, of whom 23% became diabetic within 10 years. An elastic net Cox proportional hazards model was derived on 80% of this pre-diabetic fraction and then validated on a 20% blinded holdout fraction. A decision risk threshold of greater than threefold (in reference to the average risk score in all Whitehall participants in our study, not just the pre-diabetic fraction) was defined and applied to the pre-diabetic participants.
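The threefold decision threshold can be illustrated with a brief sketch. It assumes that risk scores are expressed on a relative-hazard scale (not a log scale) and that the reference is the mean score across all participants; both are assumptions, and the names and values are hypothetical.

```python
def flag_high_risk(risk_scores, reference_mean, fold=3.0):
    """Flag scores more than `fold` times the reference population's mean score."""
    if reference_mean <= 0:
        raise ValueError("reference mean must be positive")
    return [score / reference_mean > fold for score in risk_scores]


# Hypothetical scores: the reference mean comes from the whole cohort,
# while the threshold is applied only to the pre-diabetic subset.
all_participant_scores = [0.5, 1.0, 1.5, 1.0]
reference_mean = sum(all_participant_scores) / len(all_participant_scores)
prediabetic_flags = flag_high_risk([0.8, 3.5, 2.9], reference_mean)
```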
Incident CV events (predicting any type of first event or CV death within 5 years)
A fully parametric accelerated failure time (AFT) survival model was derived from HUNT3 using a case-cohort design. There were 1,050 cases with an incident ‘hard’ CV event (CV death, myocardial infarction, stroke or hospitalization for heart failure) and a random fraction of 1,414 participants selected from the overall cohort, for a total of 2,464 participants. The model was derived and refined on 80 and 20% of HUNT3, respectively. It was validated in Whitehall II using samples from all 101 cases with an incident CV event within 5 years and a random fraction of the cohort (164 participants) without an incident CV event within 5 years. The model is capable of relative risk stratification ranging from ≤one- to ≥sixfold compared to low-risk individuals at an absolute event rate of <2.5% in 5 years.
Quality control and data normalization
All samples from all studies were run on the SomaScan assay, and standard SomaLogic normalization, calibration and data quality control processes were applied as described in detail below.
Quality control over the first year of production for the SomaScan V4 Assay was performed on an average of 2,000 samples per week using 24 assay runs, which include 11 control replicates from three control lots and a maximum of 85 samples per run. Reference standards, expected values for each protein control replicate lot for each SOMAmer reagent, are derived during assay qualification. Five calibrator replicates per run are used with a reference standard to control for batch effects. Three quality control replicates per run are used with a reference standard to evaluate the accuracy of the assay after data standardization. Standard assay run acceptance criteria require that 85% of the content is accurate to within 20% of the reference; in practice, an average of 96% of the content meets the acceptance standard. The lifetime median precision of the assay over ~3,000 plasma quality control replicates and 5,207 SOMAmer reagents to protein targets is 6.2% (fifth percentile, 3.4%; 95th percentile, 19.1%). In addition to standard acceptance criteria, alternate assay summary metrics—including overall run signal bias from the reference, calibration scale factor percentage outside of 0.6–1.4, quality control replicate five-plate running precision and buffer background or estimated lower limit of detection—are monitored for failures or trends over time on a daily basis by production bioinformatics and quality assurance.
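The run acceptance rule described above can be illustrated with a minimal sketch. It assumes that 'accurate to within 20% of the reference' means that the ratio of a quality control replicate signal to its reference standard lies between 0.8 and 1.2. The function name and the dictionary layout are hypothetical and are not taken from the production pipeline.

```python
def run_passes(measured, reference, tolerance=0.20, min_fraction=0.85):
    """Check one assay run against the acceptance rule (illustrative only).

    measured / reference: dicts mapping a reagent ID to the quality control
    replicate signal and to its reference-standard value (assumed layout).
    Returns (passed, fraction_within_tolerance).
    """
    within = 0
    for reagent, ref_value in reference.items():
        ratio = measured[reagent] / ref_value
        # Assumption: "within 20%" is read as |ratio - 1| <= 0.20.
        if abs(ratio - 1.0) <= tolerance:
            within += 1
    fraction = within / len(reference)
    return fraction >= min_fraction, fraction
```

With the defaults, a run passes when at least 85% of reagents fall inside the tolerance band, which mirrors the stated criterion.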
To correct for assay-intrinsic variation such as that due to minor variation in sample dilutions by the pipetting robot, we have generally used (in previous studies) typical median normalization—scaling the total fluorescence from a given sample to the median on the same 96-well assay plate. This has two limitations: first, the scaling of any one sample can be impacted by the other samples on the plate that establish the median; second, there are assay-extrinsic sources of variation in the sample that can affect overall fluorescence, such as sample quality (where plasma from samples with lysed cells is ‘brighter’ because of the leakage of intracellular proteins) and kidney function (where lower filtration rates lead to the elevation of a large proportion of the proteome and again ‘brighter’ samples). In this study, both these limitations were overcome: the former by using an external reference for the median, rather than the other samples on the same plate, and the latter by restricting the analytes used for normalization to those not impacted by sample quality or disease. This was accomplished by comparing each analyte in a new sample to its counterpart in a reference well-collected ‘healthy’ population (the Covance study described in this manuscript). The subset of analytes in the test sample that were within the expected population distribution of fluorescence in the reference sample were used for calculation of the normalization scale factors.
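The reference-based normalization described above might be sketched as follows. This is an illustration under stated assumptions rather than the SomaLogic implementation: the reference population is summarized per analyte by a median and by lower and upper bounds of its expected distribution, and the scale factor is taken as the median ratio of reference median to sample signal over the analytes that fall inside those bounds. All function and argument names are hypothetical.

```python
from statistics import median

def normalization_scale_factor(sample, ref_median, ref_low, ref_high):
    """Scale factor for one sample against an external reference.

    sample, ref_median, ref_low, ref_high: dicts keyed by analyte ID
    (assumed layout). Only analytes whose signal lies within the
    reference range contribute to the scale factor.
    """
    ratios = [
        ref_median[a] / value
        for a, value in sample.items()
        if ref_low[a] <= value <= ref_high[a]
    ]
    if not ratios:
        raise ValueError("no analytes fall within the reference range")
    return median(ratios)

def normalize(sample, scale):
    """Apply a single scale factor to every analyte in the sample."""
    return {a: value * scale for a, value in sample.items()}
```

Because the median is taken against a fixed reference and over in-range analytes only, the result depends neither on the other samples on the plate nor on analytes that are strongly elevated by sample quality or disease.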
Statistical analysis and machine learning
Statistical analysis plans for each model were prospectively documented and filed to an auditable software regulatory document vault (Veeva Vault (Veeva, Inc.)) before analysis, such that the studies became ‘virtual prospective trials’ on retrospectively assayed, archived samples. Sample-size calculations were not carried out prospectively as the probable effect sizes were hitherto unknown.
Supervised machine learning is the process whereby a computer uses an algorithm applied to data to ‘train’ a model—to derive a fixed equation relating the features chosen to a predesignated truth standard. The algorithm makes predictions on the training data, the error between predicted and actual values of the truth standard is assessed and the algorithm is applied iteratively with small changes in parameters to reduce the predictive error. Learning can stop when the algorithm achieves its highest level of performance assessed after cross-validation (multiple iterations of model assessment on different splits in the training data). In this study, the features in a model are the protein measurements and the truth standards are the health outcomes or measures of behavior.
When developing predictive models using machine learning techniques, to avoid over-fitting it is common practice to use multiple datasets or fractions of datasets to identify and test or validate the model that has the most reliable predictive capabilities. To this end, we applied the following tactics for splitting data. If the dataset is large (thousands, for example, Fenland), the data are split into three sets: a derivation set, used for identifying top models through cross-validation (typically a 70% fraction and five repeats of tenfold cross-validation), a refinement set (a second derivation set that allows us to tune the parameters of the top models, typically 15%) and a validation set (a holdout set that is used only to assess the final model and is not used for model development, typically 15%). If the dataset is smaller (hundreds, for example, Covance), the data are split into two sets: a derivation set of 80% that again uses cross-validation (typically tenfold 90/10% derivation/refinement splits within that 80% fraction) and a validation set of 20% not used for model development. If the dataset contains pairs of samples from the same subjects (for example, Heritage), the data are split into two sets: a derivation set (80–90%) and a validation set (10–20%). Within the derivation set, the model is derived on time point 1 from half the participants and on time point 2 from the other half (avoiding pairs of samples from the same participants). The model is verified on samples with the opposite time points in the same participants, and then validated in the holdout test set data not used for derivation.
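The splitting tactics above can be sketched in a few lines. The example is a minimal illustration, assuming simple random assignment of participant identifiers; the function names and the seed argument are hypothetical, and the study may have used stratified or otherwise constrained splits.

```python
import random

def three_way_split(ids, fractions=(0.70, 0.15, 0.15), seed=0):
    """Randomly split IDs into derivation, refinement and validation sets."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_derivation = round(fractions[0] * len(ids))
    n_refinement = round(fractions[1] * len(ids))
    derivation = ids[:n_derivation]
    refinement = ids[n_derivation:n_derivation + n_refinement]
    validation = ids[n_derivation + n_refinement:]  # remainder is the holdout
    return derivation, refinement, validation

def assign_derivation_time_points(participants, seed=0):
    """For paired samples, pick one time point per participant for derivation.

    Half of the participants contribute time point 1 and the other half
    time point 2, so no participant contributes both samples to derivation.
    The opposite time point is then available for verification.
    """
    participants = list(participants)
    random.Random(seed).shuffle(participants)
    half = len(participants) // 2
    return {p: (1 if k < half else 2) for k, p in enumerate(participants)}
```

Fixing the seed makes the partition reproducible, which matters when the validation fraction must remain untouched during model development.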
Because of the intent to test the extent to which proteins could be a sole information source, demographic features or other laboratory test results were deliberately excluded from the feature selection process, with two exceptions: (1) if the predefined minimum performance could not be reached, the most impactful demographic factor could be added; and (2) if the residual errors within a model were related to a demographic feature. In practice, these exceptions were triggered only twice: to include age interactions in the CV model to exceed the performance of the 2013 ACC/AHA ASCVD risk score, and to use sex to create separate alcohol models for men and women to overcome a sex-related residual error distribution.
The sequence of events for model development was initiated with the definition and documentation of the analysis plan, the truth standard (the variable against which the model is trained) and minimum acceptable performance standard for a model. This was followed by normalization and calibration of the proteins measured in the datasets, the assessment of sample quality, the exclusion of any measured proteins failing to meet the quality control measures described above from model development, and the division of the available datasets into training, refinement and validation as shown in Extended Data Fig. 1.
This was followed by univariate ranking and filtering of proteins’ statistical association with the truth standard within the training data, and automated application to the training data of several different types of machine learning algorithms with different methodological approaches40,41.
A semi-automated approach to univariate testing and machine learning analyses was designed to understand efficiently whether there is any evidence of signal for the endpoint of interest, and to identify the model type that is the best match for the data. The derivation dataset was used for univariate tests and preliminary machine learning models.
For continuous measurements (lean body mass, percentage body fat, alcohol consumption, energy expenditure from physical activity, visceral adipose tissue, cardiopulmonary fitness VO2 max, weight trajectory and OGTT) we used regression methods. The associations between each analyte and endpoint (lean body mass or percentage body fat), on a univariate level, were assessed using the univariate tests for coefficients/importance metrics from linear and robust regression models, Spearman’s correlation coefficient and random forest (importance scores calculated). Following the univariate analyses, candidate features were ranked based on false discovery rate (FDR)-corrected P values. At this stage, fairly lenient FDR-corrected P values of 0.1 or even 0.2 were used to enrich the lists because the truly multivariate models would not depend on univariate significance, but nonetheless there is a need to perform some reduction in dimensionality. Using this subset of features, the following types of models were fit: elastic net linear models (which combine LASSO and ridge penalties for feature reduction), support vector machines (which are more robust to outliers) and random forests (a nonlinear, tree-based approach).
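The univariate filtering step can be illustrated with a short sketch. The text does not name the FDR procedure, so the Benjamini–Hochberg adjustment used here is an assumption, as are the function names; the 0.1 cutoff is one of the lenient thresholds mentioned above.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def select_features(p_by_feature, fdr_cutoff=0.1):
    """Rank features by adjusted P value and keep those under the cutoff."""
    names = list(p_by_feature)
    adjusted = benjamini_hochberg([p_by_feature[n] for n in names])
    kept = [(q, n) for q, n in zip(adjusted, names) if q <= fdr_cutoff]
    return [n for q, n in sorted(kept)]
```

The retained features would then be passed to the multivariate models, which perform their own selection through penalization.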
For dichotomous measurements (liver fat, current cigarette smoking and kidney filtration) we used classification methods. The associations between each analyte and endpoint (liver steatosis, cigarette smoking or kidney filtration), on a univariate level, were assessed using t-tests, Mann–Whitney, logistic regression and random forest (importance scores calculated). The same approach to utilizing univariate FDR-corrected P value ranking to aid dimensionality reduction was used as for continuous measurements. For the preliminary multivariate models, five repeats of tenfold cross-validation were used in derivation. The following multivariate, machine learning models were then run: elastic net logistic regression model (which combines LASSO and ridge penalties for feature reduction), linear discriminant analysis (similar to Naïve Bayes, but handles correlated features better) and random forest (a nonlinear, tree-based approach).
For survival data (diabetes diagnosis within 10 years and CV primary event risk), we used survival models. The association between each aptamer and the rate of diagnosis (binary outcome and time to event or censoring) on a univariate level was assessed using AFT survival models and Cox proportional hazards models. Again, FDR-corrected P values were used to reduce the number of candidate features to 200. This reduction was done so that the AFT and Cox proportional hazards algorithms converged. For the preliminary multivariate models, five repeats of tenfold cross-validation were used. The following multivariate, machine learning models were run: elastic net AFT models (which combine the ridge and LASSO penalties) and proportional hazards elastic net models.
Given that the elastic net routine consistently gave the best result and was ultimately selected for each model, we describe here the processes specific to that algorithm. There are two penalization parameters (variables that add a penalty to each new feature added to a model). The first parameter is associated with specific penalization of any correlated features, and the second is associated with penalization of the overall number of features in the model. Without such penalization, some algorithms would include all the measured proteins. Readers more familiar with the LASSO algorithm may be interested to know that it is equivalent to setting the elastic net correlated feature penalty to its maximum setting so that these are eliminated40. In contrast, the elastic net allows the inclusion of more correlated features as that penalty is reduced. The optimal values of these parameters are determined during the cross-validation phase, during which each of the two parameters is varied at fixed increments and model performance is assessed for each combination of settings. The parameter values associated with the model that has the best predictive performance are then selected as the final values.
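For orientation, the two penalties can be written out for the linear regression case. The form below follows the parameterization commonly used in elastic net software, which is an assumption here because the text does not give an equation; under this reading the first parameter corresponds to the mixing weight α and the second to the overall penalty strength λ.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\; \lambda\Bigl[\alpha\,\lVert\beta\rVert_1
      + \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\Bigr],
\qquad 0 \le \alpha \le 1,\;\; \lambda \ge 0
```

Setting α = 1 recovers the LASSO and α = 0 gives ridge regression; intermediate values tend to retain groups of correlated features. Cross-validation over a grid of (α, λ) pairs then selects the combination with the best predictive performance.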
During model refinement and before validation, advanced feature selection techniques, such as forward selection, backward selection and stability selection, were applied to the features that passed the FDR cutoff. Ensemble methods and approaches were employed to develop the optimal model. In the cross-validation stage, models were optimized based on AUC, sensitivity and specificity for classification and survival models, and adjusted r2 values for continuous endpoint models. For survival models, the C-index, Brier score and net reclassification improvement (NRI)38 were also examined. These predictive metrics were confirmed by analyzing the test or holdout datasets. The number of features within a model was determined simply by the algorithmic selection of the optimal number (for example, by elastic net or LASSO).
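As an illustration of one of the survival metrics named above, the following is a minimal sketch of Harrell's concordance index for right-censored data. It is not the study's code: the function name is hypothetical, pairs with tied event times are simply skipped, and a higher risk score is assumed to indicate an earlier expected event.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index (illustrative implementation).

    times: follow-up time per participant
    events: True/1 if the event was observed, False/0 if censored
    risk_scores: model output, higher meaning higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is only usable if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ranking of participants by time to event.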
The best derived models from the previous step were then examined in more detail. Each of the best models was assessed to determine whether the predefined performance standard could be met without the addition of nonprotein features. Additionally, unwanted associations of errors with sex or sample quality were evaluated, and decision thresholds (or risk-bins) defined to stratify the populations in a simple but informative way.
Validation was performed by applying the final model from derivation and refinement, with its predefined decision thresholds, to the validation dataset. Ideally this would be a truly independent replication dataset (as with the CV and kidney models). However, where such a matching dataset was not available at this time, a random fraction of the same study (10–20% depending on study size) with data not used in training was used for testing the predictive accuracy of the model.
The restriction to people with pre-diabetes for the incident diabetes prediction model reflected the intended-use population for the first clinical application and the assumption that a diabetes prognostic model would be highly impacted by pre-diabetes status. Further results of the diabetes, kidney and CV models are described in Supplementary Tables 4–6. All other models were derived in the general study populations (Extended Data Figs. 2–6), with performance in the participants with pre-diabetes (typically >30%) confirmed when possible.
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Pre-existing data access policies for each of the five parent cohort studies specify that research data requests can be submitted to each steering committee; these will be promptly reviewed for confidentiality or intellectual property restrictions and will not unreasonably be refused. Individual-level patient or protein data may further be restricted by consent, confidentiality or privacy laws/considerations. These policies apply to both clinical and proteomic data.
Sun, B. B. et al. Genomic atlas of the human plasma proteome. Nature 558, 73–79 (2018).
Emilsson, V. et al. Co-regulatory networks of human serum proteins link genetics to disease. Science 361, 769–773 (2018).
Tasaki, S. et al. Multi-omics monitoring of drug response in rheumatoid arthritis in pursuit of molecular remission. Nat. Commun. 9, 2755 (2018).
O’Dwyer, D. N. et al. The peripheral blood proteome signature of idiopathic pulmonary fibrosis is distinct from normal and is associated with novel immunological processes. Sci. Rep. 7, 46560 (2017).
Christensson, A. et al. The impact of the glomerular filtration rate on the human plasma proteome. Proteom. Clin. Appl. 12, e1700067 (2018).
Ganz, P. et al. Development and validation of a protein-based risk score for cardiovascular outcomes among patients with stable coronary heart disease. J. Am. Med. Assoc. 315, 2532–2541 (2016).
Wood, G. C. et al. A multi-component classifier for nonalcoholic fatty liver disease (NAFLD) based on genomic, proteomic, and phenomic data domains. Sci. Rep. 7, 43238 (2017).
Han, Z. et al. Validation of a novel modified aptamer-based array proteomic platform in patients with end-stage renal disease. Diagnostics (Basel) 8, 71 (2018).
Menni, C. et al. Circulating proteomic signatures of chronological age. J. Gerontol. A 70, 809–816 (2014).
Thrush, A. et al. Diet-resistant obesity is characterized by a distinct plasma proteomic signature and impaired muscle fiber metabolism. Int. J. Obes. 42, 353–362 (2018).
Williams, S. A. et al. Improving assessment of drug safety through proteomics: early detection and mechanistic characterization of the unforeseen harmful effects of torcetrapib. Circulation 137, 999–1010 (2018).
Rohloff, J. C. et al. Nucleic acid ligands with protein-like side chains: modified aptamers and their use as diagnostic and therapeutic agents. Mol. Ther. Nucleic Acids 3, e201 (2014).
Gold, L. et al. Aptamer-based multiplexed proteomic technology for biomarker discovery. PLoS ONE 5, e15004 (2010).
Brody, E. et al. Life’s simple measures: unlocking the proteome. J. Mol. Biol. 422, 595–606 (2012).
Kim, C. H. et al. Stability and reproducibility of proteomic profiles measured with an aptamer-based platform. Sci. Rep. 8, 8382 (2018).
Candia, J. et al. Assessment of variability in the SOMAscan assay. Sci. Rep. 7, 14248 (2017).
GBD 2013 Risk Factors Collaborators, Forouzanfar, M. H. et al. Global, regional, and national comparative risk assessment of 79 behavioural, environmental and occupational, and metabolic risks or clusters of risks in 188 countries, 1990–2013: a systematic analysis for the Global Burden of Disease Study 2013. Lancet 386, 2287–2323 (2015).
Maruthappu, M. Delivering triple prevention: a health system responsibility. Lancet Diabetes Endocrinol. 4, 299–301 (2016).
Robson, J. et al. The NHS Health Check in England: an evaluation of the first 4 years. BMJ Open 6, e008840 (2016).
Valabhji, J. et al. Efficacy and effectiveness of screen and treat policies in prevention of type 2 diabetes: systematic review and meta-analysis of screening tests and interventions. BMJ 356, i6538 (2017).
Middleton, K. R., Anton, S. D. & Perri, M. G. Long-term adherence to health behavior change. Am. J. Lifestyle Med. 7, 395–404 (2013).
Dimitrov, D. V. Medical internet of things and big data in healthcare. Health Inf. Res. 22, 156–163 (2016).
Flores, M., Glusman, G., Brogaard, K., Price, N. D. & Hood, L. P4 medicine: how systems medicine will transform the healthcare sector and society. Per. Med. 10, 565–576 (2013).
Musich, S., Wang, S., Hawkins, K. & Klemes, A. The impact of personalized preventive care on health care quality, utilization, and expenditures. Popul. Health Manag. 19, 389–397 (2016).
Ezkurdia, I. et al. Multiple evidence strands suggest that there may be as few as 19,000 human protein-coding genes. Hum. Mol. Genet. 23, 5866–5878 (2014).
Lin, H. et al. Discovery of a cytokine and its receptor by functional screening of the extracellular proteome. Science 320, 807–811 (2008).
Harrell, F. E. Jr. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis (Springer, 2015).
Pencina, M. J. et al. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat. Med. 27, 157–172 (2008).
Fielding, C. M. & Angulo, P. Hepatic steatosis and steatohepatitis: are they really two distinct entities? Curr. Hepatol. Rep. 13, 151–158 (2014).
Yki-Jarvinen, H. Non-alcoholic fatty liver disease as a cause and a consequence of metabolic syndrome. Lancet Diabetes Endocrinol. 2, 901–910 (2014).
Shuster, A., Patlas, M., Pinthus, J. H. & Mourtzakis, M. The clinical importance of visceral adiposity: a critical review of methods for visceral adipose tissue analysis. Br. J. Radiol. 85, 1–10 (2012).
Ross, R. et al. Importance of assessing cardiorespiratory fitness in clinical practice: a case for fitness as a clinical vital sign: a scientific statement from the American Heart Association. Circulation 134, e653–e699 (2016).
de Souza e Silva, C. G. et al. Association between cardiorespiratory fitness, obesity, and health care costs: The Veterans Exercise Testing Study. Int. J. Obes. (Lond.) https://doi.org/10.1038/s41366-018-0257-0 (2018).
Hobbs, F. D., Jukema, J. W., Da Silva, P. M., McCormack, T. & Catapano, A. L. Barriers to cardiovascular disease risk scoring and primary prevention in Europe. QJM 103, 727–739 (2010).
Ostroff, R. M. et al. Unlocking biomarker discovery: large scale application of aptamer proteomic technology for early detection of lung cancer. PLoS ONE 5, e15003 (2010).
Ostroff, R. M. et al. Early detection of malignant pleural mesothelioma in asbestos-exposed individuals with a noninvasive proteomics-based surveillance tool. PLoS ONE 7, e46091 (2012).
Usher-Smith, J. A., Sharp, S. J. & Griffin, S. J. The spectrum effect in tests for risk prediction, screening, and diagnosis. BMJ 353, i3139 (2016).
Ganna, A. et al. Risk prediction measures for case-cohort and nested case-control designs: an application to cardiovascular disease. Am. J. Epidemiol. 175, 715–724 (2012).
Levey, A. S. et al. A new equation to estimate glomerular filtration rate. Ann. Intern. Med. 150, 604–612 (2009).
Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. B 67, 301–320 (2005).
Tibshirani, R. Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. B 58, 267–288 (1996).
The Whitehall II study is supported by the UK Medical Research Council (no. MR/R024227/1, to M.K.), the US National Institute on Aging (NIH, nos. US R01AG056477, R01AG062553) to M.K. and the British Heart Foundation (no. RG/16/11/32334) to M.J.S. A.D.H. is a NIHR Senior Investigator and was also supported, in part, by the National Institute for Health Research University College London Hospitals Biomedical Research Centre and the UCL BHF Research Accelerator (AA/18/6/34223). FENLAND (the Fenland study, no. 10.22025/2017.10.101.00001) is funded by the UK Medical Research Council (no. MC_UU_12015/1), and N.W. is a NIHR Senior Investigator. We also thank the Fenland Study Investigators, Fenland Study Co-ordination team and the Epidemiology Field, Data and Laboratory teams. HUNT3 is funded by the Norwegian Ministry of Health, Norwegian University of Science and Technology and Norwegian Research Council, Central Norway Regional Health Authority, the Nord-Trondelag County Council and the Norwegian Institute of Public Health. The HERITAGE Family study was funded by the US National Heart, Lung and Blood Institute grants (NIH/NHLBI, no. R01HL146462 to M.A.S.) and no. HL45670 (HERITAGE, to C.B.). All authors are grateful to all volunteers/participants in all of the cohorts, and to the general practitioners, other physicians and practice staff for assistance with recruitment. SomaScan assays and the Covance study were funded by SomaLogic, Inc. The authors also thank A. Lowell (leader of the SomaLogic assay team), D. Perry for the bioinformatics of quality control, J. Williams for the agreements with the study institutions and J. Zach for clinical data organization and management.
The SomaLogic co-authors (S.W., L.A., J.A., T.B., J.C., G.D., R.K.D., Y.H., M.H., R.O. and S.W.) were/are all employees of SomaLogic, Inc., which has a commercial interest in the results. N.W. and C.L. declared that SomaLogic, Inc. has given a grant to the University of Cambridge. P.G. is a member of the SomaLogic Medical Advisory board, for which he receives no remuneration of any kind. The remaining authors (M.K., A.H., J.P.C., C.B., C.J., M.S. and M.S.) have no competing interests.
Peer review information Jennifer Sargent was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Extended Data Fig. 1 Descriptors of parent studies and fractions used for model derivation and validation.
Solid black arrows designate how fractions of samples and clinical data were utilized independently; blue dashed arrows designate the validation of finalized models either in new fractions of the same dataset or in independent datasets. eGFR, estimated glomerular filtration rate; VO2max, maximum rate of oxygen consumption; kg, kilograms. *For Fenland, the precise numbers available for the 70%/15%/15% fractions depended on the numbers of participants with data for each endpoint as follows: n = 9,654 for self-reported alcohol units, n = 11,471 with DEXA scans for body composition, n = 10,077 with ultrasound for liver fat and n = 11,695 with individually calibrated heart rate and movement sensing for caloric expenditure due to physical activity. **For HERITAGE, the model was trained on the pre-training time point from half the 523 participants and the post-training time point from the other half of the participants. The model was tested on samples with the opposite time points in the same participants and finally replicated in the 10% fraction not used for training.
Details of the 5 parent cohort studies.
Participant characteristics for current health state models.
Participant characteristics for current state body composition models.
Participant characteristics for modifiable behavioral factors models.
Participant characteristics for future metabolic health risks models.
Williams, S.A., Kivimaki, M., Langenberg, C. et al. Plasma protein patterns as comprehensive indicators of health. Nat Med 25, 1851–1857 (2019) doi:10.1038/s41591-019-0665-2