Main

Venous thromboembolism (VTE) is a major source of healthcare cost1, morbidity2 and mortality in patients with cancer3. Prophylactic anticoagulation lowers the risk of VTE4,5,6,7,8, although defining the patients most likely to benefit is challenging. Expert guidelines9,10 recommend offering prophylaxis based on the Khorana score (KS)11, a validated cancer VTE risk stratification measure based on hematologic and clinical parameters. However, of patients with a high-risk KS, 10% or fewer develop VTE6,7, so many patients receive unnecessary anticoagulation with KS-guided prophylaxis. Furthermore, many patients who develop VTE do not have a high-risk KS12,13,14. Models adding clinical features15,16 based on large observational datasets, such as a recently published risk assessment model (RAM)15, may increase sensitivity and specificity, although these gains are modest17. Furthermore, such risk scores require provider assessment or electronic health record (EHR) integration; thus, most providers do not assess patients for VTE risk18,19. Genetic20,21, microparticle22 and proteomic23 approaches show promise in risk-stratifying patients for VTE but, to date, are not readily deployed in practice. An accurate, easily integrated VTE risk stratification system would be helpful for identifying which patients would benefit from prophylactic anticoagulation or its de-escalation.

Circulating tumor DNA (ctDNA) sequencing assays (‘liquid biopsies’ (LBs)) are increasingly deployed in clinic, with multiple US Food and Drug Administration (FDA) approvals for matching to molecularly targeted therapy24. In patients already receiving ctDNA sequencing, an LB-based VTE risk score, if prognostically valid, could, thus, be provided without additional overhead to the patient or clinician. Preliminary data suggest that cell-free DNA (cfDNA), which may consist of tumor or wild-type DNA, is thrombogenic, at least in part due to its association with neutrophil extracellular traps (NETs)25,26. ctDNA detection is associated with worse survival likely due to more aggressive tumor physiology, although whether it is associated with VTE is unknown27,28,29,30,31. We performed an observational study in non-overlapping cohorts of patients with cancer undergoing ctDNA sequencing with two main goals: (1) to determine whether ctDNA is associated with VTE and (2) to develop and test whether DNA LB-based machine learning models can predict VTE.

Results

ctDNA and VTE

We studied three cohorts: a discovery cohort (n = 4,141) and a prospective validation cohort (n = 1,426) of patients with ctDNA sequencing at Memorial Sloan Kettering (MSK) with any cancer type and sequencing using a New York State-approved assay (MSK-ACCESS32) and a generalizability cohort (n = 463) of patients with advanced non-small cell lung cancer (NSCLC) at MSK and GenesisCare, a community oncology setting in Sydney, Australia, sequenced using an FDA-approved commercial assay (ctDx Lung; Methods: ‘Cohort selection’ and Extended Data Fig. 1). Both assays are indicated for matching patients to molecularly targeted therapy, either in the treatment-naive setting or after progression of disease on previous treatment27.

Patient characteristics are presented in Table 1. A total of 464 (11%) patients in the discovery cohort, 118 (8%) patients in the validation cohort and 98 (21%) patients in the generalizability cohort developed VTE after plasma draw. Patients with 53 different cancer types were included (Supplementary Table 1). As expected33, NSCLC and pancreatic and hepatobiliary cancers were associated with higher VTE rates, whereas melanoma, breast and colorectal cancers were associated with lower VTE rates (Extended Data Fig. 2).

Table 1 Patient characteristics

We performed time-to-VTE analyses with death as a competing risk, excluding patients with VTE before ctDNA sequencing. In the discovery cohort, ctDNA detection was associated with higher rates of VTE (hazard ratio (HR) = 2.49, 95% confidence interval (CI): 1.99–3.11, P < 0.001; Fig. 1a). Patients with higher variant allele fraction (VAF)—that is, proportion plasma DNA attributable to tumor—had higher rates of VTE, suggesting a dose-dependent relationship between ctDNA and VTE risk (Fig. 1b). The association between ctDNA and VTE rates held in subgroup analyses for NSCLC and for melanoma, pancreatic and less represented cancers but not in bladder, hepatobiliary and colorectal cancers (Fig. 1c). In contrast, there was some evidence that the plasma cfDNA yield, which can consist of either tumor or wild-type DNA, was associated with higher VTE rates in all cancer subtypes (Extended Data Fig. 3).

Fig. 1: ctDNA is associated with cancer-associated VTE risk.

a, Aalen–Johansen curves and associated Fine–Gray HR for time-to-VTE from time of plasma draw (first ctDNA sequencing) with death as a competing risk for patients with versus without ctDNA in the discovery cohort (P value from two-sided test = 1.1 × 10⁻¹⁵). b, Aalen–Johansen curves further stratified by VAF quartile (q1: detected VAF < 0.005, q2: 0.005 ≤ VAF < 0.021, q3: 0.021 ≤ VAF < 0.112, q4: 0.112 ≤ VAF < 0.99). c, HR for VTE if ctDNA+ by cancer type for all cancers. d, HR for VTE if an alteration in the listed genes was detected versus not detected in plasma (adjusted HR (aHR) for the cancer types in c for all genes with detected alterations in at least 30 patients in the discovery cohort). e, Fine–Gray HR for VTE if ctDNA+ in the validation and generalizability cohorts. For c–e, center points denote HR, and whiskers denote 95% CI.

Certain tumor genomic alterations may predispose to VTE34. To assess whether the association with ctDNA alterations and VTE risk was gene specific, we performed subgroup analyses in which we compared VTE risk between patients with specific pathogenic gene-level alterations observed in ctDNA and all other patients without the specific gene-level alteration in question detected in plasma. To account for the fact that cancer type is correlated with both tumor genotypes and VTE risk, cancer type was included as a variable in these analyses. Alterations in nearly all genes had some evidence for association with VTE, although known thrombogenic alterations, such as KRAS, STK11 and KEAP1, had more evidence for association with VTE rates (Fig. 1d). To further test whether the prognostic value of ctDNA may be attributable to tumor genomic content, we performed a sensitivity analysis among patients with matched tumor sequencing with an FDA-authorized targeted sequencing panel (MSK-IMPACT; n = 2,873). In this cohort, pathogenic gene-level alterations as confirmed on tissue sequencing, cancer type, disease stage and ctDNA detection in matched plasma sequencing were included as variables in a multivariate model to predict VTE risk. In this analysis, ctDNA detection was associated with higher VTE risk, whereas most gene-level alterations, including those in KRAS, STK11 and KEAP1, were not associated with VTE risk (Supplementary Table 2). Thus, the association between ctDNA and VTE risk appears largely independent of tumor genomics.

The validation and generalizability cohorts confirmed the association of ctDNA with VTE (Fig. 1e). To control for variability in time of plasma draw since diagnosis, we performed a sensitivity analysis measuring time to VTE from time of diagnosis, left truncating at time of plasma draw. Here, ctDNA detection was also associated with higher rates of VTE (Extended Data Fig. 4a). To assess the chronicity by which ctDNA associates with VTE, we performed a sensitivity analysis limited to patients still at risk 6 months after plasma draw. ctDNA was associated with higher rates of VTE 6 months after plasma draw (Extended Data Fig. 4b). The relationship between ctDNA and VTE was also observed when death was treated as a censoring event rather than a competing risk (Extended Data Fig. 4c).

Some patients (n = 537) had multiple plasma draws. Most patients who were ctDNA+ remained ctDNA+ on subsequent draws (odds ratio (OR) = 7.9, 95% CI: 5.3–11.7); however, a minority of patients did switch from ctDNA+ to ctDNA− and vice versa. Stratification by ctDNA detection in the first two plasma draws (median, 308 d apart) suggests that, in serial samples, later ctDNA measurements had greater association with VTE than earlier measurements (Extended Data Fig. 4d). Thus, ctDNA is a prognostic marker for VTE in multiple sensitivity analyses and dynamically reflects VTE risk.

ctDNA levels may associate with both future and prior risk of VTE. To test the latter association, we compared ctDNA levels among patients with versus without prior VTE in the discovery cohort; in those with prior VTE, ctDNA VAF was higher (Extended Data Fig. 5).

ctDNA VAF was correlated with known VTE-associated factors—that is, KS, cfDNA concentration and cytotoxic chemotherapy receipt (Extended Data Fig. 6). It is also known that ctDNA levels are associated with overall tumor burden35; as expected, ctDNA VAF was also correlated with number of disease organ sites (Extended Data Fig. 6). We assessed the independent contribution of these variables to VTE prediction using a multivariate model. ctDNA, cfDNA concentration, KS, chemotherapy receipt and number of disease sites were all independently associated with higher VTE rates (Fig. 2a). The trends in ctDNA and cfDNA as independent predictors of VTE held when the cohort was stratified by cytotoxic chemotherapy receipt, a known risk factor for VTE36 (Extended Data Fig. 7). To further analyze whether ctDNA’s association with VTE was independent of disease stage, we repeated our multivariate analysis in the subgroup of patients with only stage IV or stage I–III disease at diagnosis. In these analyses, ctDNA, cfDNA and chemotherapy receipt also remained independently associated with higher VTE rates stratified by stage at diagnosis (Extended Data Fig. 7).

Fig. 2: Predictors of VTE in patients with cancer.

a, Multivariate Fine–Gray model (center: HR, whiskers: ±95% CI) comparing associations between the listed variables and VTE (+ctDNA, patients with detectable ctDNA; cfDNA, cfDNA concentration in ng ml⁻¹ plasma) in the discovery cohort. b, RSF trained on only listed subset of variables (All, all variables in separate categories combined; see Supplementary Information for details). Bar plots show mean Harrell’s c-index, with dots corresponding to individual experiments in five-fold cross-validation experiments or the validation and generalizability cohorts, for time of cancer-associated VTE from time of ctDNA draw. c, Dynamic ROC curve for the probability of VTE at 6 months computed using the listed RSFs. AUC, true-positive rate (TPR; sensitivity), true-negative rate (TNR; specificity) and recall (±95% CI) were computed in five-fold cross-validation (c.v.) in the discovery cohort or using the validation or generalizability cohorts as the test cohort.

It is possible that radiomic features, such as metabolic tumor volume (MTV)37,38,39,40, may better capture disease burden than the number of organs involved with cancer. We tested whether ctDNA remained an independent predictor of VTE in the presence of MTV and the aforementioned variables in a cohort of patients with stage IV, treatment-naive lung adenocarcinoma27. In this analysis, ctDNA remained an independent predictor of VTE (Extended Data Fig. 7). Thus, ctDNA appears to be associated with VTE independent of tumor burden and other features.

To further benchmark the utility of ctDNA VAF as a quantitative biomarker for VTE, we computed the time-dependent area under the receiver operating characteristic (AUROC) curve for VTE within 6 months as well as performance metrics at an optimal threshold (based on Youden’s index41). We compared these results to similar metrics for cfDNA concentration as well as KS and RAM. In this analysis, ctDNA VAF as a single variable had an AUROC of 0.66, greater than that of cfDNA concentration (0.52), KS (0.58) and RAM (0.61). At optimal thresholds, ctDNA VAF had a sensitivity of 0.83, a specificity of 0.44, a positive predictive value (PPV) of 0.60 and a negative predictive value (NPV) of 0.72 (metrics on all tested variables are available in Supplementary Table 3). In summary, ctDNA is a quantitative biomarker for VTE risk with unimodal predictive power greater even than other multifactorial scores, such as KS and RAM.
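The threshold selection described above can be illustrated with a minimal sketch. It assumes a simple binary outcome (VTE within the window or not) and ignores censoring, whereas the analysis in this study used time-dependent ROC methods; the function name and the toy values are hypothetical and are not study data.

```python
def youden_threshold(scores, labels):
    """Return (threshold, sensitivity, specificity) maximizing Youden's J.

    J = sensitivity + specificity - 1. A case is called positive when its
    score is greater than or equal to the threshold. Ties in J keep the
    lowest threshold.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if not y and s < t)
        sens = tp / positives
        spec = tn / negatives
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, t, sens, spec)
    return best[1], best[2], best[3]


# Toy example (invented values): VAF per patient and whether VTE occurred.
vaf = [0.001, 0.004, 0.010, 0.030, 0.050, 0.120, 0.200, 0.400]
vte = [0, 0, 0, 1, 0, 1, 1, 1]
threshold, sensitivity, specificity = youden_threshold(vaf, vte)
```

Because Youden's index weights sensitivity and specificity equally, the chosen cut point can favor high sensitivity at the cost of specificity, as in the ctDNA VAF result reported above.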

The independent relationship between cfDNA concentration and ctDNA VAF and VTE and death was also observed in canonical correlation analysis (Supplementary Discussion: ‘Canonical correlation analysis’ and Supplementary Table 4).

LB-based models for VTE

We hypothesized that machine learning models incorporating LB variables (that is, ctDNA VAF, genomic content and cfDNA concentration) would more accurately predict VTE risk than those without such parameters. A random survival forest (RSF; Methods: ‘Machine learning models’) trained on the discovery cohort including LB variables achieved a five-fold cross-validation c-index of 0.73 (95% CI: 0.71–0.75). Addition of cancer type and cytotoxic chemotherapy receipt (‘LB+’ model) achieved a five-fold cross-validation c-index of 0.74 (95% CI: 0.71–0.77); addition of demographic information, sites of disease, time since diagnosis and KS-related variables (‘All’ model; see Supplementary Information for details) achieved a c-index of 0.75 (95% CI: 0.72–0.78; Fig. 2b). By contrast, KS and RAM achieved a c-index of 0.57 (95% CI: 0.55–0.59) and 0.62 (95% CI: 0.58–0.66), respectively. Nonlinear machine learning models trained on KS and RAM components achieved greater performance than those baseline models but inferior performance to LB-based models (Fig. 2b). These trends in model performance held in patients not treated with chemotherapy or those starting a new chemotherapy regimen and across multiple cancer types (Fig. 2b and Supplementary Table 5). Thus, LB-based models outperformed KS and RAM in multiple settings.

Testing the LB+ model on the validation and generalizability cohorts resulted in a c-index of 0.73 and 0.67, respectively. The difference in performance in the generalizability dataset is likely attributable to the cancer type homogeneity; when only patients with NSCLC were included in discovery cohort cross-validation, the c-index for the ‘LB+’ model was 0.70 (95% CI: 0.64–0.74) (Fig. 2b and Supplementary Table 5). The c-index of KS was 0.54 in the generalizability cohort and 0.48 (95% CI: 0.45–0.51) in the NSCLC discovery cohort. Together, these results suggest the superiority of an LB-based model over KS and also highlight the sensitivity of the c-index and AUROCs more generally to cohort homogeneity42.

The time-dependent AUROC, precision and recall for 6-month VTE prediction between the ‘All’ and ‘LB+’ models did not differ (Fig. 2c). In contrast, KS had a lower AUC than the ‘LB+’ model (Fig. 2c). cfDNA concentration was the most important feature in predicting VTE in a model with access to all variables (Extended Data Fig. 8a). Risk scores exhibited a wide range within cancer types (Extended Data Fig. 8b) but effectively stratified VTE risk in all cohorts, with the highest-risk patients having a cumulative VTE incidence of over 25% (Extended Data Fig. 8c).

Differences in DNA extraction methods across assays may result in variable cfDNA yields. ctDx Lung samples had correlated cfDNA concentrations with matched MSK-ACCESS samples in our cohort (Extended Data Fig. 9). Adjusting RSF inputs based on a best-fit linear approximation between MSK-ACCESS and ctDx Lung cfDNA concentrations did not yield better model performance (Supplementary Table 5), suggesting that differences in cfDNA extraction methods did not significantly impact model results.

In summary, LB-based models outperformed KS and other clinical models for predicting VTE. A model including LB parameters and minimal clinical data performed similarly to a model including LB parameters and more extensive clinical variables.

Impact of cardiovascular medications by ctDNA strata

Patients may be prescribed anticoagulants for non-VTE-related reasons. In exploratory analysis using our discovery cohort, we sought to test whether ctDNA presence might help stratify patients most likely to benefit from anticoagulation using non-randomized, real-world evidence. Adjusting for age, cancer type and time since diagnosis, all of which may be associated with use of specific cardiovascular medications43, ctDNA+ patients prescribed anticoagulants had lower rates of VTE than those not prescribed anticoagulants (adjusted HR = 0.50, 95% CI: 0.30–0.81; Fig. 3a). In contrast, ctDNA− patients prescribed anticoagulants had no difference in VTE rates from those not prescribed anticoagulants (adjusted HR = 0.89, 95% CI: 0.40–2.0; Fig. 3b).

Fig. 3: Assessing the potential benefit of anticoagulation at time of cohort entry for preventing cancer-associated VTE stratified by ctDNA presence or absence.

a,b, Aalen–Johansen curves for time-to-VTE from time of plasma draw with death as a competing risk with or without anticoagulation (ac) in ctDNA+ patients (a) and in ctDNA− patients (b). Adjusted hazard ratios (aHRs) for ac are from Fine–Gray proportional hazards models adjusted for age, cancer type and time since diagnosis.

Statin use is associated with lower VTE rates in patients with cancer44,45. In the discovery cohort, statin prescription was associated with reduction in VTE rates in patients who were ctDNA+ but not in those who were ctDNA− (Extended Data Fig. 10a,b). In contrast, aspirin prescription, which has equivocal evidence as a VTE-reducing medication46, was not associated with lower VTE rates in either ctDNA+ or ctDNA− groups (Extended Data Fig. 10c,d).

Discussion

The association between cancer and thrombosis was recognized over 150 years ago47. The association between elevated plasma nucleic acid levels and cancer is 75 years old48. More recent studies have linked cfDNA to thrombotic risk25,26,49,50, but no large-scale study has confirmed the clinical validity of ctDNA-based VTE risk stratification.

In the present study, we leveraged the rapid advances of LBs in clinical practice35,51,52 to investigate the relationship among cfDNA, ctDNA and cancer-associated VTE. ctDNA was independently associated with VTE. The association between ctDNA and VTE is largely uncharacterized, although our findings align with those of a small study in patients with prostate cancer with plasma sequenced with a custom panel53. Predictive models leveraging machine learning54,55 based on LBs outperformed KS and other models, including RAM, trained only on clinical, radiographic and laboratory values. High-risk cohorts identified by our model had a cumulative VTE risk approaching 30%, three times higher than in patients with a high-risk KS, for whom current guidelines9,10 recommend anticoagulation. In ctDNA+ patients but not in ctDNA− patients, anticoagulation appeared to lower the risk of VTE, supporting prospective, randomized evaluation of ctDNA-guided prophylaxis or de-escalation.

cfDNA is a thrombogenic component of NETs25,26,49,50, and most cfDNA in LBs is attributable to neutrophils56. In our study, cfDNA was associated with higher VTE rates across cancer types, whereas ctDNA was associated with VTE in many cancer types but not in bladder, hepatobiliary or colorectal cancer. Our finding that ctDNA was not associated with VTE in colorectal cancer is, interestingly, corroborated by a preliminary study of 111 patients with locally advanced rectal cancer, in which no association between ctDNA detection and VTE was observed57.

Together, our findings suggest that ctDNA and non-tumor cfDNA may modulate the hypercoagulable state of malignancy by different means. Although cfDNA may contribute to NET-based coagulation, ctDNA shedding may reflect more aggressive tumor genetic and epigenetic states27,28,31,58 as well as the presence of micrometastatic disease59,60, which may also play a role in clotting61. Extracellular nucleosomes accompany ctDNA62 and, along with other intracellular tumor proteins, may contribute to coagulation63. Further studies including a variety of cancer types are required to determine which of these elements might play a role in the genesis of cancer-associated VTE. Why ctDNA may portend VTE in some cancer types and not in others is complex and deserves further investigation into both tumor biology and host immune response to said biology in shedding versus non-shedding tumors28.

Fewer than 5% of oncologists implement VTE risk assessment tools18. Although uptake can be improved with EHR integration18, such methods do not scale across hospital systems. Similarly, although machine learning models may increase prediction accuracy, particularly when nonlinear relationships exist between variables, such models can be more difficult to implement and interpret. An LB-based model, such as that presented here, may shift the onus of risk stratification from individual providers and hospital systems to assay distributors, for whom genomic sequencing reports and treatment recommendations are already standard deliverables. Because expert guidelines recommend LBs as therapy selection tools64, many patients could, thus, receive effective VTE risk stratification with an LB report with no additional overhead required of the patient or clinician.

VTE may also be a presenting symptom of cancer33,65. Trials to assess whether LBs may aid in the diagnosis of malignancy in patients with idiopathic VTE are ongoing66. Our finding that ctDNA is more frequently detected in patients with prior VTE than in those without further motivates a ctDNA-based approach to cancer screening in this population, although the optimal assay for this purpose is yet to be determined67.

Our study has limitations. Although LBs are increasingly deployed in clinic and are used as biomarkers for a growing number of clinical trials68, they are mostly used at present for targeted therapy matching in certain cancer types, limiting the immediate universality of our VTE prediction approach. DNA LBs are also not universally implemented, and, despite growing adoption across academic and community centers, whether they can and will be used on a global scale remains to be seen. The sequencing assay used in our discovery and validation cohorts used matched white blood cell sequencing to filter germline and clonal hematopoietic mutations. Empirically, we previously found that 11% of ctDx Lung mutations may be attributable to clonal hematopoiesis27; in broader panels without matched white blood cell sequencing, the rate of false-positive mutations may be higher, and this may falsely elevate VTE risk prediction. Our real-world analysis of previous medication administration and VTE rates has many potential confounders, including comorbidities leading to anticoagulation, statin or aspirin use. Randomized studies taking into account the evolving anticoagulation landscape6,7,69,70 are necessary before prophylactic anticoagulation based on LBs can be applied in clinic. Our population had sparse minority representation and a plurality of patients with advanced disease27; generalizability of results should be further studied. Other clinical, genetic21 and blood-based biomarkers22,23 may add orthogonal information and, thus, result in a superior risk model. Future studies integrating these modalities are necessary to determine how to best risk-stratify patients with cancer for VTE.

Overall, our findings suggest that, in patients with cancer, ctDNA is independently associated with higher rates of VTE in a quantitative manner. DNA LB-based models have the potential to predict VTE risk on a spectrum and to be deployed with minimal clinician burden. The use of LBs to guide anticoagulation in patients with cancer merits further validation.

Methods

Cohort selection

We studied three cohorts. The (1) discovery cohort and (2) prospective validation cohort both included patients with any cancer type and plasma MSK-ACCESS sequencing who were evaluated and treated at MSK, an academic cancer center. Patients in the discovery cohort had sequencing between 10 June 2019 and 15 September 2022 with follow-up until 31 October 2022, whereas those in the validation cohort had sequencing between 16 September 2022 and 30 September 2023 and follow-up until 31 October 2023. (3) A generalizability cohort was enrolled including patients with plasma sequenced by a different assay, ctDx Lung, and with diagnoses of stage IV or recurrent NSCLC treated at MSK or GenesisCare (Sydney, Australia), a community-based oncology practice, between 21 October 2016 and 1 November 2020. This cohort was followed until 31 August 2022 (ref. 27). ctDNA testing in all cohorts was administered at the provider’s discretion. Sample size was determined by the number of patients available at date of data cutoff, and no power calculation was used to determine sample size.

This study was independently approved by the institutional review boards (IRBs) of MSK and GenesisCare. MSK patients in all three cohorts were enrolled as part of a prospective observational biospecimen collection and sequencing protocol (NCT01775072). Patients from Sydney in the generalizability cohort were enrolled as part of a prospective observational biospecimen collection and sequencing protocol (‘Genomic profiling in cancer patients’, approved by the GenesisCare/Northern Cancer Institute IRB in 2017). Patients provided written informed consent and were enrolled in a continuous, non-random fashion. Patients were not compensated financially for participation in the study. Patients with prior cancer-associated VTE were excluded from VTE risk analyses (Extended Data Fig. 1). Events were defined as any new pulmonary embolism or lower extremity deep vein thrombosis event, whether incidental or symptomatic.

ctDNA sequencing

Details of the MSK-ACCESS32 and ctDx Lung (Resolution Bioscience, Exact Sciences)71 protocols were published previously. In short, both Clinical Laboratory Improvement Amendments (CLIA)-certified and New York State-approved methods use error-corrected, hybrid capture-based next-generation sequencing to detect mutant DNA in plasma with a validated VAF detection limit of 0.1–0.5% and the ability to call lower VAF mutations in cases of sufficient coverage. MSK-ACCESS and ctDx Lung probes cover 129 and 21 genes, respectively. MSK-ACCESS includes matched white blood cell sequencing to filter germline and clonal hematopoietic variants. ctDx Lung filtering was performed using germline databases as previously described27.

Statistical analysis

We studied time-to-VTE from time of first ctDNA plasma draw, right-censored at time of last follow-up using the Aalen–Johansen estimator with all-cause mortality as a competing risk. Subgroups were compared using cause-specific Fine–Gray regression72. Patients with detectable ctDNA (ctDNA+), defined as any reported ctDNA mutation or copy number alteration, were compared to those without detectable ctDNA (ctDNA−) for rates of VTE. In exploratory analysis to assess the dose dependence of ctDNA on VTE risk, we created Aalen–Johansen survival curves with ctDNA+ patients grouped into one of four quartiles based on the ctDNA VAF.
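To make the competing-risk estimate concrete, the following is a minimal pure-Python sketch of the Aalen–Johansen cumulative incidence for one event type. It is an illustration only and not the code used in this study (which relied on the packages listed under ‘Machine learning models’); the event coding (0 = censored, 1 = VTE, 2 = death) and the toy follow-up times are assumptions.

```python
def cumulative_incidence(times, events, cause=1):
    """Aalen-Johansen cumulative incidence for one cause with competing risks.

    events: 0 = censored, 1 = VTE, 2 = death (hypothetical coding).
    Returns a list of (time, cumulative incidence) at each event time.
    """
    event_times = sorted({t for t, e in zip(times, events) if e != 0})
    event_free = 1.0   # probability of no event of any kind before time t
    incidence = 0.0    # cumulative incidence of the cause of interest
    curve = []
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        d_cause = sum(1 for u, e in zip(times, events) if u == t and e == cause)
        d_any = sum(1 for u, e in zip(times, events) if u == t and e != 0)
        # Only patients still event-free can contribute a new event of interest.
        incidence += event_free * d_cause / at_risk
        event_free *= 1 - d_any / at_risk
        curve.append((t, incidence))
    return curve


# Toy follow-up data in months (invented values).
times = [2, 3, 3, 5, 6, 8, 9, 10]
events = [1, 2, 0, 1, 0, 2, 1, 0]
curve = cumulative_incidence(times, events)
```

Treating death as a competing event rather than as censoring prevents patients who die from being counted as if they could still develop VTE, which would otherwise overstate cumulative incidence.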

We repeated Fine–Gray regression to test for the association of ctDNA with VTE in subgroups by cancer type and, separately, by pathogenically altered gene (as annotated by an FDA-recognized molecular knowledge database73) detected in ctDNA adjusted for cancer type given known associations with specific gene alterations and cancer type74. Using multivariate regression, we assessed the independent association with VTE of ctDNA detection, log10(cfDNA concentration), KS, receipt of cytotoxic chemotherapy within 30 d and number of sites of disease as annotated from radiology reports75,76.

To assess the association of anticoagulation with rates of future VTE in an exploratory manner, we performed Fine–Gray regression77 analyses adjusted for age, time since diagnosis and cancer type (variables thought a priori to be associated with anticoagulation use43), comparing those prescribed anticoagulants (apixaban, rivaroxaban, edoxaban, enoxaparin, warfarin, dalteparin, fondaparinux or dabigatran) at cohort entry to those not prescribed anticoagulants stratified by ctDNA detection.

Machine learning models

We created machine learning models (RSFs78,79) trained on the discovery cohort to predict risk of VTE from time of blood draw using the aforementioned variables as well as demographics, detected alterations at the gene level and specific organ sites of disease as inputs (details below). Models using subsets of the available variables were trained to assess the relative performance of models with certain data. Performance was assessed using Harrell’s c-index and discrimination of VTE events at 6 months using time-dependent80 AUROC, precision and recall. Models were assessed using five-fold cross-validation81. Final models trained on the entire discovery cohort were also tested in the entire held-out validation and generalizability cohorts.
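For readers unfamiliar with the metric, Harrell’s c-index can be sketched as follows. This simplified version is an assumption-laden illustration rather than the implementation used here: it skips pairs with tied event times and treats every non-event as plain censoring.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable pairs in which the higher-risk patient fails first.

    A pair (i, j) is comparable when patient i has an observed event and a
    shorter follow-up time than patient j. Tied risk scores count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # pairs are anchored on a patient with an observed event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example (invented values): one discordant pair out of eight.
times = [2, 4, 6, 8, 10]
events = [1, 1, 0, 1, 0]
risk = [0.9, 0.4, 0.5, 0.3, 0.1]
c_index = harrell_c_index(times, events, risk)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the c-indices reported for KS, RAM and the LB-based models.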

RSFs were trained using pre-assigned hyperparameters (n trees = 1,000; minimum n splits = 10; minimum n samples per leaf = 15). In exploratory secondary analyses, a random hyperparameter grid search to find ‘optimal’ hyperparameters using a 20% holdout for evaluation was conducted (n tree range, 200–2,000; minimum n splits range, 5–20; minimum n samples per leaf range, 5–30; n search iterations = 100; three-fold internal cross-validation for hyperparameter selection); a model trained on optimal hyperparameters did not yield better results (c-index ‘improvement’ of −0.01 using optimal versus pre-assigned hyperparameters). The following variables were considered (grouped by variable type):

KS components

Closest white blood cell count, hemoglobin, platelet count and body mass index (BMI) before plasma draw (continuous), receipt of chemotherapy within 30 d of plasma draw (binary) as well as the cancer types in Fig. 1c as one-hot encoded variables. For patients in whom no laboratory or BMI data were available, the cohort median was imputed (for Sydney patients in whom KS data were not available, this model was not tested).

Tumor sites

The following were encoded as one-hot variables based on tumor presence in a preceding radiology report: Brain, Bone, Liver, Lung, Pleura, Abdomen, Lymph, Adrenal and Other (for Sydney patients in whom data were not available, this model was not tested).

Demographics

White, Black or Asian race, sex (male or female), time since cancer diagnosis (continuous) in days and cancer type as above.

LB

log10 cfDNA concentration (continuous) and log10 ctDNA VAF (continuous). Pathogenic alterations, as annotated by OncoKB, in each gene in the MSK-ACCESS panel altered in >5% of the entire cohort were one-hot encoded.

LB+

All variables in LB as well as cancer type and chemotherapy receipt.

All

All variables above.
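The LB and LB+ variable groups above can be assembled roughly as in the sketch below. All field names are hypothetical, and the floor applied to VAF before the log transform is an assumption made here so that patients without detectable ctDNA have a defined value; the handling used in the study is not specified in this section.

```python
import math
from collections import Counter

VAF_FLOOR = 1e-4  # hypothetical floor so log10 is defined when VAF is 0


def frequent_genes(patients, min_fraction=0.05):
    """Genes with a pathogenic alteration in more than min_fraction of patients."""
    counts = Counter(g for p in patients for g in set(p["altered_genes"]))
    return sorted(g for g, c in counts.items() if c / len(patients) > min_fraction)


def lb_features(patient, genes):
    """LB group: log10 cfDNA concentration, log10 VAF and one-hot gene flags."""
    row = {
        "log10_cfdna": math.log10(patient["cfdna_ng_per_ml"]),
        "log10_vaf": math.log10(max(patient["vaf"], VAF_FLOOR)),
    }
    for gene in genes:
        row[f"gene_{gene}"] = int(gene in patient["altered_genes"])
    return row


def lb_plus_features(patient, genes, cancer_types):
    """LB+ group: LB features plus one-hot cancer type and chemotherapy receipt."""
    row = lb_features(patient, genes)
    for cancer in cancer_types:
        row[f"cancer_{cancer}"] = int(patient["cancer_type"] == cancer)
    row["chemo_within_30d"] = int(patient["chemo_within_30d"])
    return row


# Toy cohort (invented values). A 30% cutoff is used only because the toy
# cohort has four patients; the text above uses >5% of the entire cohort.
cohort = [
    {"cfdna_ng_per_ml": 10.0, "vaf": 0.01, "altered_genes": ["TP53", "KRAS"],
     "cancer_type": "NSCLC", "chemo_within_30d": True},
    {"cfdna_ng_per_ml": 5.0, "vaf": 0.0, "altered_genes": [],
     "cancer_type": "Breast", "chemo_within_30d": False},
    {"cfdna_ng_per_ml": 40.0, "vaf": 0.2, "altered_genes": ["TP53", "KRAS", "STK11"],
     "cancer_type": "NSCLC", "chemo_within_30d": False},
    {"cfdna_ng_per_ml": 8.0, "vaf": 0.05, "altered_genes": ["TP53"],
     "cancer_type": "Pancreatic", "chemo_within_30d": True},
]
genes = frequent_genes(cohort, min_fraction=0.3)
row = lb_plus_features(cohort[0], genes, ["NSCLC", "Breast", "Pancreatic"])
```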

Models were evaluated in the MSK-ACCESS cohort by five-fold cross-validation. Models trained on the entire MSK-ACCESS cohort were evaluated in the ctDx Lung MSK cohort. For LB and LB+ models (in which all data were available), the Sydney cohort was used as an additional validation cohort. In the ‘adjusted cfDNA’ model, ctDx Lung cfDNA concentrations were transformed according to the linear equations described in Extended Data Fig. 9 before being input to the model.

In validating the model’s utility to predict VTE within 6 months of plasma draw, precision and recall were reported at an ‘optimal’ point minimizing false-positive rate and maximizing true-positive rate.

Analyses were performed in Python 3.10.11 using the lifelines 0.26.0 and sksurv 0.20.0 packages or in R 3.6.1 using the cmprsk 2.2-11 and survivalROC 1.0.3.1 packages.

Data capture

VTE events in the discovery and validation cohorts were annotated using the CEDARS (https://cedars.io) and PINES (https://pines.ai) natural language processing (NLP) packages to identify patient records with candidate thromboembolic events and were manually confirmed by chart review of clinician notes and diagnostic scans82,83. An in-depth discussion of the methodological approach and a link to the code base repositories for the most recent versions of the two packages can be found on their respective web pages.

Clinical notes and radiology reports from 1 year before cohort entry up to the date of censoring were included for all patients. Those documents were first processed through the CEDARS pipeline. Individual sentences were retained along with their corresponding documents if they matched the following query:

‘dvt OR pe OR vte OR thrombos* OR thrombus OR thrombi OR thrombotic OR clot OR *embol* OR *phlebitis’
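To make the wildcard semantics concrete, the query can be approximated with a regular expression. This is a sketch using Python's standard `re` module, not the CEDARS query engine; case-insensitive, whole-token matching and the helper names are assumptions.

```python
import re

QUERY = ("dvt OR pe OR vte OR thrombos* OR thrombus OR thrombi OR thrombotic "
         "OR clot OR *embol* OR *phlebitis")

def compile_query(query):
    """Translate an OR-separated keyword query with '*' wildcards into one regex."""
    patterns = []
    for term in query.split(" OR "):
        # Escape the term, then let each '*' stand for any run of word characters.
        patterns.append(re.escape(term.strip()).replace(r"\*", r"\w*"))
    # \b anchors restrict each term to whole tokens (assumed behavior).
    return re.compile(r"\b(?:" + "|".join(patterns) + r")\b", re.IGNORECASE)

PATTERN = compile_query(QUERY)

def keep_sentence(sentence):
    """True if the sentence contains at least one query term."""
    return PATTERN.search(sentence) is not None
```

Under these assumptions, `keep_sentence("No evidence of DVT.")` is true, whereas a sentence containing "people" but no query term is not retained, because "pe" must match as a whole token.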

Each document was then labeled with an estimated probability of having been written at or after a VTE event, using a Longformer model (https://huggingface.co/allenai/longformer-base-4096) previously fine-tuned using the PINES methodology. The document-level probability threshold used for this work corresponded to a sensitivity of 95% for detecting cancer-associated thrombosis events in model validation. Only sentences associated with the documents selected at this threshold were retained and presented to a physician reviewer for assessment with the CEDARS graphical user interface.

The interface presents one sentence at a time with its associated clinical note or radiology report in the same screen view. The human reviewer enters an event date as indicated, after which the application automatically moves on to the next patient. The EHR can be consulted separately if additional information is needed. A board-certified hematologist (S.M.) oversaw the discovery and generalizability cohort annotation process; validation cohort events were primarily annotated by a board-certified oncologist (J.J.). Once the entire cohort has been reviewed, CEDARS generates a table including patient identifiers, event dates, included document dates and text for selected sentences. This dataset can be audited and used as is for time-to-event analyses.

The current CEDARS methodology was used uniformly for the discovery and validation cohorts. For the generalizability cohort, an earlier version of CEDARS was used for the years 2017–2019, and events for patients accrued in 2016 were annotated as previously described (ref. 34). Events for the Sydney ctDx cohort were annotated manually by physicians (A.L. and L.G.). Curators were blinded to ctDNA status during curation.

NLP data audits

Two hundred patients were randomly selected from each of three MSK cohorts:

  1. MSK IMPACT cohort (2014–2019, n = 35,391, ref. 34), including 361 patients from the generalizability cohort

  2. Discovery cohort

  3. Validation cohort

Clinical notes and radiology reports were assessed manually for each known VTE case from the original datasets to confirm events detected with the CEDARS + PINES NLP platform. Hematology and anticoagulation clinic notes were reviewed for all patients retained in the audit to look for VTE events potentially missed by NLP. All patients in the audit were assessed for International Classification of Diseases 9 (ICD-9) and International Classification of Diseases 10 (ICD-10) codes potentially revealing a qualifying VTE event during the observation period. Codes used are shown in Supplementary Table 6.

All patients with a VTE ICD code but without a VTE event found with NLP were reviewed manually, using individual notes and radiology reports. Recall (also known as sensitivity) and precision (also known as positive predictive value) were calculated. Results of the audit for those three cohorts are shown in Supplementary Table 7. ICD code review did not reveal any qualifying VTE event missed with NLP. The manual review of hematology/anticoagulation clinic notes revealed one missed case due to data entry oversight (human error). Two false-positive events were uncovered. Combining the three audit cohorts, the overall recall for NLP was 99%, and the overall precision was 98%.
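For reference, recall and precision follow directly from the audit's confusion counts. The sketch below uses toy counts chosen for illustration; they are not the counts from Supplementary Table 7.

```python
def recall_precision(tp, fp, fn):
    """Recall = TP / (TP + FN); precision = TP / (TP + FP)."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return recall, precision

# Toy counts (NOT the study's audit counts): 45 confirmed events found by NLP,
# 5 NLP events rejected on review, 5 true events missed by NLP.
rec, prec = recall_precision(tp=45, fp=5, fn=5)
```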

Organ sites with tumor involvement were automatically extracted from the EHR using previously validated NLP methods75. A Bidirectional Encoder Representations from Transformers (BERT) model was trained and validated on a manually annotated corpus of 31,455 reports using an 80:20 train:test split. Annotation algorithms used for the analysis presented here had an average AUC of 0.981 and micro-average precision/recall of 87.5/89.6. This approach was shown to have higher recall than structured data approaches, such as those based on billing codes alone.

Structured data were obtained from our EHR as follows: demographics were self-reported; medications, including antineoplastics, were derived from electronic prescription records; cancer stage and time of diagnosis were derived from cancer registry data; and laboratory data were obtained from an institutional laboratory medicine database.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.