Main

High-throughput sequencing technologies enable analyses of various patient characteristics, such as genetic variation1, DNA methylation2, gene expression3, gut microbiota4 and adaptive immune receptor repertoires (AIRRs)5. Such molecular and biological markers (biomarkers), defined as objective indications of the medical state that can be accurately and reproducibly measured6, hold great promise for machine learning (ML) disease diagnostics2,3,5. However, several challenges exist to using ML: study participants are selected based on availability (‘convenience sampling’), and data collected at multiple locations or distinct time points are combined, which may introduce systematic differences between datasets. A failure to account for such differences (for example, measurement errors or batch effects) can introduce selection and confounding biases that lead to models failing in real-world applications, despite showing promising performance during diagnostic development7,8,9,10. Finally, sequencing data are typically high-dimensional, making it more challenging to disentangle noise and biases from the true markers of the disease.

ML approaches examine these challenges from a purely statistical perspective by anticipating how the distributions of features or labels will change (a phenomenon called ‘dataset shift’ or ‘distributional shift’) and include domain adaptation11,12 (when some information about the target domain is available) and domain generalization techniques13,14,15 (where the target domain is unknown, but (multiple) source domains are available). Traditionally, these approaches do not consider causal relations between features and labels.

More recently, the causal inference framework16,17,18 has also been applied to describe dataset shifts using formal definitions with respect to proposed causal models of the underlying processes19,20. Do-calculus16, the fully non-parametric CI framework described by Pearl (Box 1), can be used to estimate causal effects from non-experimental data whenever the effect is identifiable under a given causal model. The structure of the causal model is typically encoded using a (directed acyclic) causal graph, where the nodes represent the variables of interest, and the directed edges between the nodes represent direct causal relationships.

One way to obtain the causal structure of a biological process is from domain knowledge. However, this is challenging due to the complex and unknown nature of disease mechanisms. Alternatively, obtaining a causal structure can be approached by (1) learning the structure from data using causal structure learning21,22, or (2) learning only the part of the causal model that results in robust (or invariant) predictions across different environments23,24,25,26,27. Causal structure learning typically attempts to infer the complete structure, which is difficult due to the high-dimensional nature of the problem, yet it can be applied to a single set of observational data. Invariant prediction is more focused: it attempts to identify features producing stable predictions for the variable of interest under general interventions (that is, different environments). However, it relies on multiple datasets generated under a sufficiently diverse set of interventions. When building the final ML model, accounting for the inferred causal structure should improve robustness to various dataset shifts7,19,28.

In addition to potentially improving the performance of learned models, causal inference can help to formally or intuitively analyse diagnostic robustness across application contexts. Causal inference might also be useful when combining multiple datasets sampled under heterogeneous conditions to answer a probabilistic (or causal) query of interest, thus dealing with biases emerging due to environment change, confounding and participant selection29. This motivates considering the causal perspective as an essential component for diagnostics30, medical image analysis8, decision-making in healthcare31 and the clinic in general32.

When analysing sequencing data to discover molecular biomarkers of disease, three possible underlying causal structures may connect a biomarker with a disease. The biomarkers from sequencing data may be causing the disease, may arise as an effect of the disease, or the biomarkers and the disease may both have a common cause. Disregarding the common cause scenario for now, this means that the diagnostic prediction may be in a causal direction (predicting the effect from causes; for example, finding changes in the sequencing data that have played a role in causing a disease) or in an anticausal direction33 (predicting causes from effects; for example, finding differences in the sequencing data that occurred as a consequence of the disease, Fig. 1c). Depending on this direction, dataset shifts will manifest in different ways. For example, when predicting in the causal direction (X → Y), we might expect the performance to be more stable under changes to P(X) because our target, P(Y|X), is a component in the causal factorization P(X, Y) = P(X)P(Y|X) and thus independent of P(X) due to the principle of independent causal mechanisms27. On the other hand, we would expect no such robustness under the anticausal direction (X ← Y) because P(Y|X) does not follow the causal structure.
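The contrast between the causal and anticausal directions can be made concrete with a minimal sketch for binary X and Y. All probabilities below are made-up illustrative values, and the function names are hypothetical. In the causal case the mechanism P(Y|X) is held fixed while P(X) shifts; in the anticausal case the mechanism P(X|Y) is held fixed while the prevalence P(Y) shifts, which also changes P(X) and, through Bayes' rule, P(Y|X).

```python
# Illustrative numbers only; none of these probabilities come from the article.

def causal_conditional(p_x1, p_y1_given_x):
    """Causal direction X -> Y: build P(X, Y) = P(X) P(Y|X),
    then read P(Y=1 | X=x) back off the joint."""
    cond = {}
    for x in (0, 1):
        p_x = p_x1 if x == 1 else 1.0 - p_x1
        joint_y1 = p_x * p_y1_given_x[x]
        cond[x] = joint_y1 / p_x
    return cond


def anticausal_posterior(p_y1, p_x1_given_y):
    """Anticausal direction Y -> X: P(Y=1 | X=x) via Bayes' rule
    from the prevalence P(Y) and the fixed mechanism P(X|Y)."""
    post = {}
    for x in (0, 1):
        like_y1 = p_x1_given_y[1] if x == 1 else 1.0 - p_x1_given_y[1]
        like_y0 = p_x1_given_y[0] if x == 1 else 1.0 - p_x1_given_y[0]
        joint_y1 = p_y1 * like_y1
        joint_y0 = (1.0 - p_y1) * like_y0
        post[x] = joint_y1 / (joint_y1 + joint_y0)
    return post


# Causal case: the mechanism P(Y|X) is fixed, P(X=1) moves from 0.3 to 0.7.
mechanism_y_given_x = {0: 0.1, 1: 0.8}
causal_source = causal_conditional(0.3, mechanism_y_given_x)
causal_target = causal_conditional(0.7, mechanism_y_given_x)

# Anticausal case: the mechanism P(X|Y) is fixed, prevalence moves from 0.5 to 0.1.
mechanism_x_given_y = {0: 0.2, 1: 0.9}
anticausal_source = anticausal_posterior(0.5, mechanism_x_given_y)
anticausal_target = anticausal_posterior(0.1, mechanism_x_given_y)

print("causal     P(Y=1|X):", causal_source, causal_target)
print("anticausal P(Y=1|X):", anticausal_source, anticausal_target)
```

Under these assumed values the prediction target P(Y|X) is unchanged in the causal case, whereas in the anticausal case P(Y=1|X=1) falls from about 0.82 to about 0.33, so a classifier calibrated on the source population would be miscalibrated in the target population.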

Fig. 1: Developing an AIRR-based diagnostic.

a, Overview of the diagnostic pipeline based on AIRRs, including patient collection from the source population, sampling, sequencing with batch effects, ML method development, and application in the target population. b, An example of a causal structure of AIRRs and the immune state, where the nodes represent the different variables involved (repertoire, HLA type, age, immune state) and the arrows represent causal relationships between variables. Filled nodes in the graph (immune state, HLA type, sequenced AIRR) denote observed variables, and open nodes are not observed (prior immune state). The node with S inside is the selection node, with edges showing what variables influenced the selection of participants for the study. c, For AIRR-based diagnostics, predictions may be either made in the causal direction (predicting the effect from the cause, for example, in autoimmunity) or in the anticausal direction (predicting the cause from the effect, for example, in infections), making the models predicting in the anticausal direction potentially less stable because they are not modelling the biological mechanism.

Because of the opportunities for ML in diagnostics, we build on existing literature on causal inference and ML20,27 and focus on complex, real-world applications in the field of adaptive immune receptor repertoires (AIRRs), which are increasingly used for diagnostic purposes5,34. AIRRs are high-dimensional molecular markers reflecting an individual’s past and present immune responses and can be obtained from targeted high-throughput sequencing from a blood sample (Fig. 1a and Box 2). AIRR-based approaches may enable earlier diagnosis and prognosis, complement existing diagnostic tests, and have, in principle, the capacity to diagnose a broad range of diseases with a single test5. However, validation studies on external cohorts (for example, expanding on existing approaches35) are needed to establish the robustness of existing models.

In the following sections, we define and discuss different challenges in the study design for biomarker discovery, including confounders, batch effects, selection bias, variability of causal models across diseases, and the high dimensionality of molecular data. We present a simulation study to illustrate these concerns and conclude with a proposal of reporting standards and study design advice, finally outlining open questions in the field.

Challenges in AIRR diagnostics study design

The main challenge of ML in diagnostic settings is whether the probability distributions learned from the training data will generalize to new application settings.

The set of examples (study participants) available at training time will be called the study sample (or study cohort), sampled from the underlying source (development) population (Fig. 1a). The population where the classifier will be applied is the target (deployment) population (environment or domain). If the source and target populations have the same joint probability distribution (disease prevalence and feature distribution, and the relations between them stay the same) and examples (for example, AIRRs) are independent and identically distributed (i.i.d.), the estimated ML model can be readily applied to the target population, provided that the model is internally valid (Box 3).

Although statistically convenient, the i.i.d. assumption rarely holds in the real world36—the probability distribution might change from source to the target population in the marginal or conditional distribution of variables. Marginal distributions may vary due to label shift (for example, change in disease prevalence) or covariate shift (for example, change in age distribution). The conditional distribution of variables may change if it describes an anticausal relation33 (when predicting the cause from the effect, for example, immune state from AIRR; Fig. 1c) or due to the occurrence of unstable mechanisms7 (for example, changing the time of sequencing in the course of the disease might result in estimates that only hold for the study cohort). Importantly, these shifts reflect systematic biases that would hold up even if a study cohort was infinitely large, and their extent cannot be quantified based only on information from the source population. The biases may arise from different aspects of the data-generating process, and when related to both AIRR and the immune state, they might introduce spurious correlations.

To illustrate these concerns, we provide an overview of AIRR-based diagnostic development (Fig. 1a). The study cohort is selected, and the targeted cell population (for example, T cells) is DNA-sequenced and analysed using ML. We also introduce an example of an AIRR-based diagnostic for a viral infection (Fig. 1b). In this example, the immune state is defined as the presence of the pathogen that (causally) changes AIRR. In addition to the immune state, previous immune events (for example, infections or vaccinations), age37, sex38, genetics (including the V(D)J recombination model39), the environment (for example, geographical location) and human leucocyte antigen (HLA)40,41 also influence the AIRR. Finally, the observed sequencing data reflect only a limited proportion of a patient’s full AIRR and introduce sampling variability. The experimental protocol may also introduce systematic biases in terms of which receptors are captured42,43,44, which is especially problematic if the experimental protocol varies in a way that correlates with other patient variables. The causal graphs might differ for different types of disease.

How confounders affect the analysis

Age, sex and genetic background influence the immune repertoire. Repertoire diversity decreases with age37,45, and sex affects the usage of V genes in T cell receptor (TCR) repertoires38. Sex and age also influence the immune state through innate immunity46,47, representing potential confounders in disease diagnostics (Box 1). Environment, broadly defined as a proxy for socioeconomic background and geographic location, may also be a confounder. Unlike age or sex, it is typically unobserved in AIRR studies (unmeasured or hidden confounder).

For predictive purposes, confounding is not generally problematic and might even improve the performance of the model48,49. However, the recovered biomarkers could reflect the confounders as much as the immune state9. If the aim is to gain insight into the underlying biological process (or estimate causal effects), confounding should be controlled for. Additionally, if the source and target populations differ, confounder distributions or functional relations may change, potentially changing the perceived (non-causal) relationship between the immune state and AIRRs.
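A small exact calculation may help to show how a confounder can create a non-causal association that changes between populations. The sketch assumes a binary confounder C (for example, an age group) that influences both a binary AIRR feature F and the immune state Y, with no edge between F and Y. The function name and all probabilities are hypothetical.

```python
# Illustrative numbers only; the graph is C -> F and C -> Y, with no F-Y edge.

def p_state_given_feature(p_c1, p_y1_given_c, p_f1_given_c):
    """Return P(Y=1 | F=f) for f in {0, 1} by summing over the confounder C."""
    result = {}
    for f in (0, 1):
        numerator = 0.0
        denominator = 0.0
        for c in (0, 1):
            p_c = p_c1 if c == 1 else 1.0 - p_c1
            p_f = p_f1_given_c[c] if f == 1 else 1.0 - p_f1_given_c[c]
            numerator += p_c * p_y1_given_c[c] * p_f
            denominator += p_c * p_f
        result[f] = numerator / denominator
    return result


p_y1_given_c = {0: 0.2, 1: 0.8}   # confounder raises disease risk
p_f1_given_c = {0: 0.1, 1: 0.9}   # confounder also raises feature frequency

# Source population: both confounder groups equally common.
source = p_state_given_feature(0.5, p_y1_given_c, p_f1_given_c)
# Target population: only one confounder group is present.
target = p_state_given_feature(1.0, p_y1_given_c, p_f1_given_c)

print("source P(Y=1|F):", source)
print("target P(Y=1|F):", target)
```

With these assumed values the feature looks strongly predictive in the source population (0.74 versus 0.26) even though it has no causal link to the immune state, and the association disappears entirely in the target population (0.8 for both feature values).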

Batch effects and timing of measurements

Batch effects are systematic biases connected to experimental protocols exhibiting different behaviour across conditions, with a certain level of bias always being present in the sequencing (molecular) data50. If batch effects are independent of the labels to be predicted, they will not introduce any bias in the learned ML models, leading only to increased variability of the biomarker. Batch effects are more problematic when correlated with the label (for example, immune state), because a predictive model may then perform well in the study cohort by learning batch effect associations. Such a model would fail when applied to a population where the batch effect is differently associated with the disease. The same concern applies when multiple datasets need to be combined for a study. Batch effects in AIRRs42,43,44,51 might manifest through sequencing errors, differences in gene usage between protocols, the sensitivity of detecting rare receptors, and skewed diversity capture42.

The timing of measurement, that is, when the sequencing is performed in the course of the disease, also affects diagnostic development. If positive examples (diseased individuals) in the dataset are collected after individuals have received treatment, the collected AIRRs will not be representative of the AIRRs of individuals who will get tested for diagnosis. To mitigate this, the study cohort should be representative of the target population in terms of the timing of measurement or include individuals sequenced across the disease progression spectrum. Alternatively, different disease stages could be modelled separately, making the task a multiclass classification problem.

Selection bias and choosing participants for a study

Selection bias is defined slightly differently in causality and ML. In causality, selection bias is any statistical association resulting from selective inclusion into the study cohort (for example, participants are recruited based on some of the variables of interest in the analysis)52. For example, if only individuals with symptoms are tested for a diagnostic of a viral infection, the study cohort is not representative of the source population as it ignores asymptomatic individuals. Selection bias does not depend on the cohort size53 and can introduce, increase, decrease or even reverse the sign of existing associations54. Given a causal graph, selection bias may be present whenever the data-collection process depends on the cause and effect or the parents in the causal graph55 (Fig. 2).

Fig. 2: Examples of selection bias in causal models.

a–c, Selection bias may occur by selecting based on a collider (a), because a confounder influences selection (b) or by selecting based on the effect variable (c).
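The collider case can be illustrated with a short exact calculation. The sketch assumes two independent binary variables, X (say, an AIRR feature) and Y (the immune state), and a selection indicator S that depends on both, for instance because participants are more likely to be recruited if either is present. The function name and the selection probabilities are hypothetical.

```python
# Illustrative numbers only; X and Y are independent before selection.

def p_y_given_x_selected(p_y1, p_select):
    """Return P(Y=1 | X=x, S=1) for x in {0, 1}.

    p_select[(x, y)] is P(S=1 | X=x, Y=y). Because X and Y are independent,
    P(Y=1 | X=x, S=1) is proportional to P(Y=1) * P(S=1 | x, Y=1).
    """
    result = {}
    for x in (0, 1):
        selected_y1 = p_y1 * p_select[(x, 1)]
        selected_y0 = (1.0 - p_y1) * p_select[(x, 0)]
        result[x] = selected_y1 / (selected_y1 + selected_y0)
    return result


# No selection: everyone is included with probability 1.
no_selection = {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 1.0}
# Collider selection: inclusion is likely if X or Y is present, unlikely otherwise.
collider_selection = {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.9}

unselected = p_y_given_x_selected(0.5, no_selection)
selected = p_y_given_x_selected(0.5, collider_selection)

print("population   P(Y=1|X):", unselected)
print("study cohort P(Y=1|X):", selected)
```

In the full population X carries no information about Y (0.5 in both groups), but within the selected cohort P(Y=1|X=0) rises to 0.9 while P(Y=1|X=1) stays at 0.5. The induced association is a property of the selection mechanism and would not shrink with a larger cohort.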

In ML, selection bias has a less structural definition: it occurs when there is a difference in the marginal distribution of any variable used for prediction (covariate shift56) between the study cohort and the source population, or a difference in the marginal distribution of the label (such as the disease status, resulting in label shift57). These definitions do not rely on causal graphs. However, considering the underlying causal models can improve ML analysis by specifying how to recover from biases under given assumptions.

Closely related to selection bias is the concept of transportability29,58 (related to external validity; Box 3). Transportability bias occurs when moving from a source population to a distinct target population, where the target and source populations are at least partially non-overlapping59. An example might be an AIRR-based diagnostic built in Norway (source population) that needs to be applied in Serbia (a target population that might arbitrarily differ in the marginal distribution of variables, such as HLA or age), where the causal mechanisms of the disease (the influence of the pathogen on AIRRs given HLA and age) remain stable.

HLA can take on alternative causal roles

Depending on the disease, biological variables can have different roles in the causal graph, with important implications for the analysis. For example, HLA molecules present peptides derived from pathogens to T cells, and thus shape the T cell repertoire. HLA also influences the TCR repertoire during positive and negative selection during T cell maturation in the thymus60. HLA can therefore affect the TCR repertoire composition of both naive and antigen-experienced T cells, making HLA an important variable in diagnostic development.

Depending on the assumptions of the causal model describing the disease, the role of HLA in the analysis will differ. For viral infections, HLA is considered a precision covariate (Fig. 3a)—it will influence the AIRR but not the immune state. Theoretically, adjusting for it will not resolve any bias, but it might improve the precision of the diagnostic. In practice, this might depend on the amount of data available for different HLAs.

Fig. 3: The different causal roles HLA can take for different types of immune-related disease.

a, In a viral infection, HLA is a precision covariate. b, In coeliac disease, HLA is both a confounder and a moderator. c, In cancer, HLA can be a mediator.

HLA might be a confounder in AIRR-mediated autoimmunity, where AIRR causes the immune state (Fig. 3b). Additionally, HLA can act as a moderator (Box 1), for example, in coeliac disease. Coeliac disease is an autoimmune condition occurring due to gluten consumption. For the disease to occur, the subjects have to both carry specific HLA allotypes (HLA-DQ2 or HLA-DQ8) and have gluten-specific TCRs61. Therefore, exploiting HLA information might be necessary to develop a diagnostic for this disease.

In some cancers, tumour cells have somatic mutations that affect peptides binding to HLA and help tumour cells evade immune recognition62. In this case, HLA acts as a mediator between disease and the AIRR (Fig. 3c) and, as such, does not need to be adjusted for in the analysis when developing a diagnostic.

High dimensionality of AIRR data

Building diagnostics based on AIRRs (or other molecular data) is made more challenging by their high dimensionality. In this Perspective, we have represented AIRRs by a single node (variable) in the causal graph, which then represents millions of individual AIRs.

AIRRs consist of a large number of sequences that are mostly non-overlapping between individuals, with very few of them specific to any one disease63 (Box 2). Some approaches represent AIRRs by their observed sequences, physicochemical properties or summary statistics. Alternatively, it is possible to learn an AIRR representation. As the sequence specificities are primarily unknown, self-supervised representation learning methods might be the most suitable: pretraining methods64, fitting generative models that learn the data distribution with latent variables being used in downstream tasks65, or imposing constraints on the learned representation space via alternative labels or training tasks66,67,68. To ensure the robustness of representations learned in this manner, the data may come from multiple distributions25,69,70, or algorithms that try to infer latent causal variables may also be used27. However, the interpretability of the learned representations and their relations to the causal model remain a challenge20.

Although causality for ML in the high-dimensional setting of medical imaging8 might have some parallels with molecular datasets, the imaging causal models differ substantially from molecular biomarkers, posing the question of exactly how the different biases discussed earlier manifest in high-dimensional sequencing data.

So far, we have used AIRRs to refer to both single- and paired-chain TCR and B cell receptor repertoires, although often only single-chain receptors are sequenced. The causal model could be extended to include paired-chain TCR and B cell receptor repertoire data as separate variables depending on the research question. Each of these receptor populations might be further split for causal modelling to allow for complex interactions between different cell subtypes or localizations.

Simulation study for AIRR diagnostics study design

To illustrate the influence of different variables in the causal model on the performance of ML algorithms for diagnostics, we performed three simulation experiments where we systematically varied the causal parameters. In the first experiment, we trained a model to predict the immune state based on AIRRs without taking confounders (for example, age or sex) into account. We showed that, with a changing confounder distribution and keeping the classes balanced, the performance (measured by balanced error rate) might drop substantially (Fig. 4a,b). In the second experiment, we showed how selection bias may lead to poor performance on an independent target population due to spurious correlations (Fig. 4c,d). In the third experiment, we contrasted the handling of batch effects for the AIR setting against a different molecular biomarker where batch effect handling was more established. We showed that batch effects might lead to higher error rates and result in classifiers learning spurious signals, especially in AIRR settings (Fig. 5). A description of the experiments is provided in the Supplementary Information.

Fig. 4: Experiments showing immune-state prediction performance under different causal models.

a–d, Results are shown for 500 AIRRs for training and validation and 500 for testing, with 500 TCRβs per repertoire across 30 repetitions. The median values of the balanced error rate are shown in the plots. The immune signal indicative of the immune state consisted of 3-mers implanted approximately in the middle of the receptor sequence. For panels b and d, 3-mer frequencies and logistic regression were used. a, Causal model for experiment 1. The yellow arrow denotes that the confounder distribution changes between the training (validation) and test population. The immune state is balanced in both populations. b, The balanced error rate when the training (validation) and test sets originate from the same distribution (the confounder distribution is stable), when the distribution changes slightly (P(confounder)_validation = 0.6, P(confounder)_test = 0.8), and when the confounder distribution substantially changes (P(confounder)_validation = 0.8, P(confounder)_test = 0.01). c, The causal model for experiment 2. The edges in yellow denote the relations modified in the experiment. d, The balanced error rate when the selection bias is present in the training population but not in the test population.
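The two ingredients named in this caption, a 3-mer frequency representation of a repertoire and the balanced error rate, can be sketched as follows. This is an assumption about one plausible encoding, not the pipeline used for the experiments (which is described in the Supplementary Information); the function names and the example sequences are hypothetical.

```python
from collections import Counter


def kmer_frequencies(sequences, k=3):
    """Encode a repertoire as relative frequencies of overlapping k-mers,
    pooled over all receptor sequences in the repertoire."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def balanced_error_rate(y_true, y_pred):
    """1 - (sensitivity + specificity) / 2 for binary labels in {0, 1}.
    Assumes both classes occur at least once in y_true."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 1.0 - 0.5 * (sensitivity + specificity)


# Toy repertoire of two made-up amino acid sequences.
freqs = kmer_frequencies(["CASSL", "CASSF"], k=3)
ber = balanced_error_rate([1, 1, 0, 0], [1, 0, 0, 0])
print(freqs)
print(ber)
```

Because the balanced error rate averages the per-class error rates, it is insensitive to class imbalance, which makes it a natural choice when comparing populations whose prevalence may differ.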

Fig. 5: Batch effects may lead to higher error rates in transcriptomic and AIR settings and might result in classifiers learning spurious signals, especially in the AIR setting.

a–d, Results are shown for three scenarios: batch effects present but uncorrected, corrected batch effects using linear regression on k-mer frequencies, and no batch effects in the data (control). Median values are shown throughout. a, Transcriptomic classification performance, where uncorrected batch effects lead to the highest balanced error rate. b, The number of enriched genes detected by the L1-regularized logistic regression model and their overlap with the simulated biological signal. c, AIR classification performance using k-mer frequencies for data representation and logistic regression, where uncorrected batch effects lead to the highest error rate, although the difference is small. d, The recovered 3-mers from the logistic regression model are grouped by how much they overlap with the 3-mers of the immune signal across the three scenarios. The recovered 3-mers from the model are obtained as the features corresponding to the 30 largest coefficients in the L1-regularized logistic regression model, in absolute values.
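One simple reading of "linear regression on k-mer frequencies" is to regress each k-mer frequency on a categorical batch indicator and keep the residuals. For a single categorical covariate, the ordinary least squares fit equals the per-batch mean, so the correction reduces to re-centring each batch. The sketch below is written under that assumption and may differ from the procedure in the Supplementary Information; the function name and data are hypothetical.

```python
def regress_out_batch(values, batches):
    """Remove batch-specific offsets from one feature.

    Equivalent to an OLS regression of the feature on a one-hot batch
    indicator: the fitted value is the batch mean, and the corrected value
    is the residual shifted back to the overall mean.
    """
    grand_mean = sum(values) / len(values)
    sums, counts = {}, {}
    for value, batch in zip(values, batches):
        sums[batch] = sums.get(batch, 0.0) + value
        counts[batch] = counts.get(batch, 0) + 1
    batch_means = {batch: sums[batch] / counts[batch] for batch in sums}
    return [value - batch_means[batch] + grand_mean
            for value, batch in zip(values, batches)]


# Frequency of one hypothetical 3-mer in four repertoires from two batches.
kmer_freq = [0.10, 0.12, 0.20, 0.22]
batch_ids = ["A", "A", "B", "B"]
corrected = regress_out_batch(kmer_freq, batch_ids)
print(corrected)
```

This kind of correction is only safe when batch is not correlated with the immune state. If all positive examples sit in one batch, re-centring would remove the biological signal together with the batch offset.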

Conclusion

Advice on AIRR study design and computational processing

We propose the following guidelines to learn AIRR-based biomarkers that generalize well to clinical settings. (1) Ensure that batch effects, although nearly always present, only influence the observed AIRRs and are not correlated with the immune state. Avoid systematically different protocols and recruitment periods across different labels. (2) Internal validity, occurring when the targeted probability distribution is learned instead of noise (Box 3), has to be achieved through appropriate assessment strategies and sufficiently large study cohorts71. Total cohort sizes are not the main focus of this Perspective because small cohorts would only increase the variability of estimates but not introduce bias per se. However, an insufficient number of participants makes it hard to achieve internal validity and, for domain adaptation or transfer learning, to determine whether there are true systematic differences between settings. Additionally, recruiting a sufficient number of participants for each confounder value group is necessary. (3) Avoid selection biases that may introduce spurious associations when recruiting study participants. One exception to avoiding selection biases is when they are deliberately introduced (and compensated for) to enrich signals for ML, for example, by balancing the classes when training a prediction model. Furthermore, in the case where the target population is known to differ from the source (training) population in a variable that has a major influence on the AIRRs or immune state, we advise considering techniques such as pretraining that might help with dataset shifts72, using data from multiple environments, if available, to obtain more robust representations25,69,70, and exploring importance weighting73. We illustrate how failing to follow these recommendations might influence the prediction task in worked examples (Fig. 4).
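Importance weighting, mentioned above as one option under dataset shift, can be sketched for a discrete covariate such as an age group. Each source example is weighted by the ratio of target to source group proportions, so that weighted averages over the study cohort estimate what would be observed in the target population. The sketch assumes that the target proportions are known and that every target group occurs in the source sample; the function names and numbers are hypothetical.

```python
from collections import Counter


def importance_weights(source_groups, target_proportions):
    """Weight for each source example: P_target(group) / P_source(group)."""
    n = len(source_groups)
    source_counts = Counter(source_groups)
    return [target_proportions[g] / (source_counts[g] / n) for g in source_groups]


def weighted_mean(values, weights):
    """Self-normalized weighted average."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Study cohort: 8 young and 2 old participants, with a per-example loss.
groups = ["young"] * 8 + ["old"] * 2
losses = [0.1] * 8 + [0.5] * 2
# Assumed target population: mostly old.
target_props = {"young": 0.2, "old": 0.8}

weights = importance_weights(groups, target_props)
naive_estimate = sum(losses) / len(losses)
shift_aware_estimate = weighted_mean(losses, weights)
print(naive_estimate, shift_aware_estimate)
```

Here the unweighted cohort average (0.18) understates the loss expected in the target population (0.42). The same weights could be passed to a learner that accepts per-example weights, although very large weights for rare groups inflate the variance of the estimate, which is one reason why sufficient participants per confounder group are needed.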

Proposed reporting standards for AIRR diagnostic study design

We propose the following reporting standards to increase the trustworthiness of AIRR diagnostic studies and ensure their applicability in future use cases, such as meta-analyses where multiple studies are examined together to answer the research question better. (1) Report the sets of AIRR samples that have been processed together in batches. (2) For each AIRR, provide information on recruitment source, experimental protocol and institution. (3) If external validity is anticipated, define the target (deployment) setting where the diagnostic could be applied. (4) Report metadata, including sex, age, HLA and similar properties outlined by the MiAIRR standard74,75. Results per stratum should be provided for any variable considered to have a major impact on AIRR and immune state (consult the state of the art in the AIRR field and disease field at the time of publication). Include information on genetic ancestry and aim to cover diverse patient cohorts76,77. Additionally, reviewing study protocols in advance, for example, through Registered Reports78, may alleviate some of the concerns described previously.

Suggested research directions for the AIRR field

One major open question is how the HLA influences AIRRs5,34,41,79. Strong correlations between HLA and the CDR3 regions of TCRs have recently been observed, indicating that HLA risk allotypes might increase the frequency of autoreactive TCRs already during T cell development41. From a diagnostic perspective, the HLA influence can be seen as two sub-questions: (1) the degree to which HLA leaves a detectable mark in the overall AIRR that can be leveraged to capture the disease-predictive information of HLA by AIRR sequencing alone (leveraging the indirect path AIRR ← HLA → disease), and (2) the degree to which HLA moderates the direct AIRR–disease relation so that ML models need to learn distinct predictive patterns for individuals with different HLAs.

So far, we have considered diagnostics development as a binary classification problem. However, it could be extended to consider multiple classes illustrative of disease stages for a single disease or multiple diseases80. One way to handle multiple disease stages might be to estimate an ML model in a one-versus-all fashion (with possibly shared representation of sequencing data), thus allowing distinct features to be learned as relevant for each disease stage. Multiple diseases could also interact, making the analysis more challenging81. Interactions could lead to structural causal models with cycles82. Finally, in dynamic treatment regime settings83, biomarkers could support adaptive treatment decisions through multiple stages of disease progression for individual patients.

Although we argue that causality is important for ML robustness and diagnostic development study design, causality is also an aim in itself in terms of describing biological mechanisms84. Establishing causal AIRR models would enable improvement of AIRR-based diagnostics and allow for causal interpretations and estimations of the effects of interventions. For example, a sufficiently detailed AIRR model and the methodology to successfully handle high-dimensional data may allow computational screening of new candidate therapies and vaccination procedures.