## Main

Medical image analysis has achieved tremendous progress in recent years thanks to the development of deep convolutional neural networks (DCNNs)1,2,3,4,5. At the core of DCNNs is visual representation learning6, where pre-training has been widely adopted and become the most dominant approach to obtain transferable representations. Typically, a large-scale dataset—also called the source domain—is first used for model pre-training. Transferable representations from the pre-trained model are further fine-tuned on other smaller downstream datasets, called target domains.

Radiologists and computer scientists have recently managed to build medical datasets for label-supervised pre-training at the size of hundreds of thousands of images, such as ChestX-ray11, MIMIC17 and CheXpert18. To acquire accurate labels for radiographs, these datasets often rely on a two-stage human intervention process. A radiology report is first prepared by radiologists for every patient study as part of the clinical routine. In the second stage, human annotators extract and confirm structured labels from these reports using hand-crafted rules and existing natural language processing (NLP) tools. This label extraction workflow has two major limitations. First, it is still complex and labour intensive; for example, human annotators have to define a list of alternate spellings, synonyms and abbreviations for every target label. Consequently, the final accuracy of extracted labels heavily depends on the quality of human assistance and various NLP tools. A small mistake in a single step or a single tool may give rise to disastrous annotation results. Second, those human-defined rules are often severely restricted to application-oriented tasks instead of general-purpose tasks. It is difficult for DCNNs to learn universal representations from such application-oriented tasks.

In this paper we propose reviewing free-text reports for supervision (REFERS) to directly learn radiograph representations from accompanying free-text radiology reports. We believe that the abstract and complex logical reasoning expressed in radiology reports provides sufficient information for learning well-transferable visual features. As shown in Fig. 1a, REFERS is realized using a set of transformers, in which the most important part is a radiograph transformer serving as the backbone. We choose the transformer as the backbone in REFERS mainly because it not only exhibits the advantages of DCNNs, but also has been shown to be more effective19 due to the self-attention mechanism20. Moreover, we have found that, in comparison to features generated from DCNNs, features from transformers are more compatible with textual tasks.

On four well-known X-ray datasets, REFERS outperforms self-supervised learning and transfer learning on natural source images in producing more transferable representations, often bringing impressive improvements (greater than 5%) under limited supervision from target domains. This capability can be extremely important in real-world applications as medical data are scarce and their annotations are usually hard to acquire. More surprisingly, we found that REFERS clearly surpasses those methods that employ a source domain with a large collection of medical images with structured labels. In terms of specific abnormalities and diseases, REFERS is quite effective under extremely limited supervision (<1,000 annotated radiographs during fine-tuning). For instance, REFERS brings about 9% improvements on pneumothorax. Meanwhile, over 7% improvements are achieved on two common lung diseases (atelectasis and emphysema).

## Results

All self-supervised learning and label-supervised pre-training (LSP) baselines, as well as REFERS, are first pre-trained on a source domain of medical images (that is, MIMIC-CXR-JPG23). Pre-trained models are then fine-tuned on each of four well-established datasets (target domains with labels), including NIH ChestX-ray11, VinBigData Chest X-ray Abnormalities Detection24, Shenzhen Tuberculosis25 and COVID-19 Image Data Collection26. During the fine-tuning stage, we always perform fully supervised learning on the target domain, which only consists of radiographs with structured labels. Furthermore, we verify model performance by varying the percentage of actually used training images (sampled from the predefined whole training set) in the target domain; this percentage is called the label ratio. When the label ratio is 100%, we use the whole training set in the target domain for fine-tuning.
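The label ratio can be illustrated with a short subsampling routine. This is a minimal sketch, not the authors' code: the function name `sample_label_ratio`, the fixed seed and the use of uniform random sampling without replacement are assumptions. The training-set size of 10,500 is simply 70% of the 15,000 VinBigData images, which is consistent with the 105 images reported for a 1% label ratio.

```python
import random


def sample_label_ratio(train_ids, label_ratio, seed=0):
    """Return a random subset of the training IDs for a given label ratio.

    Assumption: the subset is drawn uniformly without replacement from the
    predefined training set; the sampling scheme is not specified in the text.
    """
    if not 0.0 < label_ratio <= 1.0:
        raise ValueError("label_ratio must be in (0, 1]")
    n_keep = max(1, round(len(train_ids) * label_ratio))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return sorted(rng.sample(list(train_ids), n_keep))


# Hypothetical training split: 70% of 15,000 images.
train_ids = list(range(10_500))
subset = sample_label_ratio(train_ids, 0.01)
print(len(subset))
```

With a label ratio of 1.0 the function returns the whole training set, matching the 100% setting.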

### NIH ChestX-ray

Table 1 and Extended Data Figs. 1 and 2 present experimental results from REFERS and other approaches under different label ratios. As shown in Table 1 and Extended Data Fig. 1, our approach outperforms self-supervised baselines and transfer learning on natural source images substantially. To be specific, REFERS achieves the highest area under the receiver operating characteristic curve (AUC) on all 14 classes using different amounts of training data during the fine-tuning stage. Moreover, REFERS exhibits the greatest performance improvements with respect to these baselines when only 800 training images (1% label ratio) in the target domain are utilized. For example, REFERS surpasses the widely adopted ImageNet-based pre-training11 by about 7% on average. REFERS even gives quite competitive results when compared with LSP. Table 2 shows that the average performance of REFERS actually surpasses LSP, consistently maintaining an advantage of at least 2%. Compared with self-supervised baselines13,14,15,16 and ImageNet-based pre-training11, REFERS achieves the largest improvements on emphysema (7%) and cardiomegaly (>10%), especially under limited supervision. When compared with LSP, our method achieves consistent improvements on mass (>4%).

### VinBigData Chest X-ray Abnormalities Detection

REFERS exhibits a greater advantage on this target domain dataset than it does on NIH ChestX-ray, as VinBigData comprises a much smaller number of annotated radiographs (about one-eighth of the NIH dataset). This again demonstrates REFERS’s ability to manage with limited supervision. REFERS consistently maintains large advantages over other methods under different conditions (see Tables 1 and 2, and Extended Data Figs. 3 and 4). For instance, when we only have 105 annotated radiographs (1% label ratio) as fine-tuning data, REFERS surpasses C2L16, the best-performing self-supervised method, by over 7% in AUC. The performance of REFERS once again surpasses LSP with human-assisted structured labels even when all annotated training data (100% label ratio) in the target domain are used. When we check specific abnormalities and diseases, we find that REFERS consistently improves the diagnosis of atelectasis, lung opacity and pneumothorax in comparison to LSP.

### COVID-19 and Shenzhen Tuberculosis image collections

Both datasets serve as target domains and comprise a small number of labelled images (fewer than 1,000 X-rays), which are employed to test the transferability of the representation learned on the source domain. We adopted these two datasets as target domains because the few training images in such small target domains are not capable of training powerful models themselves; thus, the performance of the trained models is more dependent on the quality of the learned representation. In Table 1, although separating tuberculosis from normal cases is not a hard task, our method still achieves 2.5% improvements over C2L16 in AUC. When looking at the COVID-19 Image Data Collection dataset, which includes two harder tasks, we can find that the relative performance improvements over self-supervised baselines13,14,15,16 and transfer learning on natural source images11 become quite clear. For instance, on the viral versus bacterial task, REFERS outperforms C2L16 by 7% in AUC, demonstrating the effectiveness of REFERS in helping achieve better performance over small-scale target datasets. Even if we compare REFERS against LSP, the performance advantage is still maintained at more than 1%.

## Discussion

### REFERS outperforms self-supervised learning and transfer learning on natural source images by substantial margins

This is the most prominent observation obtained from our experimental results, which holds on different datasets and with different amounts of annotated training data during fine-tuning. Among self-supervised baselines13,14,15,16, C2L16 and TransVW15 are the two best-performing methods. REFERS outperforms C2L and TransVW by at least 4% when very limited annotated training data (at most 10% label ratio) from the NIH ChestX-ray and VinBigData datasets are used. Somewhat interestingly, as the label ratio increases, ImageNet-based pre-training11 gradually narrows its gap with self-supervised learning. Nonetheless, REFERS still surpasses it by a large margin (at least 4%). Similar results can also be observed on the Shenzhen Tuberculosis and COVID-19 Image Data Collection datasets. As REFERS learns in a cross-supervised manner, it does not require structured labels, unlike conventional fully supervised learning approaches. As radiographs and radiology reports are readily available medical data, we believe our approach is as practical as self-supervised learning methodologies in real-world scenarios.

### REFERS consistently surpasses label-supervised pre-training with human-assisted structured labels

This is another clear observation obtained from our experimental results. Although our approach does not use any structured labels in the source domain, our pre-trained model exhibits clear advantages over all four target domain datasets. Specifically, REFERS outperforms the most competitive LSP method, LSP (Transformer), which is based on a transformer and human-assisted structured labels in the source domain. In particular, our method shows more advantages at small label ratios. For instance, when NIH ChestX-ray and VinBigData are used as target domain datasets, REFERS achieves about 2.5% improvements when the number of training images is smaller than 10,000. Similarly, REFERS consistently surpasses LSP by significant margins (P-values < 0.01) on the Shenzhen Tuberculosis and COVID-19 Image Data Collection datasets. It is worth mentioning that when a classification problem is difficult to solve and has limited supervision, REFERS becomes more advantageous and achieves impressive improvements. For example, on the viral versus bacterial task (the last column in Table 2), REFERS surpasses label-supervised pre-training methods based on two-stage human intervention by approximately 4%. These improvements demonstrate that raw radiology reports contain more useful information than human-assisted structured labels. In other words, the advantages exhibited by our approach on small-scale target domain training data can be attributed to the rich information carried by radiology reports in the source domain. Such information provides additional supervision to help learn transferable representations for radiographs, whereas the supervision signals from structured labels have less information. We believe that this is an important step towards directly using natural language descriptions as supervision signals for image representation learning. As an example, REFERS can be used to learn natural image representations from text descriptions at corresponding websites.

### REFERS reduces the need for annotated data in target domains

Figure 2a,b presents the performance of our approach under various label ratios. On the NIH ChestX-ray dataset, REFERS needs 90% fewer annotated target domain data (10% label ratio) to deliver performances comparable with those of Model Genesis14 and ImageNet-based pre-training11. Similarly, on VinBigData, our method only needs 10% annotated training data to achieve much better results than those of Model Genesis and ImageNet-based pre-training under 100% label ratio. This phenomenon shows the potential of REFERS in providing high-quality pre-trained representations for downstream fine-tuning tasks with limited annotations. Given the difficulty of acquiring reliable annotations for medical image analysis, the ability to achieve good performance with limited annotations means much to the community.

### Improvements on specific abnormalities and diseases

In Extended Data Fig. 2, REFERS brings 5% performance gains on emphysema and mass even when compared with LSP with limited supervision in the target domain (<10,000 training images). As both abnormalities have a dispersed spatial distribution in the lung area, the considerable improvements demonstrate that REFERS is able to handle elusive chest abnormalities in radiographs well. REFERS becomes more advantageous when the amount of supervision in the target domain becomes extremely limited (for example, when using 105 training images from VinBigData). For instance, REFERS outperforms LSP on atelectasis and pneumothorax by over 7% and 9%, respectively (Extended Data Fig. 4). Unlike emphysema, mass and atelectasis, pneumothorax maintains a concentrated spatial distribution and is often located around the pleura. These successes imply that REFERS can deal with the diagnosis of both elusive and regular abnormalities and diseases well using a small number of training radiographs in the target domain. A similar phenomenon can be observed when REFERS is used for distinguishing viral pneumonia cases from bacterial cases in Tables 1 and 2.

### Transformer is more effective under limited supervision

We observe a trend of CNNs (that is, ResNet series4) in Tables 1 and 2: LSP (ConvNet) shows mediocre performance when a relatively small number of training images in the target domain are used; however, when all training data (100% label ratio) are used, ConvNet shows competitive results. It seems that LSP (ConvNet) cannot manage small amounts of supervision well. By contrast, LSP (Transformer) exhibits much better performance at small label ratios. This comparison demonstrates that pre-trained transformers generate more transferable representations than pre-trained CNNs. The underlying reason might be that the self-attention mechanism in transformers makes the learned representations more transferable due to captured long-distance dependencies.

### REFERS provides reliable evidence for clinical decisions

Figure 3 presents randomly chosen radiographs and their corresponding class activation maps27. We can find that REFERS generates reliable attention regions, on top of which we can apply a fixed confidence threshold to further identify the location of different types of lesions (green boxes in Fig. 3). The overall intersection-over-union values between green and red boxes (drawn by radiologists) are mostly higher than 0.5, indicating that the generated attention regions match radiologists' diagnoses well. When lesions have a large size (for example, the fifth image from NIH ChestX-ray, that is, Fig. 3e), our method captures well-aligned lesion areas. Even when lesions are quite small and therefore hard to detect (such as the last image from NIH ChestX-ray and the first image from VinBigData), REFERS can still identify the right locations.
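For reference, the intersection over union between a predicted box and a radiologist-drawn box can be computed as follows. This is a minimal sketch; the `(x_min, y_min, x_max, y_max)` box format and the example coordinates are assumptions made for illustration, not values taken from Fig. 3.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max) tuples; this format is an assumption.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical boxes: one derived from an attention map, one drawn by a radiologist.
predicted = (50, 60, 150, 160)
reference = (60, 70, 160, 170)
iou = box_iou(predicted, reference)
print(iou > 0.5)
```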

### Replication of experimental results and their statistical significance

A number of factors that influence pre-training results exhibit a certain level of randomness. These factors include, but are not limited to, network initialization, training strategy (for example, how to randomly crop images and perform mini-batch gradient descent) and even non-deterministic characteristics in computational tools (for example, cuDNN28 would choose different algorithms in different runs due to benchmarking noise and hardware configuration). A good pre-training methodology should be able to produce relatively stable pre-trained representations when randomness in these factors is controlled within an acceptable limit. To take into account the influence of such randomness on experimental results, when REFERS and baseline pre-trained models are fine-tuned, we independently repeat each experiment three times and report their average results in Tables 1 and 2. We then calculate P-values between mean class AUCs of REFERS and the best-performing baseline model according to their fine-tuned performance using an independent two-sample t-test. According to Tables 1 and 2, nearly all P-values are much smaller than 0.01, indicating that REFERS is significantly better than its counterparts when various amounts of labelled training data in the target domain are used. Repeating each experiment fewer than three times would give rise to less stable mean AUCs, whereas simply repeating more times would produce meaninglessly smaller P-values.

Last but not least, we provide a thorough ablation study of REFERS in Table 3. More details can be found in the Methods.
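The significance test on three repeated runs can be sketched in pure Python. Several assumptions are made here: the equal-variance (pooled) form of the two-sample t-test is used, since the text does not say whether a pooled or Welch variant was applied, and the AUC values below are invented for illustration. With three runs per method the test has 4 degrees of freedom, for which the Student's t CDF has a closed form.

```python
import math
from statistics import mean, variance


def t_cdf_df4(t):
    """Closed-form CDF of Student's t distribution with 4 degrees of freedom."""
    u = t / math.sqrt(1.0 + t * t / 4.0)
    return 0.5 + 0.375 * u * (1.0 - u * u / 12.0)


def two_sample_t_test_3runs(a, b):
    """Two-sided pooled-variance t-test for two groups of exactly three runs."""
    if len(a) != 3 or len(b) != 3:
        raise ValueError("this sketch only covers three runs per method (df = 4)")
    pooled_var = (variance(a) + variance(b)) / 2.0  # equal group sizes
    std_err = math.sqrt(pooled_var * (1.0 / 3.0 + 1.0 / 3.0))
    if std_err == 0.0:
        raise ValueError("runs have zero variance; the t statistic is undefined")
    t_stat = (mean(a) - mean(b)) / std_err
    p_value = 2.0 * (1.0 - t_cdf_df4(abs(t_stat)))
    return t_stat, p_value


# Invented mean AUCs from three independent fine-tuning runs per method.
method_a = [0.766, 0.770, 0.768]
method_b = [0.741, 0.745, 0.743]
t_stat, p_value = two_sample_t_test_3runs(method_a, method_b)
print(t_stat, p_value)
```

For more runs or unequal variances, a statistics library with a general t distribution would be the more practical choice.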

## Methods

### Dataset for pre-training (source domain)

MIMIC-CXR-JPG23 contains over 370,000 radiographs organized into patient studies, each of which may have one or more radiographs taken from different views, or at different times for the same patient. Each patient study has one free-text radiology report and each radiograph is associated with a set of abnormality/disease labels obtained from two-stage human-assisted intervention, as mentioned above. There are two major sections in each report: findings and impressions. The former includes detailed descriptions of important aspects in the radiographs, whereas the latter summarizes most immediately relevant findings.

To acquire human-assisted structured labels for radiographs (that is, two-stage human intervention), annotators need to first define a list of labels for abnormalities and diseases, including alternate spellings, synonyms and abbreviations. On the basis of local contexts and existing NLP tools, mentions of labels in reports are classified as positive, uncertain or negative. An aggregation procedure is further applied to aggregate multiple mentions of a single label. Uncertain labels need to be double-checked by radiologists.

As radiology reports were originally prepared by radiologists as part of the daily clinical routine, they can be regarded as freely available information that does not require extra human effort, in contrast to structured labels. In practice, we only keep the findings and impressions sections in the reports. We also remove from the dataset all study–report pairs in which the text section has fewer than three tokens (words and phrases). This screening procedure produces 217,000 patient studies.
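The screening rule can be sketched as a simple filter. Two points are assumptions rather than details from the text: tokens are approximated by whitespace-separated words, and the count is taken over the combined findings and impressions text. The function name `keep_study` is hypothetical.

```python
def keep_study(findings, impression, min_tokens=3):
    """Return True if the retained report text has at least `min_tokens` tokens.

    Assumptions: whitespace tokenization, and counting over both sections combined.
    """
    text = " ".join(part for part in (findings, impression) if part)
    return len(text.split()) >= min_tokens


# Invented report snippets for illustration.
studies = [
    {"findings": "No focal consolidation or pleural effusion.", "impression": "Normal chest."},
    {"findings": "", "impression": "As above."},
]
kept = [s for s in studies if keep_study(s["findings"], s["impression"])]
print(len(kept))
```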

### Datasets for fine-tuning (target domains)

We do not require the datasets adopted for fine-tuning to have radiology reports. Instead, only human-assisted annotations are used during the fine-tuning stage. We follow the official split of NIH ChestX-ray, where the percentages of training, validation and testing sets are 70%, 10% and 20%, respectively. The same set of ratios is also employed for the VinBigData Chest X-ray, Shenzhen Tuberculosis and COVID-19 Image Data Collection datasets to build randomly split training, validation and testing sets.

• NIH ChestX-ray is a dataset for multilabel classification of 14 chest abnormalities (that is, atelectasis, cardiomegaly, consolidation, oedema, effusion, emphysema, fibrosis, hernia, infiltration, mass, nodule, pleural thickening, pneumonia and pneumothorax). There are over 100,000 frontal-view X-ray images of about 32,000 patients in NIH ChestX-ray, where labels of radiographs were extracted from associated reports following a similar procedure as that for MIMIC-CXR-JPG.

• VinBigData Chest X-ray provides labels of 14 chest diseases (that is, aortic enlargement, atelectasis, pneumothorax, lung opacity, pleural thickening, interstitial lung disease, pulmonary fibrosis, calcification, pleural effusion, consolidation, cardiomegaly, other lesion, nodule-mass and infiltration) and consists of 15,000 postero-anterior chest X-ray images. Here we did not use the test set in Kaggle, which does not provide any annotations. All images were labelled by a panel of experienced radiologists.

• Shenzhen Tuberculosis is a small dataset containing 662 frontal chest X-ray images primarily from a hospital clinical routine; 336 abnormal X-rays show various manifestations of tuberculosis, whereas the remaining 326 images are normal. We simply perform binary classification on this dataset.

• COVID-19 Image Data Collection is a dataset involving more than 900 pneumonia cases with chest X-rays, which was built to improve the identification of COVID-19. We conduct experiments on two tasks, which (1) distinguish COVID-19 from non-COVID-19 cases and (2) separate viral pneumonia cases from bacterial cases.
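The random 70%/10%/20% split used for the datasets without an official division can be sketched as below. The seed, the rounding rule and the function name `split_dataset` are assumptions; whether splitting is done per image or per patient is not stated, and this sketch splits per image.

```python
import random


def split_dataset(ids, fractions=(0.7, 0.1, 0.2), seed=0):
    """Randomly split IDs into training, validation and testing subsets.

    Training and validation sizes are rounded; the test set takes the remainder.
    """
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


# Shenzhen Tuberculosis contains 662 images.
train, val, test = split_dataset(range(662))
print(len(train), len(val), len(test))
```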

### Baselines and label-supervised pre-training

As our method does not need the structured labels that are required by traditional fully supervised learning, we compare it against four recent self-supervised learning methods13,14,15,16 and ImageNet-based pre-training11:

• Context Restoration13 repeats the operation of swapping two randomly chosen small X-ray patches a fixed number of times, and the neural network is asked to restore each altered image back to its original version.

• Model Genesis14 applies multiple types of distortions to the input X-ray, including local shuffling, non-linear transformation, in- and out-painting. Similar to Context Restoration, Model Genesis asks the model to reconstruct the original image from the distorted one.

• TransVW15 contrasts local X-ray patches to exploit the semantics of anatomical patterns while restoring distorted image contents.

• C2L16 proposes to construct homogeneous and heterogeneous data pairs by mixing both images and features on top of MoCo29. C2L outperforms MoCo by observable margins on multiple X-ray benchmarks.

• ImageNet-based pre-training11 is taken as a representative method that sets a large-scale dataset of annotated natural images as the source domain.

Note that all above baselines are implemented using the same transformer-based network architecture as REFERS (that is, a ViT architecture plus the proposed recurrent concatenation module). Such an implementation arrangement is meant to rule out the influence of network architectures on final performance and maintain fairness in experimental comparisons.

Finally, our approach is compared with LSP, which directly sets a large collection of X-ray images with human-assisted structured labels as the source domain. For better comparison, we implement LSP on top of both CNN- and Transformer-based backbone networks. Specifically, LSP (Transformer) adopts the same Transformer-based network architecture as REFERS and the aforementioned self-supervised and ImageNet-based pre-training baselines. LSP (ConvNet) represents the best-performing residual network among ResNet-18, ResNet-50 and ResNet-101 (ref. 4).

### Data augmentation and image resizing

During the pre-training stage, we resize each radiograph in the source domain to 256 × 256 pixels and then apply random cropping to produce 224 × 224 images. Random horizontal flip, random rotation (–10 to 10 degrees) and random grayscale (brightness and contrast) are also applied to generate augmented images. When using random horizontal flip, we change the words left and right in the accompanying radiology report accordingly. During the fine-tuning stage, we apply the same set of data-augmentation strategies—random cropping, random rotation, random grayscale and random horizontal flip—to all four target domain datasets. As in the pre-training stage, we resize each radiograph in a target domain to 256 × 256, and then generate 224 × 224 cropped and augmented radiographs as input images.
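Keeping the report consistent with a horizontally flipped radiograph can be sketched as follows. This is an illustrative approximation, not the authors' implementation: the image is a plain list of pixel rows, and the word swap is a case-preserving regular-expression replacement that does not try to detect non-anatomical uses of the words.

```python
import re

_SWAP = {"left": "right", "right": "left"}


def swap_left_right(report):
    """Swap the words 'left' and 'right' in a report, preserving capitalization."""
    def repl(match):
        word = match.group(0)
        swapped = _SWAP[word.lower()]
        if word.isupper():
            return swapped.upper()
        if word[0].isupper():
            return swapped.capitalize()
        return swapped

    # A single pass over whole words, so swapped words are never swapped back.
    return re.sub(r"\b(left|right)\b", repl, report, flags=re.IGNORECASE)


def hflip(image):
    """Mirror a 2D image given as a list of pixel rows."""
    return [row[::-1] for row in image]


def flip_pair(image, report):
    """Flip the radiograph and update laterality words in its report together."""
    return hflip(image), swap_left_right(report)


flipped_image, flipped_report = flip_pair(
    [[1, 2, 3], [4, 5, 6]],
    "Small left pleural effusion. Right lung is clear.",
)
print(flipped_report)
```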

### Algorithm overview

REFERS performs cross-supervised learning on top of a transformer-based backbone, called radiograph transformer. Given a patient study, we first forward its views to the radiograph transformer for extracting view-dependent feature representations. We next perform cross-supervised learning that acquires study-level supervision signals from free-text radiology reports. To this aim, it is necessary and essential to use view fusion to obtain a unified visual representation for an entire patient study as each radiology report is associated with a patient study but not individual radiographs within the patient study. Such fused representations are then used in two tasks during the pre-training stage: report generation and study–report representation consistency reinforcement. The first task takes the free texts in original radiology reports to supervise the training process of the radiograph transformer. The second task reinforces the consistency between the visual representations of patient studies and the textual representations of their corresponding reports.

The radiograph transformer accepts image patches as inputs. We divide each image into a grid of 14 × 14 cells, each of which has 16 × 16 pixels. We then flatten each image patch to form a one-dimensional vector of pixels and feed it into the transformer. At the beginning of the transformer, a patch-embedding layer linearly transforms each one-dimensional pixel vector into a feature vector. This vector is concatenated with a position feature produced from a learnable position embedding to help clarify the relative location of each patch in the whole input patch sequence. The concatenated feature is then passed through another linear transformation layer to make its dimensionality the same as that of the final radiograph feature. We stack twelve self-attention blocks at the core part of the radiograph transformer, which have the same architecture but independent parameters (Fig. 1b). We first follow the practice in ref. 20 to build a single self-attention block and then repeat its operations multiple times. In each block, we apply layer normalization30 before the multi-head attention and perceptron layers, after which residual connections are added to stabilize the training process. In the perceptron layer, we employ a two-layer perceptron with the rectified linear unit31 as the activation function. Moreover, we add an aggregation embedding, which is responsible for gathering the information from different input features. As shown in Fig. 1b, in the last layer, recurrent concatenation is performed to repeatedly concatenate the learned aggregation embedding with the learned representation of every patch. This is different from the operation in ViT19, which only concatenates the aggregation embedding with patch features once.
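The patch division step can be made concrete with a small routine that cuts a 224 × 224 image into a 14 × 14 grid of 16 × 16 patches and flattens each one. This sketch assumes a single-channel image stored as a list of pixel rows and a row-major patch order; neither detail is specified in the text, and the embedding layers are omitted.

```python
def patchify(image, patch_size=16):
    """Split a 2D image into non-overlapping square patches, each flattened.

    Patches are returned in row-major order (left to right, top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ])
    return patches


image = [[0.0] * 224 for _ in range(224)]  # placeholder radiograph
patches = patchify(image)
print(len(patches), len(patches[0]))  # 196 256
```

Each of the 196 flattened vectors has 256 pixel values, which is the input to the patch-embedding layer described above.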

### Cross-supervised learning

There are two major components in cross-supervised learning: the view fusion module for producing study-level representations and two report-related tasks exploiting study-level information from associated free-text reports.

As mentioned above, we forward all radiographs in a patient study through the radiograph transformer simultaneously to obtain their individual representations. We further employ an attention mechanism to fuse these individual representations and obtain an overall representation of the given study. Suppose a study has three radiographs (that is, views), as shown in Fig. 1c. We first concatenate the features of all views and then feed the concatenated features to a multilayer perceptron to compute an attention value for each view. We next apply the softmax function to normalize these attention values, which are used as weights to produce a weighted version of the individual representations. Finally, these weighted representations are concatenated to form a unified visual feature for describing the whole study. Note that for studies that contain fewer than three radiographs, we randomly select one of the radiographs, and then repeat it once or twice to have a total of three views. For studies that contain more than three radiographs, we randomly select three of them from each study as input views.
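The view selection rule and the attention-weighted fusion can be sketched as below. The multilayer perceptron that produces the attention values is omitted, so `attention_scores` stands in for its output; this and the function names are assumptions made for illustration.

```python
import math
import random


def select_three_views(views, rng=random):
    """Return exactly three views following the padding and sampling rule."""
    views = list(views)
    if not views:
        raise ValueError("a study needs at least one radiograph")
    if len(views) > 3:
        return rng.sample(views, 3)
    if len(views) < 3:
        # Pick one radiograph at random and repeat it once or twice.
        views = views + [rng.choice(views)] * (3 - len(views))
    return views


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(view_features, attention_scores):
    """Weight each view's feature by its softmax attention and concatenate."""
    weights = softmax(attention_scores)
    fused = []
    for weight, feature in zip(weights, view_features):
        fused.extend(weight * x for x in feature)
    return fused


# Toy 2-dimensional features for three views with equal attention scores.
fused = fuse_views([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]], [0.0, 0.0, 0.0])
print(len(fused))
```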

We design two report-related tasks that acquire cross-supervision signals from free-text reports: report generation and study–report representation consistency reinforcement. In practice, these two tasks exploit study-level free-text information to better train study-level visual representations produced from the view fusion module. The first task applies a decoder called report transformer to the unified visual feature $${\bf{v}}^{k}$$ of the kth patient study to reproduce its associated radiology report, denoted as $${c}_{1:T}^{k}$$. Here, $${c}_{1}^{k}$$ and $${c}_{T}^{k}$$ represent the start- and end-of-sequence tokens, respectively. As a result, the report transformer generates a sequence of token-level predictions, $${\hat{c}}_{1:T}^{k}$$, for the kth patient study. The prediction of the tth token in this sequence depends on the predicted subsequence $${\hat{c}}_{1:t-1}^{k}$$ and the visual feature $${\bf{v}}^{k}$$. The network architecture of the report transformer follows the architecture of the decoder in ref. 20. We want the predicted token sequence ($${\hat{c}}_{1:T}^{k}$$) to resemble the sequence ($${c}_{1:T}^{k}$$) representing the original report of the kth patient study; therefore, as shown in Fig. 1d, we apply a language modelling loss to both $${\hat{c}}_{1:T}^{k}$$ and $${c}_{1:T}^{k}$$ to maximize the following log-likelihood of the tokens in the original report:

$${\mathcal{L}}_{\text{language}}^{k}=\sum\limits_{t=2}^{T}\log P\left({c}_{t}^{k}\mid {\hat{c}}_{1:t-1}^{k},{\bf{v}}^{k};{\phi }_{v},{\phi }_{t}\right),$$
(1)

where P denotes conditional probability, $${\hat{c}}_{1}^{k}$$ is a special symbol indicating the start of the predicted sequence, and $${\phi }_{v}$$ and $${\phi }_{t}$$ are the parameters of the radiograph and report transformers, respectively.
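Equation (1) can be evaluated numerically once the decoder's per-step token distributions are available. The sketch below assumes these distributions are given as plain lists indexed by token ID; the four-token vocabulary and the probabilities are invented. As Eq. (1) is a log-likelihood to be maximized, a training loop would typically minimize its negative.

```python
import math


def language_log_likelihood(step_probs, target_tokens):
    """Sum of log P(c_t | ...) for t = 2..T, as in Eq. (1).

    step_probs[i] is the predicted distribution for token t = i + 2;
    target_tokens is the full reference sequence c_1..c_T.
    """
    targets = target_tokens[1:]  # c_1 is the start token and is not predicted
    if len(step_probs) != len(targets):
        raise ValueError("need one predicted distribution per token c_2..c_T")
    return sum(math.log(probs[tok]) for probs, tok in zip(step_probs, targets))


# Invented vocabulary: 0 = start, 1 = end, 2 and 3 = ordinary words.
reference = [0, 2, 3, 1]
step_probs = [
    [0.10, 0.10, 0.70, 0.10],  # distribution for c_2
    [0.10, 0.10, 0.20, 0.60],  # distribution for c_3
    [0.05, 0.85, 0.05, 0.05],  # distribution for c_4 (end token)
]
log_likelihood = language_log_likelihood(step_probs, reference)
print(log_likelihood)
```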

For the second task on study–report representation consistency reinforcement, we employ a contrastive loss32 to align cross-modal representations. Here we use $${\bf{t}}^{k}$$ to represent the textual feature vector of the kth radiology report. In practice, we obtain $${\bf{t}}^{k}$$ by forwarding the sequence of tokens in the kth report (that is, $${c}_{1:T}^{k}$$) to a bidirectional encoder representations from transformers (BERT) model33. BERT is built on top of the encoder in ref. 20 using large-scale pre-training on a great number of corpus resources; thus, BERT can help produce a generalized textual representation for the input report. Suppose we have B patient studies in each training mini-batch, as shown in Fig. 1d. The contrastive loss for the kth study can be formulated as

$${\mathcal{L}}_{\text{contrast}}^{k}=-\log \frac{{e}^{\cos ({\bf{v}}^{k},{\bf{t}}^{k})/\tau }}{\sum\nolimits_{i=1}^{B}{e}^{\cos ({\bf{v}}^{k},{\bf{t}}^{i})/\tau }},$$
(2)

where $$\cos (\cdot ,\cdot )$$ is the cosine similarity, $$\cos ({\bf{v}}^{k},{\bf{t}}^{k})=\frac{{({\bf{v}}^{k})}^{\top }{\bf{t}}^{k}}{\parallel {\bf{v}}^{k}\parallel \parallel {\bf{t}}^{k}\parallel }$$, $$\top$$ denotes the transpose operation, $$\parallel \cdot \parallel$$ represents the L2 norm, and τ is the temperature factor. Finally, for each patient study, we simply sum up $${\mathcal{L}}_{\text{contrast}}^{k}$$ and $${\mathcal{L}}_{\text{language}}^{k}$$ as the overall loss. During the fine-tuning stage, we typically use the cross-entropy loss for model tuning.
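Equation (2) can be checked on toy vectors with a few lines of Python. This is a sketch under stated assumptions: the temperature value of 0.1 is a placeholder because τ is not given here, and the features are small hand-made lists rather than outputs of the radiograph transformer and BERT.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(visual, textual, k, tau=0.1):
    """Contrastive loss of Eq. (2) for the k-th study in a mini-batch.

    visual[i] and textual[i] are the features of the i-th study and its report.
    """
    logits = [cosine(visual[k], t) / tau for t in textual]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
    return -(logits[k] - log_denominator)


# Toy mini-batch of B = 2 studies with perfectly aligned features.
visual = [[1.0, 0.0], [0.0, 1.0]]
textual = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(visual, textual, k=0)
mismatched = contrastive_loss(visual, textual[::-1], k=0)
print(aligned < mismatched)
```

The loss is small when a study's visual feature is closest to its own report and grows when another report in the mini-batch is a better match.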

### Training and testing methodologies

We first pre-train the radiograph transformer on the source domain and then fine-tune it on downstream target domain datasets to verify the quality of pre-training. During the pre-training stage, we sample 4,600 studies to form a held-out validation set according to the official division of the MIMIC-CXR-JPG dataset23. We train the entire network using stochastic gradient descent (SGD) while setting the momentum value to 0.9 (ref. 34) and the weight decay to 1 × 10⁻⁴. Following ref. 33, we do not apply weight decay to layer normalization and the bias terms in all layers. We use a fixed batch size of 32 for 300,000 iterations (about 45 epochs). We calculate the validation loss after each epoch and save the checkpoint that achieves the lowest validation loss. We adopt the linear learning rate warm-up strategy35 for the first 10,000 iterations, and then switch to cosine decay36 until the end. Empirically, we found that training the radiograph transformer requires a large learning rate for fast convergence; thus, its learning rate is set to 3 × 10⁻³, and set to 3 × 10⁻⁴ for the report transformer and BERT. We initialize the aggregation embedding to all zeros while randomly initializing all position embeddings. We use PyTorch37 and NVIDIA Apex for mixed-precision training38. The complete pre-training process on the MIMIC-CXR dataset takes about two days on a single RTX 3090 GPU.
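The learning rate schedule can be written as a function of the iteration index. This is a sketch under assumptions: the warm-up is taken to rise linearly from near zero to the peak rate, and the cosine decay is taken to end at zero, since neither the starting value nor the final value is stated.

```python
import math


def learning_rate(step, base_lr=3e-3, warmup_steps=10_000, total_steps=300_000):
    """Linear warm-up followed by cosine decay.

    Assumptions: warm-up starts near zero and the decay ends at zero.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Learning rate of the radiograph transformer at a few iterations.
for step in (0, 5_000, 10_000, 155_000, 300_000):
    print(step, learning_rate(step))
```

The same function with `base_lr=3e-4` would describe the report transformer and BERT under the same assumptions.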

During the fine-tuning stage, we fine-tune all transformer-based models (including transformer-based baselines) using SGD with the momentum set to 0.9 and the initial learning rate set to 3 × 10⁻³ for all datasets. We fine-tune ResNet models using Adam39 instead of SGD, and set the initial learning rate to 1 × 10⁻⁴. All downstream models use the same learning rate decay strategy as that used in the pre-training stage, and are trained with a batch size of 128.

### Ablation study

We conduct a thorough ablation study of REFERS by removing or replacing individual modules; the results are shown in Table 3.

First, we investigate the impact of replacing the radiograph transformer (rows 1 and 2 in Table 3). If we replace the radiograph transformer with ResNet-101 (ref. 4; row 1), the overall performance of REFERS on COVID-19 Image Data Collection would drop by about 7% (compared with row 0). This comparison demonstrates that the radiograph transformer is more effective at dealing with limited annotations, which is also verified by the results in Tables 1 and 2. Next, when we replace the radiograph transformer with the original ViT architecture (row 2), which does not have the recurrent concatenation operator, the overall performance would drop by 3.3%. This result verifies the helpfulness of recurrently concatenating the learned aggregation embedding with patch representations. We also note that there exists a 3.8% performance difference between ResNet and ViT based architectures (rows 1 and 2), showing the advantage of a transformer-like architecture.