Introduction

Medical image registration is the process of aligning one image to another originating from the same or a different modality. The aligned image contains more spatial–temporal information, which is important for applications such as image-guided surgery1, disease monitoring2 and risk prediction3. Registration between images of the same modality is mono-modal registration, and registration between images of different modalities is multimodal registration. Different imaging techniques are sensitive to different tissues in the body; therefore, images of different modalities need to be registered with each other to provide complementary information. However, this is difficult because of the complex relationship between the intensities of corresponding structures in the two images. Ultrasound (U/S) images are especially challenging because of their large motion, small field of view and low scan quality. Nonetheless, 3D–2D registration is needed. The potential of deep learning for these problems has not yet been fully realised4. In this work, we propose a two-step deep learning method to address 3D computed tomography (CT) to 2D ultrasound (3DCT-2DUS) kidney registration.

State-of-the-art (SOTA) methods5 can be classified, according to learning strategy, as supervised, weakly supervised and unsupervised registration, or, according to baseline network architecture, as convolutional neural network (CNN)-based, deep adversarial network-based and transformer-based image registration. Supervised registration6 is trained to predict the transformation by using images and their ground truth transformations. Weakly supervised registration7,8,9 uses overlapping segmentations of anatomical structures as a loss function, which lowers the limitations associated with ground truth data. Unsupervised registration10,11,12,13,14,15 is trained by minimising a dissimilarity measure given a set of images and does not need ground truth transformations. CNN-based image registration16,17 trains a designed CNN architecture and learns the mapping between the input images and the deformation fields. Deep adversarial image registration18,19 consists of a generator network and a discriminator network: the generator is trained to generate transformations, and the discriminator learns a similarity metric to ensure that the generated transformations are realistic or that the input images are well registered. Vision Transformer (ViT)-based registration20,21,22,23,24 learns the inherent relationships among data through the attention mechanism. Our solution is CNN-based unsupervised registration. We refer to registration as unsupervised learning because the registration subnet is trained without supervision. The feature subnets are trained separately and not specifically for the registration task; they are independent feature extractors, and universal features are also applicable to our solution.

Most 3D–2D registration is supervised projective registration, in which the 2D image is a projection of the 3D volume. Miao6 proposed using CNN regression to register 2D X-ray images with 3D digitally reconstructed radiograph (DRR) images; ground truth transformations were available for their application. Foote25 proposed tracking tumours using a single fluoroscopic projection via a supervised learning-based method, which densely sampled the CT volumes and calculated DRR projections with a linear combination of motion components via DenseNet26. Salehi27 estimated the pose of a 2D MR plane within the MR volume using a supervised regression CNN. Liao28 and Krebs29 proposed employing reinforcement learning to conduct 2D/3D registration by learning a series of actions. Our method is sliced 3D–2D registration: no assumption of projective geometry is made, and no ground truth transformation is used for training. In addition, it faces the challenge of a very large potential search space. We address those difficulties using the proposed novel solution. Guo30 proposed aligning a 2D TRUS frame with a 3D reconstructed TRUS volume using a deep registration network. Their method used CNN regression to estimate transformation parameters and compared them with the ground truth transformation via a mean squared error loss. It sampled a 2D slice using the estimated transformation and compared the slice with the 2D TRUS image via an auxiliary image similarity loss. The authors found that unsupervised learning did not result in stable training; therefore, the network was trained on mono-modal images under supervised learning with the combined loss. Wei31 proposed registering vessel labels on 2D U/S images to vessel labels on 3DCT/MR liver images using deep registration, which was a mono-modal approach. They labelled the images manually and did not analyse the complex relationship between the intensities of corresponding structures in multimodality images. The registration model was trained under supervised learning and was followed by a conventional plane fitting process.

Deep affine registration uses an encoder structure14, Siamese encoder structure32 or ViT-based structure33. These are 3D–3D single-modality image registration methods and cannot serve as a full solution for 3DCT-2DUS registration. Moreover, for registration at multiple scales, two or three encoders are stacked together due to capacity limitations. In contrast, we propose using an encoder-decoder structure, which enables multiple-scale registration by hierarchical regression of transformation parameters from the decoder layers. It allows more scales, which benefits large-transformation registration and fast model convergence.

Balakrishnan10 proposed VoxelMorph to perform unsupervised registration. We adopted their baseline architecture and improved it into a hierarchical architecture to generate transformation parameters at multiple scales. Hu8 proposed using image features to guide registration. Our work is conceptually superior to their work8 in several aspects. The first is the architecture: in their work, original images are used to train the registration network, and the ground truth image segmentation labels are used to calculate the loss function, whereas our work uses a network to represent images and inputs the image representations to the registration network, so that both voxel-level image content and high-level features contribute to registration. The second is the loss function: a modality-independent neighbourhood descriptor (MIND)34 is used to measure the image similarity of the CT–US pair. Here, we assume that a high-level feature can drive alignment close to its optimum and that the MIND loss is locally continuous. A gradient loss is designed to regularise respiration motion smoothness on windowed U/S scans. The third is the application: they address 3DCT-3DTRUS prostate registration, whereas we address 3DCT-2DUS kidney registration using deep rigid registration. We pretrain a model by using an unsupervised learning cum data generation scheme and refine the model by one-cycle transfer learning. Heinrich35 proposed a discrete 3DCT-CT registration, which used two optimisation steps: unsupervised learning for global search and a correlation layer for local optimal search. Our method integrates features, images, and motion metrics into the loss function and conducts one-step transformation estimation.

In this work, we contributed a novel deep learning pipeline for sliced 3DCT-2DUS kidney registration. It overcame two main challenges: registering images of different dimensions and of different imaging modalities. To address the dimension difference, the U/S images were first expanded to the same dimensions as the CT by zero padding. Because there were relatively few spatial constraints between the CT volume and the 2D U/S slices compared with 3D–3D volume registration, it was necessary to move the 3D CT effectively. We proposed using a rigid encoder-decoder registration network with hierarchical regression of transformations from each decoder level. Transformations for images from low to high resolution were combined at the highest resolution via weighted translation and rotation. In addition to hierarchical regression, we designed a combined loss to drive image sequence alignment accurately, via global deep kidney features and local modality-independent image features, and smoothly, via the transformation of consecutive frames. Moreover, to further improve registration performance and ensure efficiency in clinical applications, we proposed training the registration network without supervision in two steps: pretraining the model using the general training datasets and adaptively training the model for two epochs using patient-specific training datasets via one-cycle transfer learning. Training data generation was proposed to generate image pairs for general training. To address the different imaging modalities, we proposed extracting deep kidney features from CT and U/S images for overall comparison and extracting modality-independent image features for comparison of local details. The feature network was designed with handcrafted texture layers to reduce the semantic gap. Furthermore, we applied a time window to the U/S sequence to improve kidney observation on noisy images by including respiration motion information. Generally, the methodology addressed all issues in kidney registration during free breathing. To the best of our knowledge, this is the first deep learning pipeline for sliced 3DCT-2DUS kidney registration.

Methods

According to clinicians, the human kidney seldom deforms with patient posture changes or respiration. Therefore, we used a rigid transform for kidney image registration. The proposed solution consists of 3DULBNet36 and a 3D–2D hierarchical registration network. 3DULBNet was trained separately offline on CT and windowed U/S images for binary segmentation tasks. The feature networks were then connected to the registration network to predict the CT plane that best aligned with the U/S images.

Feature network

The ULBNet is a 5-level U-Net with a residual block replacing the original convolutional layer (Appendix A). A local binary convolution (LBC) layer37 was added to the skip connections. The dropout rate was set to 0.2. For CT images, the patch size was 160 × 160 × 80, and the batch size was 1. The optimiser was Adam. The loss function was the negative Dice coefficient. The output layer was a convolutional layer with sigmoid activation, and its output was a kidney feature map of the CT image. The feature map was a probability map of a pixel/voxel being kidney and described the global shape characteristics of the kidney. ULBNet36 can be referenced for details of the method used to process CT images.
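For concreteness, the sketch below shows a soft negative Dice loss of the kind described above; the smoothing constant and reduction over axes are our assumptions, not the authors' exact implementation.

```python
import tensorflow as tf

def negative_dice_loss(y_true, y_pred, eps=1e-6):
    """Negative soft Dice coefficient, the loss used to train the feature nets.

    y_true: binary kidney mask; y_pred: sigmoid probability map of the same shape.
    """
    axes = tuple(range(1, len(y_pred.shape)))                 # reduce over all non-batch axes
    intersection = tf.reduce_sum(y_true * y_pred, axis=axes)
    denom = tf.reduce_sum(y_true, axis=axes) + tf.reduce_sum(y_pred, axis=axes)
    return -tf.reduce_mean((2.0 * intersection + eps) / (denom + eps))
```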

U/S images are noisy, and it is difficult to delineate the kidney from a single image. Kidney motion information is useful: since the U/S sequence scans the kidney during respiration, the kidney motion cannot be observed in one frame but can be observed in a few consecutive frames. The U/S image sequences were experimentally windowed with a size of 5 (Appendix E). Thus, instead of extracting features from a single 2D U/S frame, we extracted them from a 256 × 192 × 5 volume. In the U/S feature net, the input size was 256 × 192 × 5, down/upsampled by 2 at each level; the data were not downsampled in the time dimension. The numbers of feature maps were 16, 32, 64, 128, and 256 in the encoder pathway and 256, 128, 64, 32, and 16 in the decoder pathway. The output layer was a convolutional layer with sigmoid activation that output a U/S kidney feature map. The kidney feature maps of CT and U/S images had the same dimensions as the inputs. The windowed kidney feature images were used to construct the CT–US image pair.
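Because the 5-frame time axis is not downsampled, the encoder needs anisotropic strides. A minimal Keras sketch of one such encoder level is given below; the block layout, activations and helper names are illustrative assumptions, not the published architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

def us_encoder_level(x, n_filters):
    """One assumed encoder level of the U/S feature net: two convolutions followed
    by a strided convolution that halves height and width but keeps the 5-frame
    time axis (stride (2, 2, 1))."""
    x = layers.Conv3D(n_filters, 3, padding="same", activation="relu")(x)
    skip = layers.Conv3D(n_filters, 3, padding="same", activation="relu")(x)
    down = layers.Conv3D(n_filters, 3, strides=(2, 2, 1), padding="same",
                         activation="relu")(skip)
    return down, skip

inputs = layers.Input(shape=(256, 192, 5, 1))   # H x W x time window x channel
x, skip1 = us_encoder_level(inputs, 16)          # 128 x 96 x 5
x, skip2 = us_encoder_level(x, 32)               # 64 x 48 x 5
```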

Registration network

We present kidney registration (Fig. 1) in five aspects: image pair preprocessing, network structure, loss function, training data generation and learning strategy.

Figure 1

The 3D–2D registration network structure.

Preprocessing

All CT images were converted to RAI orientation and isotropically resampled to 0.8 mm × 0.8 mm × 0.8 mm. CT scans were automatically cropped to 128 × 224 × 288 around the centroid. Each U/S image was resampled to 0.8 mm × 0.8 mm and cropped to 224 × 288. Windowed U/S images were centred in a 128 × 224 × 288 volume with zero padding. Within one U/S window, the middle frame was the registration target, and the other frames contributed to motion regularisation. U/S images were stacked along the time axis, which coincided with the R–L axis in image space. The CT volume was aligned with the U/S by matching the kidney centroids from the feature maps. Because U/S scanning of the kidney was subject to the constraints of the ribs and spines of patients, the variability of the initial position of 3DCT-2DUS pairs could be large. To reduce this variability, we uniformly aligned the kidney on CT along the inferior-superior axis and then aligned the kidney on U/S by the centroid. The inputs to the registration network were CT and U/S window images of 128 × 224 × 288. CT was the moving image, and U/S was the fixed image.
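The padding and initial centroid alignment can be summarised as a short NumPy sketch; the thresholding of the probability maps at 0.5 and the integer-voxel translation are our assumptions, and the helper names are ours.

```python
import numpy as np

def centre_pad(us_window, target_shape=(128, 224, 288)):
    """Zero-pad a windowed U/S stack (Nw x 224 x 288) to the CT grid size,
    placing it at the centre of the volume."""
    out = np.zeros(target_shape, dtype=us_window.dtype)
    start = [(t - s) // 2 for t, s in zip(target_shape, us_window.shape)]
    idx = tuple(slice(b, b + s) for b, s in zip(start, us_window.shape))
    out[idx] = us_window
    return out

def align_centroids(ct_vol, ct_feat, us_feat, thr=0.5):
    """Translate the CT volume so that the kidney centroid of its feature map
    matches the kidney centroid of the U/S feature map."""
    shift = (np.mean(np.argwhere(us_feat > thr), axis=0)
             - np.mean(np.argwhere(ct_feat > thr), axis=0))
    return np.roll(ct_vol, np.round(shift).astype(int), axis=(0, 1, 2))
```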

Network structure

The architecture is shown in Fig. 1. The CT and U/S volumes were concatenated before entering the convolution layers. The numbers of feature maps were 8, 16, and 16 in the encoder pathway and 16, 16, 16, 16, 8, and 8 in the decoder pathway. Upsampling was performed using Upsampling3D, and downsampling was performed with strided convolution. Transformation regression used the affine block. At each decoder level, the activation of the output (dense) layer was tanh for the transformation parameters. The spatial transformer network (STN) layer38 applied a rigid transformation because the kidney seldom deforms during free respiration. The optimiser was Adam, and the learning rate was 3e−4. Given that the window size of the U/S images was Nw, the output transformation was a set of 6Nw rigid transformation parameters, 3Nw for rotation and 3Nw for translation. Nw was experimentally set to 5 in this work (Appendix G). Each layer in the decoder pathway output a set of transformation parameters, and their weighted sum composed the final transformation. For rotation, the final value was the average of the rotation parameters. For translation, it was a weighted sum of the translation parameters with weights {8, 4, 2, 1}/4 = {2, 1, 0.5, 0.25}. Rotation is scale invariant and was therefore equally weighted; translation is inversely proportional to the image resolution and was therefore proportionally weighted. Hierarchical activation improved prediction and decreased the training time. The number of trainable parameters was approximately 282 M.
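The hierarchical composition of the per-level predictions can be illustrated as follows; the level ordering (coarsest first) and the array layout are assumptions made for the sketch.

```python
import numpy as np

LEVEL_WEIGHTS = np.array([2.0, 1.0, 0.5, 0.25])   # {8, 4, 2, 1} / 4, assumed coarse-to-fine order

def fuse_levels(level_params):
    """level_params: list of four (Nw, 6) arrays, each row holding 3 rotation and
    3 translation parameters predicted at one decoder level."""
    params = np.stack(level_params)                               # (4, Nw, 6)
    rotation = params[..., :3].mean(axis=0)                       # rotations: equal weighting
    translation = np.tensordot(LEVEL_WEIGHTS, params[..., 3:], 1) # translations: resolution-weighted sum
    return np.concatenate([rotation, translation], axis=-1)       # (Nw, 6) fused rigid parameters
```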

Loss function

The feature map describes the overall kidney, and the MIND feature describes image details; they complement each other to measure alignment accurately. As registration occurs during breathing, the obtained CT cutting planes should stack together as a smooth time sequence. Given the fixed image \({I}_{fix}\), moving image \({I}_{mov}\), fixed feature map \({M}_{fix}\), moving feature map \({M}_{mov}\), and transformation parameters \(\mathcal{D}\), the feature-image-motion (FIM) loss was defined (Eq. 1) by three measures on the feature map, the original image, and the transformation:

$$\mathcal{L}\left({I}_{fix},{ I}_{mov},{M}_{fix},{M}_{mov},\mathcal{D}\right)={\mathcal{L}}_{f}\left({M}_{fix},\mathcal{D}\circ {M}_{mov}\right)+{\lambda }_{1}{\mathcal{L}}_{i}\left({I}_{fix},\mathcal{D}{ \circ (I}_{mov}\cdot {M}_{mov})\right)+{{\lambda }_{2}\mathcal{L}}_{d}\left(\mathcal{D}\right)$$
(1)
$${\mathcal{L}}_{f}\left({M}_{fix},\mathcal{D}\circ {M}_{mov}\right)=-\frac{1}{{N}_{w}}\sum_{i\in {N}_{w}}{DICE}_{i}\left({M}_{fix},{\mathcal{D}}_{i}\circ {M}_{mov}\right)$$
(2)
$${\mathcal{L}}_{i}\left({I}_{fix},\mathcal{D}{ \circ (I}_{mov}\cdot {M}_{mov})\right)=\frac{1}{\left|N\right|}\sum \left|MIND\left({I}_{fix}\right)-MIND(\mathcal{D}{ \circ (I}_{mov}\cdot {M}_{mov}))\right|$$
(3)
$${\mathcal{L}}_{d}\left(\mathcal{D}\right)=0.01 \, * \, \frac{\Vert \mathcal{D}-{I}_{4\times 4}\Vert }{{N}_{w}}+0.99 \, * \, grad\mathcal{D}, \,\,and \,\,grad\mathcal{D}=\frac{1}{{N}_{w}-2}\sum_{i-1,i,i+1\in {N}_{w}}\Vert {D}_{i+1}+{D}_{i-1}-2{D}_{i}\Vert.$$
(4)

The feature loss was the negative Dice coefficient of the fixed and warped kidney feature maps (Eq. 2), where \(DICE\left(x,y\right)= \frac{1}{m}\sum \left(\frac{2x \odot y}{x \oplus y}\right)\), \(\odot\) and \(\oplus\) are elementwise multiplication and addition, and m is the number of elements. DICE was calculated between the middle slice along the R–L axis of the warped CT volume and the corresponding slice in the fixed U/S volume. The feature loss was the average DICE within the time window. The image loss was the MIND feature difference between the fixed and warped original images (Eq. 3)39. A MIND feature was calculated by a Gaussian function of the mean squared difference between a central patch of the image and one of its six neighbouring patches. The neighbourhood was 3D, and the patch size was 3 × 3 × 3. MIND features were extracted from the CT volume and from the U/S volume to capture 3D local image information as the kidney moved. The U/S volume had 5 frames, and only the feature of its middle frame was valid and used to calculate the image loss. The image loss was the mean absolute difference between the MIND features of the middle slice along the R–L axis in the warped CT volume and those in the fixed U/S volume, comparing the two images in local detail. N was the set of displacement vectors. Here, the CT images need to be masked before calculating MIND because CT images display extra anatomic structures in addition to the kidney. The transformation loss (Eq. 4) was a weighted sum of the L2-norm divided by the number of parameters and the average transformation difference. \({\lambda }_{1}, {\lambda }_{2}\) were empirically set to 0.01 and 0.001. Here, we assumed that MIND is locally continuous. \({\lambda }_{1}\) was set to 0.01 so that the feature loss dominates the weight updates initially and, when approaching the global optimum, the image loss takes effect to fine-tune the weights. The transformation loss regularised the motion transformation to move the CT to its optimal position aligned with the U/S images. Basically, the feature loss, \({\mathcal{L}}_{f}\), was calculated on kidney feature maps. The image loss, \({\mathcal{L}}_{i}\), was computed between the U/S images and the masked CT images, implemented by elementwise multiplication of the warped CT images and the warped CT kidney feature map. The loss was calculated on kidney regions, thus eliminating the influence of the difference between the fields of view of the CT and U/S images.
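A compact TensorFlow sketch of the combined loss is shown below. `mind_fn` stands in for a MIND descriptor extractor (not implemented here), the weights follow the text, and the assumption that the transformations D are stacked 4 × 4 rigid matrices is ours.

```python
import tensorflow as tf

def soft_dice(x, y, eps=1e-6):
    axes = tuple(range(1, len(x.shape)))
    return (2.0 * tf.reduce_sum(x * y, axis=axes) + eps) / (
        tf.reduce_sum(x, axis=axes) + tf.reduce_sum(y, axis=axes) + eps)

def fim_loss(M_fix, M_warp, I_fix, I_warp_masked, D, mind_fn,
             nw=5, lam1=0.01, lam2=0.001):
    """Feature-image-motion loss of Eqs. (1)-(4); `mind_fn` is a caller-supplied
    MIND feature extractor and D has assumed shape (nw, 4, 4)."""
    l_f = -tf.reduce_mean(soft_dice(M_fix, M_warp))                        # Eq. (2)
    l_i = tf.reduce_mean(tf.abs(mind_fn(I_fix) - mind_fn(I_warp_masked)))  # Eq. (3)
    dev = tf.norm(D - tf.eye(4, batch_shape=[nw])) / nw                    # deviation from identity
    grad = tf.reduce_mean(tf.norm(D[2:] + D[:-2] - 2.0 * D[1:-1],
                                  axis=[-2, -1]))                          # second-order smoothness
    l_d = 0.01 * dev + 0.99 * grad                                         # Eq. (4)
    return l_f + lam1 * l_i + lam2 * l_d                                   # Eq. (1)
```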

Training data generation

The transformation was in 6-dimensional space, while the data size was relatively small, so network training was prone to overfitting. Training dataset generation was therefore necessary; it has also been employed in projective 3D–2D image registration25,40, where dense sampling is commonly used. Unlike in projective registration, the transformation parameters, or their distribution, for sliced 3D–2D registration were uncontrollable and unavailable, and we needed to estimate them. First, we verified reference planes with clinicians (Appendix B). The transformations producing those verified alignments were modelled, parameter by parameter, as 2-sigma Gaussian distributions, from which we randomly sampled Nt transformations (Appendix C). The Nt inverse transformations mapped the optimal alignment back to Nt different initial kidney positions. We generated Nt training CT–US pairs by applying the Nt inverse transformations to a reference CT volume (Appendix D). Our training data generation scheme helped to obtain realistic training datasets and to overcome overfitting. In the future, if sufficient clinical data are available, data generation can be omitted.
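The sampling procedure could look like the following NumPy sketch. `apply_rigid` is a hypothetical warping helper (e.g. a resampler that takes the six rigid parameters and an `inverse` flag), and the 2-sigma clipping mirrors the distribution modelling described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_transforms(mu, sigma, n_samples):
    """mu, sigma: length-6 arrays of the fitted Gaussian parameters
    (3 rotations, 3 translations); draws are clipped to the 2-sigma range."""
    draws = rng.normal(mu, sigma, size=(n_samples, 6))
    return np.clip(draws, mu - 2.0 * sigma, mu + 2.0 * sigma)

def generate_training_pairs(ct_ref, us_window, mu, sigma, n_samples, apply_rigid):
    """Move the reference CT away from its verified alignment by the inverse of
    each sampled transform; the sampled parameters then recover the alignment."""
    pairs = []
    for params in sample_transforms(mu, sigma, n_samples):
        ct_moved = apply_rigid(ct_ref, params, inverse=True)   # hypothetical warper
        pairs.append((ct_moved, us_window, params))
    return pairs
```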

Learning strategy

Based on the generated training datasets, unsupervised learning was employed to pretrain a registration model. Our target U/S images are respiration sequences consisting of periodic breathing cycles, so it was possible to use the on-site patient-specific data to further improve model performance. We proposed a one-cycle transfer learning strategy, which refined the pretrained model on the first respiration cycle via transfer learning for two epochs and then inferred the transformations for subsequent frames. On-site patient-specific training without a pretrained model was infeasible because the convergence time was too long, and a long operation preparation time is impractical in clinical applications. We therefore proposed using transfer learning to refine the model, saving time and improving performance (Appendix F).
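In code, the strategy amounts to a short fine-tuning loop. The sketch below is a schematic assumption about how a pretrained Keras model, loss and datasets could be wired together, not the authors' implementation.

```python
import tensorflow as tf

def one_cycle_transfer(model, loss_fn, first_cycle_ds, remaining_ds, lr=3e-4):
    """Refine a pretrained registration model on the patient's first respiration
    cycle for two epochs, then infer transformations for the remaining windows."""
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr), loss=loss_fn)
    model.fit(first_cycle_ds, epochs=2)      # patient-specific adaptation
    return model.predict(remaining_ds)       # rigid parameters for subsequent frames
```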

Evaluation metrics

The Hausdorff distance (HD) and mean contour distance (MCD) between the outlines of the kidney on the CT and U/S images were used to evaluate whether the CT–US pair was well aligned and were calculated as

$${D}_{Hausdorff}\left(U,C\right)=\mathrm{max}\left\{\underset{c\in C}{\mathit{max}}\left\{\underset{u\in U}{\mathit{min}}\left\{d\left(c,u\right)\right\}\right\},\underset{u\in U}{\mathit{max}}\left\{\underset{c\in C}{\mathit{min}}\left\{d\left(u,c\right)\right\}\right\}\right\},\,\,{D}_{mean}=\underset{u\in U, c\in C}{\mathrm{mean}}\{d\left(u,c\right)\}.$$

\(d\left(u,c\right)\) is the absolute value on the distance map, and c and u are contour points on the CT and U/S images, respectively. HD and MCD were measured in millimetres. The CT to U/S (CT–US) distance was calculated between the kidney boundaries on the CT and U/S images. The CT to CT (CT–CT) distance was calculated between the kidney contours on the resulting CT plane and the reference CT plane.
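A small SciPy sketch of both measures on two contour point sets follows; the symmetric averaging used for the MCD is our assumption about the implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def hd_and_mcd(contour_ct, contour_us):
    """Hausdorff distance and mean contour distance (mm) between kidney contour
    point sets (N x 2 arrays of in-plane coordinates in mm)."""
    d = cdist(contour_ct, contour_us)                          # pairwise Euclidean distances
    hd = max(d.min(axis=1).max(), d.min(axis=0).max())         # directed maxima, then max
    mcd = 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())  # symmetric mean nearest distance
    return hd, mcd
```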

Ethics declarations

The in-house datasets were collected from Fifth Affiliated Hospital Guangzhou Medical University and approved by the Institutional Review Board on August 28, 2020, with protocol number L2020-24.

Experimental data

Clinical datasets

The datasets consisted of public KiTS1941 datasets and in-house datasets. The public datasets consisted of 210 corticomedullary phase (CMP) CT images. The in-house datasets were collected from the Fifth Affiliated Hospital of Guangzhou Medical University and approved by the Institutional Review Board on August 28, 2020, with protocol number L2020-24. The in-house data were collected consecutively from January to May 2021 and consisted of 132 U/S image sequences (more than 30K images) from 31 patients, 39 multiple-phase CT images from 24 patients, and 25 pairs of CT volumes and U/S sequences from 25 patients. All images were de-identified. The CT scans were acquired on one 64-slice scanner (GE OPTIMA CT600), using a standard four-phase contrast-enhanced imaging protocol with a slice thickness of 0.6–5.0 mm, a matrix of 512 × 512 pixels and an in-plane resolution of 0.625–0.916 mm. CMP scanning was performed when 180 HU was detected at the region of interest within the abdominal aorta. The nephrographic phase (NP) was acquired 28 s post contrast, and the excretory phase (EP) was acquired 10–30 min post contrast. The U/S datasets were acquired on a GE Versana Active™ ultrasound system, with a matrix of 1132 × 852 pixels and an in-plane resolution of 0.22–0.29 mm. The U/S sequences were scanned at 17–22 frames per second and had 58 ± 14 U/S images per respiration cycle.

Generated training datasets

Reference planes

For each U/S frame, a manually selected reference CT cutting plane with overlapping kidney boundaries was displayed side by side with the frame to four experienced clinicians, who unanimously verified whether the two showed the same cutting plane of the kidney. The verified planes constructed our basic reference set, from which we extended the training set. For 22 out of 25 pairs of CT volumes and U/S sequences, the clinicians verified that the 3D cutting plane in the CT was the same as that in the U/S images.

Transformation parameter estimation

The six parameters of the transformations that produced the reference CT planes from the initial positions were modelled as 2-sigma Gaussian distributions. The clinical datasets were split into training, validation and testing sets at a ratio of 7:1:2 by patient, and the images were normalised to zero mean and unit variance. Fivefold cross-validation was used to evaluate the model's performance: the patients' data were split into five groups, one group was used as the testing dataset, and the remaining groups formed the train/validation dataset, which was split randomly. The performance of the model was averaged over the five runs. Data generation was applied only to the training datasets.
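A minimal sketch of the patient-wise split is given below; the 1/8 validation fraction reproduces the 7:1 train/validation ratio, and the grouping logic is our assumption.

```python
import numpy as np

rng = np.random.default_rng(42)

def fivefold_patient_splits(patient_ids, val_fraction=1.0 / 8.0):
    """Yield (train, val, test) patient sets for five folds, splitting by patient."""
    patients = rng.permutation(np.unique(patient_ids))
    folds = np.array_split(patients, 5)
    for k in range(5):
        test = set(folds[k])
        rest = np.array([p for p in patients if p not in test])
        n_val = max(1, int(round(val_fraction * len(rest))))
        val = set(rng.permutation(rest)[:n_val])
        train = set(rest) - val
        yield train, val, test
```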

Experiments and results

Evaluation on feature network

Our network was implemented in TensorFlow, and training was performed on a workstation with dual Nvidia Quadro RTX 5000 GPUs (16 GB each) and 256 GB of CPU memory. The learning rate was 1 × 10–4. The model was evaluated by

$$DICE=\frac{2 \left|Y\cap {Y}^{*}\right|}{\left|Y\right|+|{Y}^{*}|},\,\, Sensitivity=\frac{\left|Y\cap {Y}^{*}\right|}{\left|Y\right|}, \,\, Specificity=\frac{\left|Y\cap {Y}^{*}\right|}{|{Y}^{*}|}, \quad \mathrm{ where }\,{Y}^{*}\,\mathrm{ is \,prediction\, and\, }Y\,\mathrm{ is\, ground \,truth}.$$
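These three measures reduce to simple set counts on binary masks; the NumPy sketch below is our own, following the formulas above.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Dice, sensitivity and specificity as defined above, with pred = Y* (the
    predicted binary kidney mask) and gt = Y (the ground truth mask)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = 2.0 * inter / (gt.sum() + pred.sum())
    sensitivity = inter / gt.sum()
    specificity = inter / pred.sum()    # |Y ∩ Y*| / |Y*|, as given in the text
    return dice, sensitivity, specificity
```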

The model resulted in an average Dice coefficient of 96.88% on kidney segmentation in the plain phase CT images (Table 1). The network resulted in an average Dice coefficient of 96.39% on U/S images.

Table 1 Feature network performance on kidney segmentation in CT and U/S images.

Evaluation on registration network

The generated datasets were used only for training. The more datasets generated, the smaller the MCD achieved. Here, we generated approximately 12,000 training data pairs, 10 times the size of the clinical datasets, to pretrain the registration network. We conducted an ablation study of the method (Table 2). All the network components contributed to performance improvement. The hierarchical transformation regression at the decoder layers contributed more than the MIND loss, and one-cycle transfer learning contributed the most. The uncertainty estimation is given in Appendix H.

Table 2 CT–US kidney registration performance in the ablation study with individual components removed.

Result comparison

As our deep learning-based pipeline is the first for 3DCT-2DUS kidney registration, we can only compare our method with SOTAs14,32,33 in terms of the registration module (Table 3). We replaced our encoder-decoder hierarchical registration subnet with the encoder structures or the transformer-based registration module; that is, the input to each SOTA method was the CT–US feature pairs after global alignment. VoxelMorph, ConvNet-affine14, VTN-affine32, and C2FViT33 were trained on the general training datasets until convergence. VoxelMorph used the same affine block as our registration network to obtain rigid transformation parameters. ConvNet-affine and VTN-affine were implemented based on their papers, with rigid transformation employed.

Table 3 Registration performance comparison with SOTAs using a one-step learning strategy (without one-cycle transfer training applied) and using a two-step learning strategy (with one-cycle transfer learning applied).

Ours, with two-step training, was superior to all SOTAs when measured by HD and MCD (Table 3). The 2D CT–US distances were smaller than the 3D CT–CT distances because the distance in one dimension was overlooked. With a one-step training strategy, our pretrained model performed better than VoxelMorph, with a smaller 3D CT–CT distance and a larger 2D CT–US distance. This result indicated that the hierarchical structure prevented convergence to a local optimum. In addition, our pretrained model with hierarchical structure converged at approximately 20–50 epochs during training, much faster than VoxelMorph, which converged at approximately 200–300 epochs. The transformer-based method, C2FViT, performed better than the CNN-based methods with a one-step training strategy.

We compared our method with the SOTAs using a two-step training strategy, all with one-cycle transfer learning applied (Table 3). Ours learned the most from transfer learning to improve performance. ConvNet-affine, a Siamese encoder structure, was second only to ours, and C2FViT learned the least. The CNN-based methods performed better than the transformer-based method after one-cycle transfer learning for two epochs. Our method performed the best. Example results are given in Appendix I.

We divided the CT–US image pairs into two groups according to the transformations. Group A comprised datasets with small transformations, rotations of 10.37° ± 2.24° and translations of 3.69 ± 0.95 mm. Group B comprised datasets with large transformations, rotations of 24.72° ± 2.28° and translations of 5.04 ± 1.20 mm. Rotation and translation were calculated separately as the L2-norm of the components in the x, y, and z directions. All deep learning-based methods used the two-step learning strategy. The performance for the two groups was measured (Table 4). Our method performed best on both Group A and Group B and was robust to large transformations.

Table 4 Registration performance on datasets with small and large transformations.

Example testing sequences of CT and U/S images were displayed in RAI orientation (Fig. 2). U/S frames and the resulting CT planes in sagittal view were stacked along the R–L orientation. The axial and coronal views provided dynamic information, with the R–L axis representing time: the coronal view displayed the up-and-down motion of the kidney during respiration, and the axial view displayed its back-and-forth motion.

Figure 2

Example registration results, from left to right, the images are CT axial view, CT sagittal view, CT coronal view, U/S axial view, U/S sagittal view, U/S coronal view. Rows 1 and 3 visualized internal alignment of renal stone (crosshairs in the sagittal view). Rows 2 and 4 visualized slice alignment on inspiration and expiration (crosshairs in the coronal view).

Finding corresponding landmarks between CT and U/S images was difficult. We identified a special case with a small lesion visible on both the U/S and CT images during breathing (Fig. 3). The lesion was approximately 7 mm in diameter and located on the left kidney. It was interesting to know the tumour distance after registration for this special case. If we assume that the centre of the lesion observed on the U/S scan was also its centroid on the CT volume, the tumour centre distance was 3.39 mm after registration with our method, compared with 4.38 mm, 6.04 mm, 4.16 mm, and 3.91 mm for VoxelMorph, C2FViT, VTN-affine, and ConvNet-affine, respectively. The distance was large because the tumour was near the kidney surface.

Figure 3

Example image pair to visualize target registration error (TRE) on (a) a small lesion in registered CT image and U/S image, (b) small lesion’s motion in respiration.

Performance comparison with 2D/3D registration methods

Having compared unsupervised affine registrations, in this section we present the performance of existing 2D/3D registration methods (Table 5). SOTA methods address deep 2D/3D registration using supervised CNN regression6,25,27,30, supervised reinforcement learning28,29, or supervised CNN segmentation cum conventional plane regression31; all of these are supervised. Our method used an unsupervised end-to-end convolutional neural network that output both the transformation parameters and the resulting cutting plane, was trained without ground truth transformations, and achieved performance comparable to that of the supervised methods.

Table 5 Registration performance of existing 2D/3D deep learning methods.

Discussion

It is challenging to obtain perfect spatial correspondence for CT and U/S images. We measured the distance on the reference CT–US pairs; the average HD was 3.57 ± 1.11 mm, and the MCD was 0.79 ± 0.22 mm, which was approximately one pixel in size. The nonzero distance may be due to imperfect contour extraction, rib occlusion, or different patient postures.

The pretrained model benefits from training data generation because the transformation parameters lie in a high-dimensional space while the data size is small. However, increasing the amount of generated data alone cannot improve model performance beyond a certain point; we proposed using transfer learning to achieve this goal.

Transfer learning may require extra preparation/training time before application, but with a pretrained model the preparation can be shortened significantly. For example, training the registration model from scratch using one-cycle learning took approximately 46 min to converge, whereas only 2–3 min were needed when the model was refined via transfer learning. Thus, a good pretrained model was essential for practical applications.

Even though optimal alignments verified by clinicians were available, we did not change our unsupervised learning to supervised learning. First, the data size was quite small, and much effort would be required to obtain more data. Second, it was not desirable to limit our model from processing versatile pairing data when available. The generated training data had a regularisation effect, which benefited the cost function optimisation, reduced overfitting and improved model generalisation. We believe that in the future it will be possible to collect more paired datasets to overcome the overfitting resulting from the dataset's limitations.

Wein42 proposed a conventional method for 3DCT-3DUS kidney registration, in which the U/S images were acquired using a tracked probe during breath-hold on inspiration and optimisation was performed by exhaustive search of the translation space. In this work, we aimed at deep learning-based 3DCT-2DUS kidney registration during breathing, which performs deep model inference.

Conclusions

To the best of our knowledge, this paper presented the first deep learning pipeline for sliced 3DCT-2DUS kidney registration. All difficulties in kidney registration during free breathing were addressed via novel network structures and training strategies. Comprehensive experiments showed that our proposed methodology performed well (Supplementary Information: Appendix).