Deep action learning enables robust 3D segmentation of body organs in various CT and MRI images

In this study, we propose a novel point cloud based 3D registration and segmentation framework using reinforcement learning. An artificial agent, implemented as distinct actor and value networks, is trained to predict the optimal piece-wise linear transformation of a point cloud for the joint tasks of registration and segmentation. The actor network estimates a set of plausible actions and the value network aims to select the optimal action for the current observation. Point-wise features that comprise spatial positions (and surface normal vectors in the case of structured meshes), and their corresponding image features, are used to encode the observation and represent the underlying 3D volume. The actor and value networks are applied iteratively to estimate a sequence of transformations that enable accurate delineation of object boundaries. The proposed approach was extensively evaluated in both segmentation and registration tasks using a variety of challenging clinical datasets. Our method has fewer trainable parameters and lower computational complexity compared to the 3D U-Net, and it is independent of the volume resolution. We show that the proposed method is applicable to mono- and multi-modal segmentation tasks, achieving significant improvements over the state-of-the-art for the latter. The flexibility of the proposed framework is further demonstrated for a multi-modal registration application. As we learn to predict actions rather than a target, the proposed method is more robust compared to the 3D U-Net when dealing with previously unseen datasets, acquired using different protocols or modalities. As a result, the proposed method provides a promising multi-purpose segmentation and registration framework, particularly in the context of image-guided interventions.

Reinforcement learning in medical image processing. In recent years, there have been significant improvements in the capabilities of RL-based systems. With the introduction of deep Q-networks (DQN) 26 , separate policy and value networks 27 , and actor-critic networks 28 , RL has shown great success in acting autonomously in simulated, constrained environments, e.g., chess and video games. Furthermore, research shows that RL also has great potential in medical image processing, e.g., in anatomical landmark detection 29,30 , rigid image registration 31 , and non-rigid image registration 24 . A comprehensive review of RL is beyond the scope of this paper; however, we introduce the basic concepts of RL relevant to this work and its adaptation to our application.
Finite Markov decision processes. The formulation of RL is closely related to the finite Markov decision process (MDP). The finite MDP comprises a set of possible states s t ∈ S t , and a set of possible actions a t ∈ A t at time step t. Here, S t denotes the set of all possible states, while A t refers to the set of all actions. In addition, for each action a t , we have an associated probability π t (a t |s t ) and an associated immediate reward r t (a t |s t ) ∈ R t , with R t denoting the set of rewards for all possible actions A t . At each time step t, the environment updates the state s t . This yields a new immediate reward r t for each action a t ∈ A t . The goal of RL is to train an artificial agent such that it learns an optimal policy π* for maximizing the expected cumulative reward r t over the time steps t. The artificial agent determines in each step the optimal action a*_t ∈ A t in state s t by selecting the action with the highest probability π t (a t |s t ) . In other words, future rewards are taken into account when deciding on an optimal action a*_t . This can be expressed using the optimal action-value (Q-value) function,

Q*(s_t, a_t) = E[ r_t + γ max_{a_{t+1}} Q*(s_{t+1}, a_{t+1}) ].    (1)

In this equation, the scalar γ ∈ (0, 1] denotes the discount factor. As the name suggests, in a finite MDP the sets of possible actions A t and states S t are finite. We use the symbol * to denote the optimal value of a given variable. An artificial agent with hyperparameters θ is trained to approximate the Q*-value, by selecting the action a*_t associated with the highest Q-value at each time step t. This is referred to as Q-learning, and can be formulated as,

θ* = arg min_θ E[ ( r_t + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; θ) − Q(s_t, a_t; θ) )² ].    (2)

As the goal of reinforcement learning is to find a*_t for each time step t, the agent effectively predicts the trajectory of the actions undertaken. Such an optimization problem is highly non-convex, as different paths could lead to the same destination.
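As a toy illustration of the tabular Q-learning update described above, the following sketch trains an agent on a minimal three-state chain environment. The environment (action 1 moves right, a reward of 1 is received on reaching the last state) and all hyperparameter values are illustrative assumptions, not from the paper.

```python
import numpy as np

def q_learning(n_states=3, n_actions=2, gamma=0.9, alpha=0.5, episodes=200, seed=0):
    """Tabular Q-learning: Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = 0
        for _ in range(10):
            # epsilon-greedy action selection
            a = int(rng.integers(n_actions)) if rng.random() < 0.2 else int(Q[s].argmax())
            # toy dynamics: action 1 moves one state to the right, action 0 stays
            s_next = min(s + a, n_states - 1)
            r = 1.0 if s_next == n_states - 1 else 0.0  # reward at the goal state
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q

Q = q_learning()
```

After training, moving right from the start state carries a higher Q-value than staying, reflecting the discounted future reward.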
Continuous action space. When applying RL in medical image processing tasks, one of the major challenges is the action space. In a finite MDP, the possible actions in any step belong to the discrete finite set A t . This is, however, not the case for most image processing tasks. In the case of a 3D rigid registration application, the possible actions, e.g., rigid transformations a t ∈ R 6 in any given state, lie in a continuous space.
To deal with a continuous action space, we can either convert the problem to a finite MDP, or work in the continuous action space directly. Discretization 29,31 or dimensionality reduction 24 can be used to construct a finite action set, thereby converting the problem to a finite MDP. Here, the policy is learned using the deep Q-learning approach. The agent predicts the best action a t ∈ A t given the current state s t . Alternatively, we could work with the continuous action space directly. To this end, the agent would need to be implemented as a parameterized actor. The agent would then estimate the action directly for each given state 25,32,33 .
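The discretization strategy mentioned above can be sketched as follows: the six continuous rigid-transformation parameters are quantized into a finite set of unit steps, one per parameter and direction, yielding a finite MDP action set. The step sizes (1 mm translation, 1 degree rotation) are illustrative assumptions.

```python
import numpy as np

def discrete_action_set(t_step=1.0, r_step=1.0):
    """Finite action set for a 6-DoF rigid transform: one +/- step per parameter."""
    actions = []
    for axis in range(6):              # axes 0-2: translation, axes 3-5: rotation
        for sign in (-1.0, 1.0):
            a = np.zeros(6)
            a[axis] = sign * (t_step if axis < 3 else r_step)
            actions.append(a)
    return actions

A = discrete_action_set()              # 12 discrete actions in total
```

With such a set, the agent can be trained with standard deep Q-learning; the alternative discussed in the text is to let a parameterized actor emit the continuous parameters directly.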
State and observation. Another important difference from the finite MDP is that the state in medical imaging applications is defined by the underlying 2D images or 3D volumes. In the implementation of an agent, however, the ROIs extracted from the 2D images or 3D volume are considered the current state. Strictly speaking, this is a partial observation o t of the state. The difference between the state and its observation does not change the general pipeline, as deep RL enables networks to be trained using either states or observations. The partial observation effect, however, imposes implicit restrictions on both the optimal policy π* and the action a*_t possible for any given observation o t . In theory, the agent can only estimate a sub-optimal policy π t or action a t due to the missing information. This may be addressed by designing a suitable multi-scale strategy 30 , wherein the RL algorithm is formulated to estimate the scale level of the observation.

Contribution.
The main contribution of this paper is a novel point cloud based registration and segmentation method using reinforcement learning. An artificial agent is trained to predict a sequence of actions (in our case transformations) to align a point cloud estimate to the ground truth, given the current state. The agent is implemented as an actor-value network, where the actor estimates possible actions, such as affine or non-rigid transformations for the point cloud, as a finite set A t . The value network, on the other hand, determines the best action. At each time step, the agent predicts both the action and the associated expected action value (Q-value) based on the observed current state. The proposed approach offers several advantages over existing segmentation techniques: (1) compared to conventional model-based segmentation methods, the proposed method does not require a computationally expensive energy function to guide and predict the deformation of the point cloud (or vertices) in the test phase; (2) in contrast to CNN based methods, the proposed approach uses partial observations instead of the complete state to estimate the deformation, which contributes to a significant reduction in computational complexity; (3) unlike existing RL-based registration techniques 24,25 , we estimate the action directly for the whole point cloud in a continuous space, and we do not require a predefined deformation model. The separation of the actor and value networks facilitates the estimation of the trajectory and helps to incorporate the future reward. Unlike actor-critic methods, where the actions are taken from a predefined finite set and where the statistical distribution of the policy is learned during the training process, the proposed method predicts actions from a continuous space and determines the policy using Q-learning.

Method
We refer to the current distribution of vertices at time step t placed inside a 3D volume as estimates, denoted by V t ∈ R N×3 , where N denotes the number of vertices. The matrix G ∈ R M×3 , referred to as the ground truth, denotes the corresponding target distribution with M vertices. The goal of a model-based segmentation or registration algorithm is to find a transformation function T such that the distance between the transformed estimates and the ground truth is minimized. The function dist(·, ·) calculates the average symmetric distance between two point clouds.
We can reformulate this minimization problem within a reinforcement learning framework as an infinite MDP, where the immediate reward r t denotes the decrease in distance between the estimates V t and the ground truth G . The action a t denotes the possible rigid, affine, or non-rigid transformation of the vertices. To efficiently handle the high dimensional continuous action space, we use actor networks µ a (o t ; θ a ) to map the observation o t in each step t into a finite set of possible actions, A t , given hyperparameters θ a . Subsequently, value networks µ v (o t , a t ; θ v ) , with θ v denoting the hyperparameters, are used to predict the associated Q-values, and thereby select the policy π t accordingly. Therefore, we can formulate our Q-value function as shown in Eq. (5) and use reinforcement learning to solve this optimization problem.
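The distance-reduction reward described above can be sketched as follows, with a brute-force average symmetric point-cloud distance standing in for dist(·, ·); the coordinates are illustrative.

```python
import numpy as np

def avg_symmetric_distance(V, G):
    """Average symmetric distance between two point clouds (brute-force nearest neighbour)."""
    d = np.linalg.norm(V[:, None, :] - G[None, :, :], axis=-1)  # pairwise distances
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

def reward(V_t, V_next, G):
    """Immediate reward: reduction in distance to the ground truth after applying a_t."""
    return avg_symmetric_distance(V_t, G) - avg_symmetric_distance(V_next, G)

G = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])   # ground truth vertices
V = G + np.array([0.5, 0.0, 0.0])                  # estimate shifted 0.5 mm in x
V_next = G + np.array([0.1, 0.0, 0.0])             # estimate after a corrective action
r = reward(V, V_next, G)                           # positive: the action reduced the distance
```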
An illustration of the overall pipeline is shown in Fig. 1. For current estimates V t at the time step t, we extract the observation encoding o t . Using this observation o t , the actor networks predict a finite set of actions a t ∈ A t .
Using both the observation o t and the action set a t ∈ A t , the value networks predict the associated Q-values and determine the policy. Each step is explained in detail in the subsequent sections.
Observation encoding. The observation encoding o t is generated by sampling the underlying environment based on the current estimates V t . Depending on the target application, an optional pre-processing step (apart from data normalization) may be applied to the underlying volume. This encoding also depends on whether we are working with a surface mesh or a point cloud. In the case of 3D surface meshes, we calculate the normal vector n i,t for each vertex v i,t ∈ V t . We use an equidistant sampling pattern along the normal vector, and extract pixel values from the environment. When working with unstructured 3D point clouds, we sample a sub-volume centered on each point v i,t ∈ V t , and extract the associated voxel values from the environment. In both cases, the sampled values are vectorized and normalized to zero mean and unit standard deviation to generate the final image feature. To deal with the various voxel spacings in different data, the sampling is done in millimetres. We also include the spatial positions of the vertices of the estimate V t (and the corresponding normal vectors n i,t if available) to describe the distribution of the mesh vertices. These are normalized to the unit sphere, and form the point feature. The observation encoding o t is the concatenation of these image features and point features. An example of observation encoding is shown in Fig. 1.
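A minimal sketch of this encoding for the surface-mesh case follows: gray values are sampled at equidistant points along each vertex normal, normalized to zero mean and unit standard deviation, and concatenated with the vertex positions scaled into the unit sphere. Nearest-neighbour voxel lookup, the sample count, and the isotropic spacing are simplifying assumptions for illustration.

```python
import numpy as np

def encode_observation(volume, vertices, normals, n_samples=5, step_mm=1.0, spacing=1.0):
    """Concatenate point features (unit-sphere positions) with image features
    (gray values sampled along each vertex normal)."""
    offsets = (np.arange(n_samples) - n_samples // 2) * step_mm  # centred sampling pattern
    image_feat = []
    for v, n in zip(vertices, normals):
        pts = (v[None, :] + offsets[:, None] * n[None, :]) / spacing  # mm -> voxel coords
        idx = np.clip(np.rint(pts).astype(int), 0, np.array(volume.shape) - 1)
        image_feat.append(volume[idx[:, 0], idx[:, 1], idx[:, 2]])
    image_feat = np.asarray(image_feat, dtype=float)
    image_feat = (image_feat - image_feat.mean()) / (image_feat.std() + 1e-8)
    point_feat = vertices / (np.linalg.norm(vertices, axis=1).max() + 1e-8)  # unit sphere
    return np.concatenate([point_feat, image_feat], axis=1)

vol = np.random.default_rng(0).random((32, 32, 32))
verts = np.array([[16.0, 16.0, 16.0], [10.0, 10.0, 10.0]])
norms = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
obs = encode_observation(vol, verts, norms)      # shape: (n_vertices, 3 + n_samples)
```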
Actor networks. The actor network maps the observation to a possible action set A t . We define the action set A t as the union of a rigid translation action t t = [t x , t y , t z ] T , a scaling action s t = [s x , s y , s z ] T , a rotation action R t (r x , r y , r z ) , and a non-rigid deformation D t = [d 1 , d 2 , · · · d N ] T ∈ R N×3 . Using imitation learning, an associated actor network µ a is trained to predict an action a t given a current observation o t . In other words, the actor network is trained to imitate the behavior of a demonstrator for a given observation. In this context, the demonstrator provides the optimal rigid or non-rigid action. In the training phase, the ground truth G , describing the correct vertex positions, is given. Therefore, we can calculate the optimal action a*_t by minimizing Eqs. (6)-(9). The operation ∘ denotes element-wise multiplication, and N (v i,t ) is the neighbourhood of the vertex v i,t . In case v i,t belongs to a mesh, N is defined by the adjacent faces of the mesh. When dealing with point clouds, the k-nearest neighbour algorithm is used to determine the neighbourhood N . The regularization term in Eq. (6) ensures that the deformation vector d i,t for vertex v i,t is similar to the neighboring deformation vectors d j,t for vertices v j,t ∈ N (v i,t ) . This ensures the smoothness of the overall deformation.

Figure 1. Illustration of the general MDP pipeline using segmentation as the application of interest. The red arrow is for the training phase only, and the blue arrows are shared between the training and testing phases. In this example, we segment and register a liver 3D surface mesh to the CT image. The scalar N denotes the number of vertices, the point feature comprises the normalized point coordinates, and the image feature shows the gray values sampled along the normal directions at each point.
The actor network estimates the affine and non-rigid actions given the observation, and the value network calculates the associated Q-values in each case. In the example shown, the Q-value for the non-rigid action is larger; therefore, the non-rigid action is performed.

Because the observation o t encodes only the neighbourhood, we suffer from a partial observation effect. To facilitate training of the actor, we normalize the output of the demonstrator. The action a*_t ∈ A*_t is either normalized to a unit sphere ( D* and t* ) or to a specific range ( s* and R* ), and denoted as â*_t . In this way, the actor network predicts reasonable actions moving along the same direction as those of the demonstrator. At the same time, the predicted actions do not overstep the range of the observation encodings. Subsequently, we introduce an acceleration term β ∈ R+ during training, and define the target actions as β â*_t . The action loss function L a can thus be defined as

L_a = || µ_a(o_t; θ_a) − β â*_t ||²_2 + λ_1 ( || µ_a(o_t; θ_a) || − β )² .

In this equation, the first term denotes the L 2 loss computed using the predicted action, and the second term regularizes the norm of the action to β . We also introduce an auxiliary loss that computes the distance between the predicted (V t ) and ground truth (G) vertices as,

L_d = λ_2 dist(V_t, G) .

The scalars λ_1 and λ_2 are Lagrangian multipliers. The combined loss, evaluated as L g a = L a + L d , was used for training. Furthermore, we adopted a multi-task training strategy for the actor network µ a , based on the rigid and non-rigid actions to be estimated. This improved network stability during training and reduced the overall computational complexity. The network architecture is depicted in Fig. 2.
Value network. In order to determine the best action a*_t in the action set A t predicted by the actor network µ a , such that the sum of the current and expected cumulative future reward is maximized, a value network is trained to estimate the Q-value for each action a t . To this end, we use the Q-learning framework and define the value network as µ v (o t , a t ; θ v ) , where θ v represents the associated hyperparameters. The instant reward at step t is expressed as

r_t = dist(V_t, G) − dist(V̄_t, G) ,

where v̄ i,t ∈ V̄ t denotes the current estimate of v i,t after applying the predicted action a t . This is the relative reduction of the distance between the estimates and the ground truth. The optimal action-value function Q* is defined using Eq. (1), and the loss function can be found in Eq. (2). To facilitate training of the value network, we used a double deep Q-network (DDQN) 34 with the target

y_t = r_t + γ µ_v(o_{t+1}, arg max_a µ_v(o_{t+1}, a; θ_v); θ⁻) .    (13)

In Eq. (13), θ⁻ represents the hyperparameters of the target network. Both networks share the same architecture, but θ⁻ is updated slowly, i.e., once every few iterations, while θ v is updated every iteration. This strategy contributes to the overall convergence of the networks during training. We determine the boundary condition of the Q-value to facilitate training of the value and target networks. The action a t is estimated using the actor, and the magnitude of the action is constrained by the normalization and the acceleration factor β . Accordingly, we approximate the upper bound of the reward function as r⁺_t ≈ β . Furthermore, we approximate the average action loss as L̄ = ½ L g a . This in turn enables us to approximate the lower bound of the reward function, r⁻_t ≈ −√L̄ > −β . As the reward function is bounded, we determine the boundary condition of the Q-value using the limit of the geometric series,

−β/(1−γ) < Q ≤ min( d_max , β/(1−γ) ).    (14)

Here, d_max denotes the maximal distance between the estimates and the ground truth. The value of d_max can be calculated according to the data augmentation strategy employed.
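The DDQN target computation can be sketched as follows: the online value network selects the best next action, while the slowly updated target network (parameters θ⁻) evaluates it. The two networks are stand-in arrays of Q-values here; in the paper they share the actor-value architecture.

```python
import numpy as np

def ddqn_target(r_t, q_online_next, q_target_next, gamma=0.9, terminal=False):
    """Double DQN target: action selected by the online net, evaluated by the target net."""
    if terminal:
        return r_t
    a_star = int(np.argmax(q_online_next))       # action selection: online network
    return r_t + gamma * q_target_next[a_star]   # action evaluation: target network

# Illustrative Q-values for two candidate next actions
y = ddqn_target(1.0, q_online_next=np.array([0.2, 0.8]), q_target_next=np.array([0.5, 0.4]))
y_term = ddqn_target(0.5, np.array([1.0]), np.array([9.0]), terminal=True)
```

Decoupling selection from evaluation in this way reduces the over-estimation bias of plain Q-learning.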
Although we have shown in Eq. (14) that the Q-value is bounded, it is not practical to let the number of steps t → ∞ during inference. This is why we employ two termination criteria for the action sequence chosen by the value network. First, we introduce t max as the maximal number of steps in a sequence. This defines the maximum run-time of the algorithm. The second criterion is that the Q-value predicted by the value network should be greater than Q min . If the predicted value is less than Q min , we consider the deformation to be insignificant, and terminate the sequence.
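The inference loop with both termination criteria can be sketched as follows; the actor, value function, and environment are toy stubs standing in for the trained networks and the volume.

```python
def run_episode(actor, value, apply_action, obs, t_max=50, q_min=0.05):
    """Iterate actor/value selection until t_max steps or the best Q-value drops below q_min."""
    steps = 0
    for steps in range(t_max):
        actions = actor(obs)                         # finite candidate action set A_t
        q_vals = [value(obs, a) for a in actions]    # Q-value per candidate action
        best = max(range(len(actions)), key=lambda i: q_vals[i])
        if q_vals[best] < q_min:                     # deformation deemed insignificant
            break
        obs = apply_action(obs, actions[best])
    return obs, steps

# Stub environment: the "observation" is the remaining distance, each action
# removes half of it, and its Q-value equals the action magnitude.
actor = lambda o: [0.5 * o]
value = lambda o, a: a
apply_action = lambda o, a: o - a
final_obs, steps = run_episode(actor, value, apply_action, 1.0)
```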
Data augmentation. We use data augmentation to generate sufficiently rich data for training the actor and value networks. Data augmentation is applied to both the observation encoding o t and the estimates V t . For the observation encoding, Gaussian noise was applied to the image features. Data augmentation on the estimates V t was performed by deforming the ground truth surface meshes G used for training. In order to ensure that the deformations of the ground truth meshes were realistic, we trained a statistical shape model (SSM) of the target structure, by template-to-sample registration using the coherent point drift (CPD) algorithm. This registration is done iteratively, where in the first iteration, a random sample is chosen to be the template. In each iteration, the template-to-sample registration is performed, and the template is updated by calculating the average model at the end of each iteration. We then deformed the ground truth G using the modes of variation of the trained SSM, followed by the application of random affine transformations. Finally, each vertex was randomly jittered using small displacements, and subsequently, the mesh was smoothed. Note that the point correspondence imposed by the SSM is applied only to the generated training data. The calculation of the optimal action a * t of the demonstrator requires no correspondence between the vertices. Therefore, the deformation estimated by the actor remains independent of point correspondences.
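The affine, jitter, and smoothing steps applied after the SSM deformation can be sketched as follows. The perturbation magnitudes are illustrative, and k-nearest-neighbour Laplacian averaging stands in for mesh-adjacency smoothing.

```python
import numpy as np

def augment(V, rng, jitter_mm=0.5, k=3):
    """Random affine transform, per-vertex jitter, and simple neighbourhood smoothing."""
    A = np.eye(3) + 0.05 * rng.standard_normal((3, 3))   # small random affine part
    t = 5.0 * rng.standard_normal(3)                     # random translation (mm)
    V = V @ A.T + t
    V = V + jitter_mm * rng.standard_normal(V.shape)     # per-vertex jitter
    # Laplacian smoothing over k-nearest neighbours (stand-in for mesh adjacency)
    d = np.linalg.norm(V[:, None] - V[None, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]               # indices of k nearest neighbours
    return 0.5 * V + 0.5 * V[nn].mean(axis=1)

rng = np.random.default_rng(1)
V = rng.random((10, 3))
V_aug = augment(V, rng)
```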
Training. The first step towards training the proposed framework is to train the actor network. To this end, the Dataset Aggregation (DAGGER) algorithm 35 was employed to efficiently sample the high dimensional training data space for a sequential process. The pseudo code for the DAGGER algorithm is presented in Alg. 1. In the first iteration of the DAGGER algorithm, the actor is trained using the training data D generated by the demonstrator. Subsequently, new data D π are gathered by applying the actor, and extracting the observations and the optimal action set generated by the demonstrator. These data D π are aggregated with the training dataset D for the next iteration of training. This training scheme is directly applicable to the non-rigid action. When training the network in a multi-task fashion for the rigid actions, we need a criterion to select the action a t with which to further gather training data D π . Here, we randomly select one of the predicted translation, rotation, and scaling actions as a t .
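The DAGGER loop described above can be sketched as follows; the trainer, rollout, and demonstrator are toy stubs, and the aggregation D ← D ∪ D_π is the essential step.

```python
def dagger(train, rollout, demonstrator, n_iters=3):
    """DAGGER: iteratively roll out the current actor, label visited observations
    with the demonstrator's optimal action, and aggregate into the dataset."""
    # iteration 0: data generated by the demonstrator itself
    dataset = [(o, demonstrator(o)) for o in rollout(policy=None)]
    actor = train(dataset)
    for _ in range(n_iters - 1):
        # roll out the learned actor, label visited observations with the demonstrator
        new_data = [(o, demonstrator(o)) for o in rollout(policy=actor)]
        dataset = dataset + new_data          # aggregate: D <- D ∪ D_pi
        actor = train(dataset)
    return actor, dataset

demonstrator = lambda o: -o                   # stub optimal action
rollout = lambda policy: [1.0, 2.0]           # stub trajectory of observations
train = lambda d: ("actor", len(d))           # stub trainer (returns a tagged "model")
actor_out, dataset_out = dagger(train, rollout, demonstrator)
```

Aggregating rather than replacing the dataset is what lets the actor learn to recover from its own mistakes, since states visited under the learned policy enter the training distribution.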
Following this initial training step, the actor and value networks are jointly trained using the Deep Deterministic Policy Gradients (DDPG) approach 28 . The pseudo code of the DDPG algorithm is presented in Alg. 2. Here, the value network (referred to as the critic network in 28 ) is trained using the DDQN algorithm with an ε-greedy action selection scheme. The gradient of the actor can be calculated using the chain rule to compute the derivatives of the associated hyperparameters, and update them accordingly. As our actor network is pre-trained using imitation learning, we update our actor at a much slower rate than the value network.
Network architecture. The proposed architecture for the actor and value networks is shown in Fig. 2. Both networks have shared layers designed to extract high-level features from the encoded observations. This shared-layer architecture was inspired by PointNet 36 , where a dense layer is used to extract vertex-wise local features. These features are merged using a pooling layer, enabling inter-vertex global features to be learned. The extracted global features are permutation invariant, and therefore do not pose any restrictions on the structure of the neighbourhood. The global feature is repeated and aggregated with the local vertex-wise features using skip connections, as done in the U-Net. The output layer of the actor network predicts the non-rigid action as a deformation vector for each vertex, and the affine action in the form of an affine transformation with nine degrees of freedom. Consequently, the output layer is constructed to fit the outputs of the actor, where, for the rigid action, an additional pooling layer is used to encode each action (i.e., rotation, translation, and scaling) as a vector. For the value network, the predicted actions are considered as additional input and concatenated with the global inter-vertex features extracted using the shared network architecture. The outputs of the value network are the Q-values estimated for the associated actions.
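The PointNet-style shared feature extractor can be sketched as follows: a per-vertex dense layer produces local features, max-pooling over vertices yields a permutation-invariant global feature, and the global feature is tiled and concatenated back onto the local features. The weights are random stand-ins for the trained layers.

```python
import numpy as np

def shared_features(obs, W, b):
    """Per-vertex dense layer + ReLU, permutation-invariant max-pool,
    then concatenation of local and tiled global features."""
    local = np.maximum(obs @ W + b, 0.0)              # vertex-wise local features
    global_feat = local.max(axis=0)                   # permutation-invariant pooling
    tiled = np.broadcast_to(global_feat, local.shape) # repeat global feature per vertex
    return np.concatenate([local, tiled], axis=1)     # skip-connection style aggregation

rng = np.random.default_rng(0)
obs = rng.random((6, 8))                              # 6 vertices, 8-dim encoding
W, b = rng.standard_normal((8, 16)), np.zeros(16)
feat = shared_features(obs, W, b)
perm = rng.permutation(6)
feat_perm = shared_features(obs[perm], W, b)          # same global part under permutation
```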

Results
We evaluated our method for two different tasks, namely, segmentation and registration. For the former, the performance of our approach was evaluated for mono- and multi-modal liver segmentation and compared to a state-of-the-art approach, namely the 3D U-Net. In addition, the registration performance of our approach was evaluated for the challenging task of intra-operative brain shift compensation based on multi-modal images. Furthermore, we investigated the effect of the acceleration factor β used when training the value network, assessed the convergence of the algorithm, and analyzed its computational complexity. All computations during both training and evaluation were performed on the same machine using an NVIDIA GTX 1070 GPU.

Mono-modality image segmentation.
In the first set of experiments, evaluating mono-modal segmentation performance, we trained our framework using the Visceral Anatomy3 37 CT datasets, comprising 20 CT volumes with ground truth segmentations for the liver. The dataset was partitioned into training, validation, and test sets, using a split of 16 : 2 : 2. The 3D U-Net was used as a baseline for comparison with our approach. To this end, it was trained and evaluated using the same samples. All volumes were normalized such that their intensity values are within the range of [0, 1]. Due to clinical runtime requirements and hardware restrictions imposed by the target application, the input to the 3D U-Net had to be resampled. To this end, we cropped the original volumes to ROIs centered around the liver, and resampled the ROIs to a resolution of 128 × 128 × 128.
A DICE loss function, as proposed in 15 , was used to train the 3D U-Net. In contrast, for the proposed method we could use the volumes in their original resolution, and no cropping was performed. As described above, we augmented the data when training the actor and value networks. This was done by first training an SSM of the liver using each training sample (via template-to-sample registration using CPD). Subsequently, the modes of the SSM were used to generate realistic deformations of the training samples for augmentation. We sampled the image feature along the normal vector of each vertex at 61 equidistant points spaced 1 mm apart to extract the gray values. During testing, we placed the template, reshaped using random modes of variation, at different positions in the volumes. Displacements in the range of [−40, 40] mm (i.e., approximately half the size of the liver ROI) in each direction were used. We evaluated both methods by performing a ten-fold cross-validation using the Anatomy3 CT dataset. In addition, we tested them on unseen samples from the Visceral Silver Corpus CT dataset 37 . The DICE coefficient was used as the metric to evaluate the performance. When comparing results, we state the average (AVG) ± standard deviation (STD) of this metric. The results for both approaches over test samples from the cross-validation experiments, and the unseen samples from the Visceral Silver Corpus dataset, are summarized in Table 1. They show that the proposed method managed to outperform the 3D U-Net despite comprising significantly fewer parameters (2.7 million as opposed to 16.1 million for the 3D U-Net). Furthermore, the training of the 3D U-Net took roughly 72 h, while the proposed method was trained in approximately 12 h.
Multi-modality segmentation. In the multi-modality segmentation task, we evaluated the proposed method in two clinically relevant scenarios. First, we used the models trained in the previous set of experiments (refer to the "Mono-modality image segmentation" section) for liver segmentation in single-energy CT volumes, and applied them to volumes acquired at different clinical centers, where different scan protocols were followed to acquire the CT data. This included contrast-enhanced CT (CECT) from Visceral Anatomy3, single-energy CT from 3D-IRCADb-01, and dual-energy CT (DECT) data collected by our clinical partner. For all datasets, we only normalized the intensity as described in the "Mono-modality image segmentation" section; no further pre-processing was used. To evaluate the robustness of the algorithm, we did not use transfer learning (i.e., fine-tuning of the networks) at this stage. For each CT volume in each dataset, we evaluated all ten trained models from the ten-fold cross-validation performed in the "Mono-modality image segmentation" section. This was done for both the proposed method and the 3D U-Net. Finally, we documented the AVG ± STD DICE score. The segmentation results achieved by the proposed approach and the 3D U-Net, for all three types of CT datasets tested, are summarized in Table 2. The results show that the DICE score of the proposed method was consistently and significantly higher than what could be achieved with the 3D U-Net.
Second, we evaluated our model using the multi-modal CHAOS challenge dataset. This dataset comprises a total of 40 contrast-enhanced CT (CECT) volumes, 40 T1 in-phase MRI volumes, 40 T1 out-phase MRI volumes, and 40 T1 SPectral Attenuated Inversion Recovery (SPAIR) MRI volumes of the abdomen. We trained our network using all modalities simultaneously, without incorporating modality-specific prior knowledge. This is a challenging task, as the appearance and spatial resolution of the data vary significantly. To deal with the large differences in appearance between MRI and CT, we pre-processed all volumes using a Laplacian of Gaussian (LoG) filter. As depicted in Fig. 3, the LoG-filtered images are relatively similar in appearance, unlike the original images. Each volume was then normalized to zero mean and unit variance. The rest of the pipeline of the proposed method remained the same, despite the differences in spatial resolution of the images, as all steps in the proposed method operate in the world coordinate system using millimetres as the unit instead of the number of voxels. We trained the networks using this dataset, then evaluated the performance by applying a ten-fold cross-validation. In the ten-fold cross-validation, we formed 40 sets, each comprising one volume of each modality. In other words, each set included one CECT volume, one T1 in-phase volume, one T1 out-phase volume, and one T1 SPAIR volume. The ratio between training, validation, and testing for these 40 sets was 32:4:4. As shown in Fig. 3, using the proposed method, a single trained network can segment input volumes of different modalities. The segmentation performance is comparable across modalities despite the differences in appearance and spatial resolution.

Table 1. Mono-modal liver segmentation accuracy, trainable parameters (in millions), and run-time comparison for the Anatomy3 and Silver Corpus datasets. The number of parameters for the proposed method is the sum of the actor and value networks.

Registration.
For the registration application, we picked brain-shift compensation 32 . The network was trained and cross-validated using the Correction of Brainshift with Intra-Operative Ultrasound (CuRIOUS) MICCAI challenge 2018 training dataset. In total, 20 datasets were used. They were partitioned into training, validation, and testing using a split of 16:2:2. In this setting, we encoded the observation by sampling a 7 × 7 × 7 sub-volume centered at each landmark, without any pre-processing step, and relied on the actor to estimate the associated deformation vectors. Since the volume was registered to the pre-operative data via an optical tracker on the ultrasound probe during the intervention 38 , we only used the non-rigid action in this setting. Therefore, we employed only DAGGER to train the actor, and no value network was involved. The actor network architecture was the same as illustrated in Fig. 2, with small variations in the input layer due to the difference in observation encoding. We further changed the base number of feature kernels from 64 to 32. As the number of landmarks varied in the data, we introduced a constant, N, chosen such that the maximum number of landmarks could be accommodated. For data with fewer landmarks, random landmarks were duplicated in the input. This duplication had no impact on the network during both the training and testing phases, as the vertex-wise deformation predicted by the network and the max-pooling value remain the same for duplicated values. To evaluate the performance of the proposed method, we used ten-fold cross-validation and applied the trained networks to the landmarks in the test set. We then calculated the mean target registration error in mm between the estimated and ground truth positions of all landmarks. We used the NiftyReg method 39 as the baseline for the registration.
The registration results are shown in Table 3, and examples of the MRI overlaid with registered iUS frames are depicted in Fig. 4. As we can see, the proposed method compensated well for the deformation introduced by brain shift.
Computational complexity. To analyze the computational complexity, we consider the trained networks as O(1), as their parameters are fixed. Our method has a complexity of O(n · m) for both the actor and value networks, where n denotes the number of vertices and m denotes the number of features, independent of the 3D volume resolution. In contrast, the 3D U-Net and its variations/extensions have a complexity of O(h · w · d), where h denotes the height, w the width, and d the depth of the volume, all in voxels (3D matrix size). Therefore, for an average case with multiple slices (100-300 slices for an abdominal CT scan and an in-plane matrix size of 512 × 512), our method is less computationally demanding. In terms of time complexity, for the processing of an input mesh with 1K vertices, the actor and value networks require approximately 1 GFLOP regardless of the volume size. At the same time, the baseline 3D U-Net requires 70 GFLOPs for an input volume of size 128 × 128 × 128.
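A back-of-envelope comparison of the input sizes illustrates why the point cloud formulation is independent of the volume resolution: the network sees n · m feature values regardless of the h · w · d voxel grid. The vertex and feature counts below are illustrative.

```python
# Input-size comparison between the point cloud formulation, O(n*m),
# and a voxel-grid network, O(h*w*d).
n, m = 1000, 64                 # vertices and per-vertex features (illustrative)
h = w = 512                     # typical abdominal CT in-plane matrix (from the text)
d = 200                         # slices, within the 100-300 range quoted in the text
point_input = n * m             # values seen by the point cloud network
voxel_input = h * w * d         # values seen by a voxel-grid network
ratio = voxel_input / point_input
```

For these numbers, the voxel grid carries roughly three orders of magnitude more input values than the point cloud encoding, and the gap grows with volume resolution while the point cloud side stays fixed.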
Speed-up factor β. We analyzed the actor network with respect to its key variable, the speed-up factor β . We chose liver segmentation as the target application and investigated the impact of this parameter on the Anatomy3 data, using the actor network for non-rigid registration. We trained the actor network with different speed-up factors β and evaluated each on one set of test data to investigate the overall impact on segmentation performance and general convergence. For a more comprehensive comparison, we also included one-shot estimation and one-shot estimation with a learning rate: in the one-shot approach, the actor network directly learns the optimal action, while in one-shot with learning rate it learns the optimal action multiplied by the learning rate. For each trained actor network, we plotted the mean vertex distance on the test set over 50 time steps. As shown in Fig. 5, the value of β was directly correlated with the convergence speed of the algorithm over the time step t.
Table 3. Evaluation of the mean distance between landmarks in MRI and ultrasound before and after correction.
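The role of β in the iterative update can be illustrated with a toy sketch in which the learned actor is replaced by the true direction to a known target; this stand-in is our simplification, not the paper's network, but it reproduces the normalized-action update and the speed/accuracy trade-off:

```python
import numpy as np

def step_towards(x, target, beta):
    """One iteration of the normalized-action update: the action is
    normalized to unit length and scaled by the speed-up factor beta,
    so every iteration moves the vertex a fixed step size (mm)."""
    action = target - x
    dist = np.linalg.norm(action)
    if dist <= beta:          # within one step: snap to the target
        return target.copy()
    return x + beta * action / dist

# Small beta: slow but precise convergence over many time steps.
x = np.zeros(3)
target = np.array([10.0, 0.0, 0.0])
for _ in range(50):
    x = step_towards(x, target, beta=0.5)
```

A large β covers the 10 mm gap in a few coarse steps, while β = 0.5 takes 20 steps but lands exactly on the target, mirroring the convergence behaviour in Fig. 5.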

Discussion and outlook
Our results suggest that the proposed method provides good accuracy and robustness for segmentation and registration applications in single and multi-modal datasets. While the accuracy for each specific scenario could be further improved by using specific pre-processing or post-processing methods, we refrained from this, as we wanted to keep our approach general and flexible. In addition, different data splits and other baseline methods could be used to find out how the use of other training datasets impacts the proposed method relative to others. This is, however, part of future work as the number of suitable training datasets currently available to us is limited.
Since the proposed method samples the observation encoding in the volume using absolute dimensions (mm) rather than voxels as the base unit, the algorithm can be readily applied to datasets with different voxel sizes. In particular, no resampling is required during either the training or application phase. This property is especially useful for multi-modal applications, where spatial resolution may vary across imaging modalities. Also note that our method requires a proper affine transformation matrix between the Right, Anterior, Superior (RAS+) coordinate system and the voxel coordinate system. For clinical datasets, this matrix can be retrieved or computed from Neuroimaging Informatics Technology Initiative (NIfTI) or DICOM header files. However, in some segmentation challenge datasets, this information is missing or misleading. In such cases, the proposed method will fail, while methods relying only on voxel data might still work.
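The voxel-to-RAS+ mapping is a single 4 × 4 affine; in practice it comes from the NIfTI/DICOM header (e.g., `img.affine` in the nibabel library), but the application of it is just a homogeneous matrix product. The spacing and origin below are a hypothetical header, chosen only for illustration:

```python
import numpy as np

def voxel_to_ras(affine, ijk):
    """Map voxel indices (i, j, k) to RAS+ millimetre coordinates
    using the 4x4 affine from a NIfTI/DICOM header."""
    ijk = np.atleast_2d(ijk).astype(float)
    homogeneous = np.c_[ijk, np.ones(len(ijk))]   # append 1 for translation
    return (homogeneous @ np.asarray(affine).T)[:, :3]

# Hypothetical header: 1.5 mm isotropic spacing, origin at (-120, -120, -60) mm.
affine = np.array([[1.5, 0.0, 0.0, -120.0],
                   [0.0, 1.5, 0.0, -120.0],
                   [0.0, 0.0, 1.5,  -60.0],
                   [0.0, 0.0, 0.0,    1.0]])
```

Sampling in mm via this transform is exactly what makes the observation encoding independent of voxel size, and also why a wrong or missing affine is fatal to the method.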
In the multi-modal segmentation application, we found that the proposed method still achieves good performance on CECT and DECT datasets, even when trained on only one type of CT data. The 3D U-Net, on the other hand, struggled with different modalities. One plausible explanation is that the proposed method does not learn directly from the raw intensity values, whereas the 3D U-Net does. This makes the changes in intensity values within the liver, caused by contrast agent and different energy levels, detrimental to the performance of the 3D U-Net. Such changes have less impact on our method, as we learn actions leading to a correct solution. This robustness was confirmed during the experiments involving the CHAOS dataset, where LoG pre-processing was used.
We found that large values of β yielded faster convergence but lower accuracy, while small values of β resulted in slower convergence but higher accuracy. As expected, the value of β is a trade-off between the speed and the accuracy of the network. Using one-shot estimation, the network diverged after the first iteration. By introducing a learning rate, we can clearly see that the algorithm stabilized over time.
The introduction of action normalization and the speed-up factor β plays an important role in the training process. Although it has been shown that registration can be performed with a CNN in a one-step approach 40 , this may not be the optimal strategy. If we consider image registration as a sequential energy minimization process, the one-step approach corresponds to an algorithm that seeks the solution using a very large learning rate. Such an algorithm may not converge to a minimum, as shown in Fig. 5, and it may not generalize well when applied to unseen datasets 41 . By normalizing the action, we effectively set the step size of the algorithm to a fixed number. This strategy outperformed a fixed learning rate due to the combination of feature encoding and the optimization of the loss function: in consecutive steps, the feature encoding remains rather similar, but the magnitude of the action changes constantly (due to the learning rate). As a consequence, the L2 loss function in this ill-posed problem effectively tries to find an average magnitude of the actions. As all point correspondences are modeled only implicitly, another potential application of our method is registering computational phantoms, e.g., the Visible Human (VisHum) 42 , to CT volumes. This warping propagates the dense segmentation of the phantom to the CT volume and can facilitate new clinical applications, e.g., patient X-ray dose estimation. In such a case, we train our agent, e.g., on the lungs, liver, spleen, and kidneys. To register the Visible Human to a CT dataset, we can use the segmentation of the Visible Human as initialization and then estimate the associated deformation vectors of the organs. Afterwards, a dense deformation field can be interpolated from these sparse deformation vectors via B-spline interpolation. Using this deformation field, we finally warp the Visible Human to an arbitrary CT volume.
An example of this application is shown in Fig. 6. As we can see, the propagated dense segmentation is visually quite accurate, and our approach can accomplish this in seconds.
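The sparse-to-dense interpolation step above can be sketched as follows. The paper proposes B-spline interpolation; for a self-contained illustration we substitute piecewise-linear scattered-data interpolation (SciPy's `LinearNDInterpolator`), which plays the same role of densifying per-organ deformation vectors into a field defined at every voxel:

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

def densify_deformation(points, vectors, grid_shape):
    """Interpolate sparse deformation vectors (given at organ surface
    points, shape (k, 3) each) to a dense field over a voxel grid,
    returning an array of shape grid_shape + (3,). Voxels outside the
    convex hull of the sparse points get a zero deformation."""
    interp = LinearNDInterpolator(np.asarray(points, dtype=float),
                                  np.asarray(vectors, dtype=float),
                                  fill_value=0.0)
    grid = np.stack(np.meshgrid(*[np.arange(s) for s in grid_shape],
                                indexing="ij"), axis=-1).reshape(-1, 3)
    return interp(grid).reshape(*grid_shape, 3)
```

Warping the phantom is then a matter of resampling it through this field, which is what makes the propagation of the dense segmentation to an arbitrary CT volume feasible in seconds.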
In sum, due to its flexibility, robustness, and relatively low computational complexity, the proposed method provides a promising multi-purpose segmentation and registration framework, particularly in the context of image-guided interventions. In subsequent work, more clinical applications will be investigated.
Received: 14 June 2020; Accepted: 14 January 2021