CLAIRE-ROP: Rapid Partitioning-based Deformable Image Registration on Multi-GPU Accelerator

Deformable image registration (DIR) is an important tool for clinical applications, especially in medical imaging and radiotherapy, as it ensures accurate image alignment and analysis. Given the high computational demands, acceleration of DIR is essential to ensure efficient and timely medical procedures. The use of a multi-GPU implementation holds the potential for significant DIR acceleration. However, the challenges posed by high communication times and increasing complexity have made using multi-GPU configurations impractical in the clinical setting. In this paper, we propose a novel solution, called CLAIRE-ROP (Rapid Overlapped Partitioning-based) to address these challenges to develop an efficient multi-GPU implementation. Our approach incorporates an efficient partitioning scheme that divides lung images into multiple partitions and allows individual registration for each partition without compromising accuracy. Our method successfully registers images from the largest openly available lung dataset (512 × 512 × 136) in less than 0.5 seconds with a Dice score of 0.991. Our code is available at https://github.com/UniHD-CEG/CLAIRE-ROP.


INTRODUCTION
Image registration (IR) is a fundamental task in many clinical applications, involving the alignment of two medical images.In this process, one image serves as a reference, and geometric transformations are applied to the other images, referred to as moving images, to achieve optimal spatial alignment of anatomical structures.Since anatomical changes are often deformable, deformable image registration (DIR) is used in many medical applications.
DIR is a type of image registration known for its complexity in process and validation.This complexity arises from the need to model and compute a deformation field that maps one image onto another.This deformation field uses mathematically calculated algorithms and processes that describe the spatial transformation between the two images.Due to the intricate nature of this process, determining the deformation field can be computationally demanding and may necessitate the use of specialized hardware and software.
Although DIR methods have achieved significant success in registration accuracy, they require intensive optimization calculations and task-related parameter tuning, leading to substantial computing resources and time requirements.This limits medical applications that are highly time-constrained, such as image-guided adaptive radiation therapy, where images must be continuously registered while the patient is breathing.Several studies have proposed DIR methods to achieve a speedup by utilizing different accelerators such as graphics processing units (GPUs) [3,12,13,18,19,23,26,30].A few authors have also proposed a multi-GPU-based DIR framework to accelerate the registration time [2].Overall, none of them are capable of achieving real-time DIR.In recent years, deep learning (DL) methods have gained increasing attention, and many DLbased methods are being applied to DIR [6-9, 14, 20, 25, 27].They can considerably speed up computational time.However, model generalization poses a major challenge, as different DL models must be trained individually for each application.Furthermore, the performance of DL models on unseen clinical datasets remains uncertain.
In this paper, we explore the multi-GPU implementation of the constrained large deformation diffeomorphic image registration (CLAIRE) framework [2].While CLAIRE was primarily developed for brain images, our findings reveal its potential as a reliable and accurate solution for lung registration.It is important to note that although their multi-GPU approach has succeeded in improving the runtime for large-scale imaging problems, CLAIRE's capabilities come at a high communication cost for actual-size clinical lung datasets.To enable CLAIRE to achieve real-time registration with clinical datasets, we propose a partitioning-based registration by dividing lung images into multiple partitions, thereby enabling individualized image registration for each partition while preserving accuracy.This division is done to leverage the computational power of multiple GPUs, as each GPU works independently on its assigned partition, resulting in reduced processing time.As we made sure that there are no dependencies among different partitions, multiple partitions can be registered simultaneously and in a fully parallel way.As the number of available GPUs increases, we can seamlessly adapt by assigning more partitions to effectively utilize the additional computational power.This strategy, therefore, eliminates the need for communication between GPUs and improves scalability.
Throughout this study, we focus on registering 3D computed tomography (CT) lung scans as one of the most deformable and complex medical datasets due to their elasticity and the influence of respiratory motion during respiration.However, our method can register images from various regions of interest and extends beyond the medical imaging domain.We provide an analysis of the behavior of our method by partitioning images in various settings.Our study includes an empirical analysis that examines the impact of partition size and the number of GPUs on registration accuracy and time.Moreover, it offers a flexible and adaptable framework applicable to other image registration tasks beyond CLAIRE.In conclusion, our approach enables faster image analysis with a wide range of potential applications in various fields.It is especially beneficial when dealing with large datasets and computationally intensive registration tasks.In summary, this paper makes the following contributions: • Multi-GPU framework for lung registration (CLAIRE): Building upon an established brain registration framework for CLAIRE, we propose the use of CLAIRE as a promising framework for registering lung images.This framework significantly reduces processing time while preserving high registration accuracy.To achieve optimal results, we perform thorough parameter tuning, systematically identifying the most suitable configuration for the lung dataset.• Rapid Overlapped Partitioning-based registration (CLAIRE-ROP): To further accelerate the registration process, we propose a partitioning scheme for the lung dataset to enable individual registration for each partitioned image set on a dedicated GPU.This eliminates data exchange among the GPUs, resulting in improved overall performance.• Evaluation of total execution time: This study comprehensively evaluates total execution time, encompassing pre-and post-processing stages, compared to baselines.
By presenting these contributions, we aim to offer a substantial advancement in lung image registration techniques, paving the way for more efficient DIR in clinical applications.
The rest of the paper is organized as follows.Section 2 provides background information, and Section 3 discusses related work.Section 4 evaluates CLAIRE's performance and limitations.Section 5 outlines our method.We discuss insights into total execution time in Section 6. Lastly, Section 7 draws discussion for potential future work.

BACKGROUND
Lung cancer is the second most commonly diagnosed cancer and remains the leading cause of cancer-related deaths [21].Radiation therapy (RT), an important treatment for lung cancer, uses high-energy radiation to eliminate cancer cells and stop their uncontrolled growth.However, the movement of lung tumors during natural expansion and contraction with each breath presents a major challenge.Without proper image registration, these respiratoryinduced anatomical changes can compromise the dose distribution during irradiation [11].In Figure 1, we show the intensity difference between two distinct respiratory phases: the end-inhalation phase and the end-exhalation phase, from an axial perspective.This visualization highlights the need for a deformable registration step.
Like most existing algorithms for deformable registration, CLAIRE iteratively optimizes a transformation based on a similarity measure.CLAIRE uses an optimal control technique to solve the optimization problem by seeking a velocity field  () that produces the desired spatial transformation on a three-dimensional rectangular domain Ω := [0, 2) 3 ⊂ R 3 with periodic boundary conditions on Ω.Let  denote the spatial coordinate; ( 1 ,  2 ,  3 )  ∈ R 3 , and (, ) is the state variable that transports intensities of  0 .The optimization problem with two images  0 () (moving image, image to be deformed) and  1 () (reference image) can be stated as: subject to The first term in (1) measures image similarity ( 2 -distance) between the deformed moving image (, 1) and the reference image  1 ().The second term in (1a) is a regularization term with a regularization trade-off parameter  > 0 to ensure the smoothness of the velocity field.This, in turn, ensures that the resulting geometric transformation of the moving image retains the properties of the diffeomorphism.The transport equation (2) describes the geometric transformation of the moving image  0 ().A reduced-space Gauss-Newton-Krylov method is used to solve (1).
In RT, time sensitivity goes beyond efficient treatment planning.Real-time adaptive treatment and image-guided interventions during procedures greatly benefit from rapid DIR, enabling immediate feedback and adjustments [16].Among the various approaches aimed at accelerating DIR, GPUs emerge as the most compelling choice.Since DIR is a data parallel task and deformation vectors at individual voxels can be computed almost independently at each iteration step, GPUs are exceptionally well-suited for this application.However, achieving real-time deformable registration computations on a single GPU remains a challenge despite recent advancements in GPU technology.To tackle this challenge, a few studies leverage the power of multiple GPUs, distributing the computational load across multiple GPUs to enhance parallelism.While adding more GPUs increases parallel processing capabilities, it also raises the complexities of interactions between GPUs.Consequently, a multi-GPU utilization setup becomes less practical when dealing with clinical datasets.The concept of overlapping communication and computation serves as a technique to mitigate the impact of communication delays.However, applying overlapping techniques smoothly can be challenging for some mathematical tasks, such as interpolation and Fast Fourier Transforms (FFTs), which are the most significant contributors to the computational cost of deformable registration.These tasks involve data dependencies that complicate the process or prevent overlapping to a large extent.
In this paper, we evaluate strong scaling to measure how effective parallelization is in terms of the number of processors used.Strong scaling increases the number of processors while keeping the problem size constant, in contrast to weak scaling, which increases both the number of processors and the problem size.The strong scaling speedup is defined as the ratio of the time required to complete a task of size  with a single GPU  (1, ) to the time required to complete the same unit of work with  processing components (parallel tasks with multiple GPUs),  ( , ):

RELATED WORK
In this section, we explore studies that attempted to accelerate DIR using various hardware and software solutions.

CPU Implementation
In the past decade, a wide range of methods for robust image registration has been proposed.These methods have achieved great success in registration accuracy [10,15,24].However, they often require a long execution time, even on multi-core high-end CPUs.Popular CPU implementation packages for deformable registration are Demon [28], Elastix [17], and Advanced Normalization Tools (ANTs) [1].CLAIRE, introduced alongside [22], includes a CPU version featuring a distributed-memory implementation of an effective solver for large deformation diffeomorphic image registration in three dimensions.CLAIRE has been optimized for scalability on standard x86 CPU clusters, utilizing the Message Passing Interface (MPI) for parallelization.The authors have demonstrated its capability to address registration problems involving clinically relevant data sizes, achieving impressive results.By utilizing 20 cores on a standard computer node, CLAIRE completes registration tasks within two to four minutes, ensuring data fidelity.Notably, they accomplished a registration time that is almost competitive with the Demons algorithm on the same system, while achieving superior registration quality.However, it is crucial to acknowledge that DIR poses significant computational demands, making CPU-based processing impractical for clinical use.In recent years, GPUs have been employed in a wide range of RT applications, where they have achieved remarkable success in accelerating computationally demanding challenges.

Single GPU Implementation
GPU-accelerated lung DIR has been implemented in [12,13,18,19,23,26,30], achieving computational acceleration compared to the CPU implementation while maintaining similar levels of registration accuracy.For example, in [30], the authors accomplished an impressive runtime of around 6 seconds, setting a record as the fastest completion time within the Case 1 to Case 4 DIR-Lab dataset.In particular, a GPU implementation of the Demons algorithm is detailed in [13,23], showcasing substantial acceleration compared to its CPU implementation.However, a notable challenge in these studies is the absence of publicly available implementations, preventing replication of their results.In a recent development, CLAIRE has been ported to a single-node, single-GPU implementation, incorporating highly optimized, mixed-precision GPU kernels for scattered-data interpolation evaluation [3].Although CLAIRE was initially designed for brain image registration, our discussion highlights its robust potential for lung datasets, and the code is publicly accessible.
Nevertheless, despite leveraging the computational capabilities of GPUs in DIR applications, it is worth noting that the registration process still leads to lengthy execution times, as the high computational demands of a registration exceed the capabilities of a single GPU.

Multi-GPU Implementation
Research on multi-GPU-based image registration for RT applications has been relatively limited.This limitation is due to the complex challenges related to data partitioning, memory allocation, and data movement.The absence of a unified approach to overcome these challenges has resulted in clinical RT applications staying behind research efforts in computer science.
Apart from studies focusing on lung datasets, the authors in [2] proposed a novel and highly optimized multi-node multi-GPU implementation of CLAIRE.This implementation leverages direct device communication for fundamental computational kernels, such as interpolation for solving transport equations, high-order finite difference operators for differentiation, and FFTs.Additionally, they introduced algorithmic enhancements that resulted in noteworthy performance improvements.

MULTI-GPU FRAMEWORK FOR LUNG REGISTRATION
Given the promising results of multi-GPU DIR for brain images based on CLAIRE, this section is concerned with assessing whether this method can also be applied to lung images, which differ substantially in their characteristics from brain images.To this end, we consider an extensive parameter search for optimal execution, before reporting performance and accuracy for single-and multi-GPU experiments, comparing both metrics to a well-chosen baseline.to find an optimal value that would be consistent across all datasets.Our goal was to find a balance between the Dice score and time.Through our investigations, we identified 5e−3 as the optimal value for this trade-off.

Optimal Setting for Lung Registration
To determine the optimal configuration for the lung dataset, we performed extensive parameter tuning.We found that a regularization parameter continuation scheme was effective, using a target parameter  = 5e−3 for all resolutions, which differs from the one used in [2] (5e−4).This scheme involves solving the registration problem for a sequence of decreasing values for .For each new value, the velocity obtained in the previous step serves as the initial estimate for the Gauss-Newton-Krylov solver.In Figure 2, we present our exploration of regularization parameters to achieve consistency across both small and large datasets in lung image registration.
For the other parameters, such as the distance measure and the optimization method, we found the settings recommended by the authors as optimal.
In addition, it is worth noting that medical images can have different orientations and patient positioning, which can have a significant impact on the registration process.Therefore, ensuring consistent orientation and alignment is critical before initiating the registration process.We have found that the right/left (RL), anterior/posterior (AP), and superior/inferior (SI) orientation formats are most effective in solving the lung registration problem, as shown in Figure 3. Interestingly, although deformable registration strategies often begin with an initial affine transformation for global alignment and are then followed by a more complex deformable transformation, our findings suggest that affine preregistration is not necessary for lung datasets, simplifying the registration pipeline.

Experimental Setup
We evaluate performance using the 4DCT DIR-Lab dataset 1 , one of the most commonly used datasets in image registration studies [4,5].
The dataset includes ten 4D lung images labeled as Case 1 to Case 10.Each image contains ten 3DCT phases (T00 to T90), providing different stages throughout the entire respiratory cycle.Specifically, we chose phase T00, corresponding to the end of inhalation, as the reference image, and phase T50, representing the end of exhalation, as the moving image in the registration process.We classify Case 1 to Case 5 as our small dataset and Case 6 to Case 10 as our large dataset.The selection of these phases is based on their widespread use in prior studies, ensuring easy comparability of our results.Furthermore, these two phases hold crucial significance as they depict the most substantial deformations in the lung.We assess whether the deformation field of the lung can compensate for respiratory motion.To this end, we calculate the Dice score, also known as the dice similarity coefficient (DSC), for each registration method, which measures the overlap between reference and registered binary masks.Binary masks of grayscale CT images before and after registration were obtained by automatic segmentation with TotalSegmentator [29].Given two segmentations A and B, the Dice score is calculated as

Experimental Results
In Table 1, we present the performance comparison between the single GPU-based CLAIRE and ANTs as a well-known DIR method.We selected ANTs as the baseline method due to its extensive use in medical imaging for various applications, including RT planning and image-guided surgery.To ensure a fair comparison, it is important to note that we used the optimal configuration of the ANTs package to achieve the most accurate registration.We examined methods with and without the use of binary masks in the registration process.The results show that ANTs performs better when mask is included in the registration process, while CLAIRE shows the opposite behavior.Thus, we calculated the Dice difference and speedup factor as the ratio of registration time obtained by ANTs with mask to that of CLAIRE without mask.Remarkably, CLAIRE achieves equivalent or higher accuracy and significantly faster registration times.It achieves a speedup factor of 120 on a single GPU system in the small dataset and 61 in the large dataset compared to the ANTs' CPU-based implementation.Overall, this comparison highlights the efficiency and accuracy that CLAIRE brings to the complex process of nonlinear lung image registration, making it a valuable tool in medical imaging and research.While single GPU results for CLAIRE are noteworthy, it still encounters challenges when dealing with computationally intensive tasks in DIR, emphasizing the critical need for faster registrations.Multi-GPU setups excel in this context due to their parallel processing capabilities, proficiently distributing workloads across multiple GPUs to reduce computation time and meet the need for increased efficiency.However, in contrast to its impressive single GPU performance, CLAIRE's scalability is limited due to significant communication costs and load imbalances between ranks when dealing with clinical problem sizes.We investigated the strong scaling behavior of CLAIRE and present the results in Table 2.It can be seen that CLAIRE does not scale well with the number of GPUs.To gain deeper insight, we performed a profiling analysis.In this analysis, we report the fraction of time spent on computation time (CUDA kernels) and data transfer time (memory operations).Figure 4 shows that CLAIRE significantly reduces computation time, but its scalability suffers from the high communication cost for real problem sizes.This highlights that the communication overhead becomes a bottleneck and dominates the overall execution time.Note that the Dice score remains consistent regardless of the number of GPUs used.In this work, our goal is to achieve real-time capabilities for problem sizes that are relevant in the clinical setting.This goal requires us to eliminate or minimize the time spent on data transfer.In the following section, we outline our approach to improving scalability.2, which was the total runtime of the entire solver.These results highlight data transfer time as a potential bottleneck.

RAPID OVERLAPPED PARTITIONING-BASED REGISTRATION
The DIR is a multi-step process, which is also reflected in our approach.These steps include: (1) Pre-processing: We present a standardized partitioning technique that accommodates the anatomical variations in lung images, including diverse sizes and shapes.This step is detailed in sub-section A (partitioning method) and sub-section B (addressing boundary effects).
(2) Processing: This partitioning technique enables parallel processing for individual registrations.This is a significant benefit for efficiency.In this core step of DIR, deformation fields are computed to map moving images to reference images.
The number of moving and reference images corresponds to the number of partitioned images.(3) Post-processing: In the post-processing step, deformed partitions from separate registration processes are merged to create a complete deformed image.This step is discussed in more detail in sub-section C.
These three steps are integral to our deformable image registration process.In sub-section D, we report the experimental results to show the effectiveness and performance of our approach.

Partitioning Method
Image partitioning technique is commonly used in image processing to analyze local image content.However, we use this technique to improve parallelism and thereby reduce registration time.In the 4DCT dataset, we encounter two distinct pixel dimensions: 256×256 for Case 1 to Case 5 and 512 × 512 for Case 6 to Case 10, representing the X and Y dimensions.The Z dimension, indicating the number of slices, varies in these datasets depending on the patient, specific imaging protocol, or clinical indication.This variability is a common characteristic of CT imaging, as it represents the depth of the imaged volume, which may vary depending on clinical requirements.Therefore, in our approach, we employ a consistent 2D axial viewpoint for image partitioning considering X and Y dimensions that are uniform for all images.
Our approach involves a combination of column-wise and rowwise partitioning methods chosen for their suitability to the typical lung shape.We refer to column-wise partitioning when the images are partitioned vertically and row-wise partitioning when images are partitioned horizontally.The partitioning process begins with column-wise partitioning, which divides the images into two primary partitions, isolating the left and right lung regions from an axial perspective.We then apply row-wise partitioning to each of these initial partitions, resulting in four different subdivisions: upper left, lower left, upper right, and lower right partitions.This partitioning strategy draws inspiration from the anatomical structure of the lung lobes.We continue the column-wise partitioning within each of these four partitions and further subdivide them into a total of eight partitions.These subdivisions consist of four inner and four edge partitions, effectively completing the partitioning process.We opted for column-wise partitioning in the last phase to increase efficiency by preventing the registration of edge partitions in some datasets.The reason behind this choice is that a higher number of partitions increases the probability that edge partitions do not contain lung tissue, leading to unnecessary registration processes within these partitions.Therefore, our first approach in the final phase is to examine the edge partitions to assess their lung tissue content.If it is determined that all the edge partitions do not contain lung tissue, we proceed to register only the four inner partitions.By examining the edge partitions first and excluding them from registration if they do not contain lung tissue, we can potentially save computational resources, especially for large datasets.
To identify the optimal number of partitions for each dataset, we establish a threshold that maintains the same level of accuracy as the baseline methods (ANTs and CLAIRE).Essentially, we determine the maximum number of partitions that can yield the fastest registration while preserving registration accuracy for various data sizes.Subsequently, guided by the optimal partition number that ensures the fastest registration without sacrificing accuracy, the available number of GPUs, the problem size, and the time sensitivity of the application, CLAIRE-ROP proceeds to employ three different partitioning layouts.

Addressing Boundary Effects
One of the challenges associated with image partitioning is the potential clinical impact of missing data, particularly at the boundaries of the partitions.The significance of missing lung area can vary depending on the application.In certain applications, even a small missing portion of the lung may not have a significant impact on the overall outcome.However, in scenarios such as RT planning or accurate disease assessment, the missing area can be critical.In RT, accurate alignment of the lung area is essential for precise treatment planning.Even a minor omission in the lung area can result in inadequate tumor alignment, ultimately impacting treatment outcomes.
To eliminate the effects of missing lung areas, we considered image partitioning techniques to ensure comprehensive coverage during the image registration process.To achieve this, we implemented overlapping partitions to ensure that critical features and patterns were not missed during registration.This was achieved by sharing common regions between adjacent partitions.To determine the optimal number of additional pixels required for the most accurate Dice score, we conducted experiments in which additional pixels were added to the edges of the image.Specifically, we tested pixel increments of 2, 4, 8, and 16.Our findings show that adding only 8 pixels to the partition edges ensures the inclusion of all lung areas with only a slight increase in registration time.Using this approach, we can create well-controlled partitions without missing lung tissue during the registration process.Figure 5 shows an illustrative example of our partitioning scheme with an 8-pixel overlap.While we apply overlaps to all edges, it is important to note that the inclusion of lung tissue within vertical overlaps can vary depending on the dataset.In this example from Case 6, lung tissue within the vertical borders is absent, which is consistent with all other slices in this case.However, our observations highlight the importance of vertical overlaps in Case 8. To develop a standardized approach applicable to each image set, we consider overlaps for both the horizontal and vertical borders.

Multi-GPU Parallel Registration for Image Partitions
Building on this partitioning approach, we perform separate registrations for each partition, adjusted to account for overlap.Since the registration of each partition is performed independently of the others, we can perform these registrations in parallel to leverage the processing power of multiple GPUs.This effectively eliminates GPU interactions identified as a communication bottleneck in the previous section.As a result, our method doesn't necessitate direct GPU-GPU communication support, and an GPU interconnect not required to achieve optimal performance.Once the individual registrations for all partitions are completed, we merge the deformed partial images to obtain the complete deformed image.We outline the CLAIRE-ROP pipeline in Algorithm 1, and in Figure 6, we present a CLAIRE-ROP pipeline example tailored for a scenario where 8 GPU resources are available, and the image size is 512×512×130.Our objective in this scenario is to achieve the fastest registration time.To attain this goal, we fully leverage all available GPU resources by partitioning the images into 8 partitions.To the best of our knowledge, our work represents the first instance of deploying the partitioning approach to register large misalignments in medical images accurately.We built our implementation upon Simple ITK, an open-source image analysis tool [31].

Experimental Results
In Table 3, we provide evidence that partitioning the small dataset into up to four partitions and the large dataset into up to eight partitions maintains accuracy without compromise.Our approach for the small dataset involved dividing it into four partitions, each with 136 × 136 pixel dimensions with 8 pixels overlap.The large dataset is divided into four 136 × 136 edge partitions and four 144 × 136 inner partitions, all overlapping by 8 pixels.We found that in Cases 9 and 10, the edge partitions lacked lung tissue, which led us to utilize 4 GPUs to only register the inner partitions.In this work, we refer to the CLAIRE registration associated with our Step 1: Pre-Processing Step 2: Processing  ' ', '  ', '  ', and '  ' denote the number of available GPUs, the reference image, the moving image, and the deformed image, respectively.'' corresponds to the X-dimension size, '' to the Y-dimension size, and '' indicates the overlap.In this example, we assumed that the edge partitions contain lung contents, so we utilized all 8 GPUs.In Step 1, the partitioning method is applied.
Step 2 involves performing parallel registrations, and Step 3 leads to obtaining the complete deformed moving image.
method as CLAIRE-ROP.In Figure 7, we show a visualization of the registration results for Case 6 employing four partitions.
To provide a fair comparison between CLAIRE-ROP and CLAIRE, we extract the fastest registration times for the methods from Tables 2 and 3 and present the corresponding speedup values in Table 4.For the small dataset, CLAIRE achieved the fastest registration time using a single GPU, while our approach achieved the fastest registration using 4 GPUs.For the large dataset, CLAIRE utilized 4 GPUs, while our approach harnessed the power of 8 GPUs.In our approach, the small dataset could be registered in an average of 0.34 seconds with four partitions, resulting in a speedup factor of 2.9.For the large dataset, registration with eight partitions could be completed in an average of 0.46 seconds, yielding a speedup factor of 10.6.Importantly, these performance gains were achieved while maintaining the same Dice score.
We also assess strong scalability improvement.Figure 8 shows the strong scaling speedup achieved by CLAIRE-ROP as the number of GPUs increases compared to the ideal speedup.It is calculated by dividing the runtime of CLAIRE on a single GPU by the runtime of CLAIRE-ROP.It is important to note that we cropped the large images in CLAIRE because the problem size should be kept consistent to compute strong scaling speedup.The cropping technique to remove the background in our large dataset is seamlessly integrated into the partitioning process of CLAIRE-ROP.Our method exhibits acceptable scalability on both datasets.In addition, it shows an excellent improvement in scalability when compared to the limited scalability of CLAIRE as reported in the previous section (Table 2).

Reference image Moving image Overlay
After registration

Deformed image (w overlap) Deformed image (w/o overlap) Overlay
Before registration Although the results show a significant reduction in processing time, it is important to acknowledge that our method adds computational overhead in the pre-and post-processing steps.Therefore, to ensure a fair comparison, the total execution time must be evaluated.Image registration typically involves three main steps: pre-processing, processing, and post-processing.Pre-processing is the initial step consisting of preparing the input images for registration.This may include noise reduction, image normalization, segmentation, and cropping.The processing stage involves the actual registration of the images, where the spatial relationship between the images is estimated.Post-processing is the final stage, involving the refinement of registered images and the correction of any errors or artifacts that may have occurred during the registration process.This step may encompass image segmentation and quality assessment.
In our method, we incorporate the partitioning procedure into the pre-processing step.During the processing step, we perform individual registrations for each partitioned image set.Finally, cropping the registered partitions and merging them is integrated into the post-processing step.In Table 5, we report the total time, which includes all three steps.We present the results for Case 1, representing the small dataset, and Case 6, representing the large dataset.For each dataset, we selected the partitioning layout that yielded the fastest processing time, as explained earlier.Overall, the results show that the efficiency of our method extends to the total time by imposing only a small additional computational load in the preand post-processing steps.It is important to emphasize that all these pre-and post-processing steps are fully automated and do not require manual intervention.

DISCUSSION AND FUTURE WORK
DIR is a crucial step in many medical applications to find the best alignment of anatomy between two images.In this work, we proposed a partitioning scheme to develop a real-time DIR framework for use in RT applications in a time frame of milliseconds.In our partitioning method, we can adjust the number and size of partitions to match the complexity of the images or the available computational Table 4: Comparison of the fastest registration times for each method on small and large datasets, along with the speedup factor.We extracted the fastest registration time for each method from Table 2  To further improve the performance of our method, our next goal is to take advantage of image partitioning for customized image registration.Partitioning provides flexibility in choosing the size and shape of partitions based on the dataset characteristics and the registration task requirements.This customization promises significant progress in optimizing the registration process.As shown in Figure 5, the partitions exhibit varying degrees of lung deformation, some less demanding, others more complex.The inner partition at the lower left takes 0.56 seconds to register, while the edge partition at the lower right takes 0.36 seconds, illustrating significant time differences.These results suggest the feasibility and benefits of implementing a heterogeneous partitioning scheme, where the workload is distributed evenly across available GPUs.In addition, tuning the registration settings for each partition based on the deformation levels can lead to more efficient processing.Furthermore, to maximize the utilization of computational resources, we plan to exploit multi-instance GPU (MIG) support for dynamic GPU partitioning among processes, allowing us to establish specific limits for allocatable memory and available compute resources.We harness this feature to allocate GPU resources for each registration process dynamically, enabling flexible registration with customizable memory allocations.We also aim to assess the performance of our method in multimodality image registrations and its effectiveness in datasets involving tumors.DL-based studies often highlight certain drawbacks of optimizationbased methods for DIR [9,20].Some common drawbacks mentioned include: • Optimization-based methods usually require careful parameter tuning to achieve optimal performance.Often, multiple experiments and evaluations must be performed before the most effective set of parameters can be determined.• The iterative nature of optimization-based DIRs makes them computationally intensive, especially for large image datasets.This demand arises from repetitive spatial filtering on deformation vector fields during optimization.They provide several strategies to perform 4DCT DIR, but the runtime is unsatisfactory.In contrast, DL-based methods provide a remarkable advantage in terms of speed.
However, it is important to emphasize that CLAIRE, being an optimization-based method, does not exhibit these drawbacks: • When applying CLAIRE to CT lung images, we found that it has a low sensitivity to a reduction in , consistent with the results discussed in [2,3].CLAIRE is mostly -independent, as can be seen in Figure 2, where the accuracy shows a slight degradation in the range from 1e−2 to 1e−3.We performed parameter tuning for lung images only once, searching for the optimal values that strike a trade-off between accuracy and registration time.This one-time tuning was performed to streamline the process.• In [2], the authors introduced a preconditioner to keep the number of iterations as small as possible in CLAIRE.When applied to the 4DCT dataset, fast convergence was achieved in a range of only 7 to 9 iterations, and this remained consistent across different resolution levels.
Although DL-based methods have witnessed notable advancements, existing research has often prioritized reporting registration accuracy over computational speed [8].Nevertheless, it is imperative to establish benchmarks for registration accuracy and computational time in image registration.Despite the notable progress made in this domain through these advanced techniques, we underscore a key advantage of CLAIRE-ROP -exceptional speed.Our focus is to provide an overview of the comparison methods used in recent studies, exclusively reporting results using the DIR-Lab 4DCT dataset as a standardized benchmark for assessment.To illustrate our findings, Table 6 presents a comparative analysis, highlighting the registration time.This table also serves as a reference for readers seeking a summarized comparison.Most reported runtimes deviate significantly from the real-time registration, such as [9,27].The most promising method, comparable to our results, is presented in [20]; however, the corresponding code to reproduce the results is not accessible.More importantly, the network accepts small images in the current implementations, limited by the available GPU RAM required to keep the full network in memory.This is a common drawback of DL-based methods.As can be seen, all these studies cropped the images to a rough bounding box of the lungs for the convenience of training.In our work, we only remove the background in the large dataset but preserve the most thorax area.We achieved an average registration time of 0.48 seconds for the large dataset, featuring image dimensions of 512 × 256 in the X and Y dimensions, while maintaining an unchanged number of slices in the Z dimension.
Two supervised methods are presented in [6,25].The challenge with supervised transformation prediction methods arises from the need for known ground truth transformations during network training, making it difficult to artificially generate realistic transformations.
In [6], the reported result of 0.58 seconds aligns closely with our results.However, like many studies, the code remains inaccessible.Furthermore, their method requires resizing images to 128 × 128 × 128, a dimension considered impractically small.
To validate the methodology and assess generalization to different data, [6,9,20,25] employed the trained network to register expiration images to inspiration images from a distinct set of test data -the DIR-Lab dataset.However, a consistent observation in all DL-based studies is the need to carefully select training data to comprehensively address potential future registration tasks.Moreover, significant effort is required for data preparation, and in particular the time spent on pre-or post-processing is often omitted in the reports.In [6], they showed that registration accuracy marginally improves when training on a similar dataset.The average target registration error for Case 2, Case 4, and Case 9 was 1.86 ± 1.17 when trained on different datasets, compared to 1.52 ± 0.85 when trained on the same dataset.
Building upon these limitations, the study by [7] introduces an innovative one-shot registration method for tracking periodic motion in 3D and 4D datasets without requiring extensive training data.However, it is worth mentioning that this approach comes with the drawback of prolonged computational time, reported as 26 minutes.
As of the current moment, our achieved registration time on the DIR-Lab dataset stands out as the fastest among all published DIR methods for 4DCT, including both DL and non-DL-based approaches.Notably, our results showed that our approach maintains a competitive level of registration accuracy.

Figure 1 :
Figure 1: Axial view intensity difference slices between endinhalation and end-exhalation phases, highlighting the need for deformable registration, exemplified by Case 6 and Case 7 from the 4DCT DIR-Lab dataset [4].

Figure 2 :
Figure 2: Optimizing regularization parameters for lung dataset consistency: The top three rows show the small dataset (left: Case 1; middle: Case 2; right: Case 3. The bottom three rows show the large dataset (left: Case 6; middle: Case 7; right: Case 8.) During our experimentation, we adjusted the regularization parameter to explore different values ( = 5e−2, 1e−2, 5e−3, 1e−3) to find an optimal value that would be consistent across all datasets.Our goal was to find a balance between the Dice score and time.Through our investigations, we identified 5e−3 as the optimal value for this trade-off.

Figure 3 :
Figure 3: Optimal orientation formats for lung image registration: Axial and coronal views (4DCT Case 6)

Figure 4 :
Figure 4: CLAIRE strong scaling profiling results with 1, 2, and 4 GPUs: A breakdown of time spent on CUDA kernel execution and memory operations.The chart on the left shows the results for the small dataset (Cases 2 and 3), and the chart on the right shows the results for the large dataset (Cases 6 to 10).Note that the time differs from the time reported in Table2, which was the total runtime of the entire solver.These results highlight data transfer time as a potential bottleneck.

Figure 5 :
Figure 5: Example partitioning scheme is followed by columnwise, row-wise, and column-wise partitioning to divide the image into 2, 4, and 8 partitions, respectively.Overlapping ensures that information from neighboring regions is taken into account during the registration to prevent the loss of lung tissues (Case 6: slice 78).

Figure 6 :
Figure 6: An example of the CLAIRE-ROP workflow.' ', '  ', '  ', and '  ' denote the number of available GPUs, the reference image, the moving image, and the deformed image, respectively.'' corresponds to the X-dimension size, '' to the Y-dimension size, and '' indicates the overlap.In this example, we assumed that the edge partitions contain lung contents, so we utilized all 8 GPUs.In Step 1, the partitioning method is applied.Step 2 involves performing parallel registrations, and Step 3 leads to obtaining the complete deformed moving image.

Figure 7 :
Figure 7: Registration results for 4DCT Case 6 with 4 partitions.The top three rows show the original data (left: reference image (T00); middle: moving image (T50); right: overlay of fixed (green) and moving (magenta) images before registration).The bottom three rows show results (left: deformed image without overlap, middle: deformed image with overlap; right: overlay after registration).Image slice position: 78, Image point of view: axial

Table 1 :
A comparative analysis between CLAIRE and ANTs registration methods, with and without masks.Assessing registration time in seconds and the Dice score, where higher values indicate better performance.Speedup factor and Dice difference are calculated by comparing CLAIRE without mask (w/oM) and ANTs with mask (wM).CLAIRE is implemented on a single GPU, whereas ANTs is implemented on a CPU.The results show CLAIRE's efficiency in medical image registration.
As for the registration time, we report the results obtained on the HoreKa system 2 .HoreKa hosts 167 nodes, each equipped with four NVIDIA A100 GPUs with 40 GB RAM and two Intel Xeon Platinum 8368 CPUs with 76 cores, totaling 152 cores per node.

Table 2 :
CLAIRE strong-scaling results using 1, 2, and 4 GPUs.Registration time is in seconds.In some datasets, the solver failed to achieve convergence when transitioning from 1 to 2 or 4 GPUs.Overall, CLAIRE showed poor scaling.

Table 3 :
A comparative analysis between CLAIRE and CLAIRE-ROP with 2, 4, and 8 partitions.Partitioning up to 4 for the small dataset and up to 8 for the large dataset maintained accuracy.Dice scores in bold indicate accepted accuracy compared to baselines.Note that running CLAIRE on 2 nodes with 8 GPUs exceeds communication overhead, so it is not scalable beyond 4 GPUs.Registration times are in seconds.

Table 5 :
and Table3.Figure 8: Strong scaling speedup factor in CLAIRE-ROP:The speedup factor for each dataset is calculated as the average across all five images.Our method shows reasonable scalability on both datasets.Comparison of pre-processing, processing, postprocessing, and total times (seconds) in different methods.Our method shows a minor increase in computational load during the pre-and post-processing steps.

Table 6 :
Comparison of DL-based DIR methods on 4DCT DIR-Lab dataset Reference Network image size Different dataset for testing?Code available?Supervision Runtime (seconds)