LoDoPaB-CT, a benchmark dataset for low-dose computed tomography reconstruction

Deep learning approaches for tomographic image reconstruction have become very effective and have been demonstrated to be competitive in the field. Comparing these approaches is a challenging task as they rely to a great extent on the data and setup used for training. With the Low-Dose Parallel Beam (LoDoPaB)-CT dataset, we provide a comprehensive, open-access database of computed tomography images and simulated low photon count measurements. It is suitable for training and comparing deep learning methods as well as classical reconstruction approaches. The dataset contains over 40000 scan slices from around 800 patients selected from the LIDC/IDRI database. The data selection and simulation setup are described in detail, and the generating script is publicly accessible. In addition, we provide a Python library for simplified access to the dataset and an online reconstruction challenge. Furthermore, the dataset can also be used for transfer learning as well as sparse and limited-angle reconstruction scenarios.


Methods
In this section, the considered mathematical model of CT is stated first, followed by a detailed description of the dataset generation. This starts with the LIDC/IDRI database 15 , from which we extract the ground truth reconstructions. Finally, the data processing steps are described, which are also summarised in a semi-formal manner at the end of the section. As a technical reference, the script 26 used for generation is available online (https://github.com/jleuschn/lodopab_tech_ref).

Parallel beam CT model. We consider the inverse problem of computed tomography given by

    y^δ = Ax + ε,

with:
• A the linear ray transform defined by the scan geometry,
• x the unknown interior distribution of the X-ray attenuation coefficient in the body, also called image,
• ε a sample from a noise distribution that may depend on the ideal measurement Ax,
• y^δ the noisy CT measurement, also called projections or sinogram.

More specifically, we choose a two-dimensional parallel beam geometry, for which the ray transform A is the Radon transform 27 . It integrates the values of x: ℝ² → ℝ, fulfilling some regularity conditions (cf. Radon 27 ), along the X-ray lines L_{s,ϕ} for all parameters s ∈ ℝ and ϕ ∈ [0, π), which denote the distance from the origin and the angle, respectively (cf. Figure 1). In mathematical terms, the image is transformed into a function of (s, ϕ),

    Ax(s, ϕ) = ∫_{L_{s,ϕ}} x(r) dr,

which is called projection, since for each fixed angle ϕ the 2D image x is projected onto a line parameterised by s, namely the detector. Visualisations of projections as images themselves are called sinograms (cf. Figure 2). The projection relates to the ideal intensity measurements I_1(s, ϕ) at the detector according to Beer-Lambert's law by

    Ax(s, ϕ) = −ln(I_1(s, ϕ)/I_0),

where I_0 is the intensity of an unattenuated beam.
In practice, the measured intensities are noisy. The noise can be classified into quantum noise and detector noise. Quantum noise stems from the process of photon generation, attenuation and detection, which as a whole can be modelled by a Poisson distribution 28 . The detector noise stems from the electronic data acquisition system and is usually assumed to be Gaussian. It would play an important role in ultra-low-dose CT with very small numbers of detected photons 29 but is neglected in our case. Thus we model the number of detected photons and, by this, the measured intensity ratio with

    Ñ_1(s, ϕ) ~ Pois(N_0 exp(−Ax(s, ϕ))),

where N_0 is the mean photon count without attenuation and Pois(λ) denotes the probability distribution defined by

    P(Pois(λ) = k) = λ^k exp(−λ)/k!,  k ∈ ℕ_0.

For practical application, the model needs to be discretised. The forward operator is then a finite-dimensional linear map A: ℝⁿ → ℝᵐ, where n is the number of image pixels and m is the product of the number of detector pixels and the number of angles for which measurements are obtained. The discrete model reads

    y^δ = −ln(Ñ_1/N_0),   Ñ_1 ~ Pois(N_0 exp(−Ax)).   (7)

Here, Pois(λ) denotes the joint distribution of m Poisson distributed observations with parameters λ_1 , …, λ_m , respectively. Note that since the negative logarithm is applied to the observations, the noisy post-log values y^δ do not follow a Poisson distribution but the distribution resulting from this log-transformation. However, taking the negative logarithm is required to obtain the linear model and therefore is most commonly applied as a preprocessing step. For our dataset, we consider post-log values by default.
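The discrete noise model above can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation; the function name is ours, and the replacement of zero counts follows the generation steps described later in this section.

```python
import numpy as np

def simulate_post_log(Ax, N0=4096, rng=None):
    """Sample noisy post-log measurements from an ideal projection Ax.

    Model: N1 ~ Pois(N0 * exp(-Ax)), y = -log(N1 / N0).
    Zero photon counts are replaced by 0.1 so that the logarithm is defined.
    """
    rng = np.random.default_rng(rng)
    expected_counts = N0 * np.exp(-np.asarray(Ax, dtype=float))
    counts = rng.poisson(expected_counts).astype(float)
    counts = np.maximum(counts, 0.1)  # ensure strictly positive values
    return -np.log(counts / N0)
```

For an unattenuated beam (Ax = 0), the sampled post-log values fluctuate around zero with a standard deviation of roughly 1/sqrt(N_0).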
The Radon transform is a linear and compact operator. Therefore, the continuous inverse problem of CT is mildly ill-posed in the sense of Nashed 30,31 . This means that small variations in the measurements can lead to significant differences in the reconstruction (unstable inversion). While the discretised inverse problem is not ill-posed, it is typically ill-conditioned 28 , which leads to artefacts in reconstructions obtained by direct inversion from noisy measurements.

For our discrete simulation setting, we use the described model with the following dimensions and parameters: • Image resolution of 362 px × 362 px on a domain of size 26 cm × 26 cm.
• 513 equidistant detector bins s spanning the image diameter.
• Mean photon count per detector bin without attenuation N_0 = 4096.

LIDC/IDRI database and data selection. The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI) published the LIDC/IDRI database 15,21,22 to support the development of CAD methods for the detection of lung nodules. The dataset consists of 1018 helical thoracic CT scans of 1010 individuals. Seven academic centres and eight medical imaging companies collaborated in the creation of the database. As a result, the data is heterogeneous with respect to the technical parameters and scanner models.
Both standard-dose and lower-dose scans are part of the dataset. Tube peak voltages range from 120 kV to 140 kV and tube currents from 40 mA to 627 mA, with a mean of 222.1 mA. Labels for the lung nodules were created by a group of 12 radiologists in a two-phase process. The image reconstruction was performed with different filters, depending on the manufacturer of the scanner. Figure 3 shows examples of the provided reconstructions. The LIDC/IDRI database is freely available from The Cancer Imaging Archive (TCIA) 22 . It is published under the Creative Commons Attribution 3.0 Unported License (https://creativecommons.org/licenses/by/3.0/).
The LoDoPaB-CT dataset is based on the LIDC/IDRI scans. Our dataset is intended for the evaluation of reconstruction methods in a low-dose setting. Therefore, we simulate the projection data, which is not included in the LIDC/IDRI database. In order to enable a fair comparison with good ground truth, scans that are too noisy were removed in a manual selection process (cf. section "Technical Validation"). Additional scans were excluded due to their geometric properties, namely an image size different from 512 px × 512 px, a too small area of valid pixel values (cf. subsection "Ground truth image extraction" below), or a different patient orientation. The complete lists of excluded scan series are given in file series_list.json in the technical reference repository 26 . In the end, 812 patients remain in the LoDoPaB-CT dataset.
The dataset is split into four parts: three parts for training, validation and testing, respectively, and a "challenge" part reserved for the LoDoPaB-CT Challenge (https://lodopab.grand-challenge.org/). Each part contains scans from a distinct set of patients, as we want to study the case of learned reconstructors being applied to patients that are not known from training. The training set features scans from 632 patients, while the other parts contain scans from 60 patients each. Every scan contains multiple slices (2D images) for different z-positions, of which only a subset is included. The number of extracted slices depends on the slice thickness obtained from the metadata. As slices with small distances are similar, they may not provide much additional information while increasing the chance of overfitting. The distances of the extracted slices are larger than 5.0 mm for >45% and larger than 2.5 mm for >75% of the slices. In total, the dataset contains 35820 training images, 3522 validation images, 3553 test images and 3678 challenge images.
Remark. We propose to use our default dataset split, as it allows for a fair comparison with other methods that use the same split. However, users are free to remix or re-split the dataset parts. For this purpose, randomised patient IDs are provided, i.e., the same random ID is given for all slices obtained from one patient. Thus, when creating custom splits, it can be regulated whether, and to what extent, data from the same patients are contained in different splits.
Ground truth image extraction. First, each image is cropped to the central rectangle of 362 px × 362 px. This is done because most of the images contain (approximately) circle-shaped reconstructions with a diameter of 512 px (cf. Figure 3). After the crop, the image only contains pixels that lie inside this circle, which avoids value jumps occurring at the border of the circle. While this yields natural ground truth images, we need to point out that the cropped images, in general, do not show the full subject but some interior part. Hence, it is unlikely for methods trained with this dataset to perform well on full-subject measurements.

For some scan series, the circle is subject to a geometric transformation either shrinking or expanding the circle in some directions. In particular, for a few scan series, the circle is shrunk such that it is smaller than the cropped rectangle. We exclude these series, i.e. those with patient IDs 0004, 0032, 0102, 0116, 0120, 0289, 0368, 0418, 0541, 0798, 0926, 0972 and 1000, from our dataset, which allows cropping all included images consistently to 362 px × 362 px.
The integer Hounsfield unit (HU) values obtained from the DICOM files are dequantised by adding uniform noise from the interval [0, 1). By adding this noise, the discrete distribution of stored values is transformed into a continuous distribution (up to the floating-point precision), which is a common assumption of image models. For example, the meaningful evaluation of densities learned by generative networks requires dequantization 32 , which in some works 33 is more refined than the uniform dequantization applied to the HU values in our dataset.
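The dequantisation step can be illustrated as follows. This is a minimal sketch; the function name is ours and not taken from the reference script.

```python
import numpy as np

def dequantise(hu_int, rng=None):
    """Turn integer HU values into samples from a continuous distribution
    by adding uniform noise from the interval [0, 1)."""
    rng = np.random.default_rng(rng)
    a = np.asarray(hu_int, dtype=float)
    return a + rng.uniform(0.0, 1.0, size=a.shape)
```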
In the next step, the linear attenuations μ are computed from the dequantised HU values using the definition of the HU,

    μ = (HU/1000) · (μ_water − μ_air) + μ_water,

where μ_water and μ_air are the attenuation coefficients of water and air, which approximately correspond to an X-ray energy of 60 keV 34 . The attenuation images are subsequently normalised by division by μ_max to match the stored ground truth values. The Eqs. (8) and (11) are applied pixel-wise to the images.
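The HU conversion can be written as a small helper. This is a sketch; the default coefficients are rough values for water and air at about 60 keV in units of 1/m (our assumption) and should be checked against the reference script before use.

```python
def hu_to_mu(hu, mu_water=20.0, mu_air=0.02):
    """Convert (dequantised) Hounsfield units to linear attenuation values,
    inverting HU = 1000 * (mu - mu_water) / (mu_water - mu_air).

    Defaults are approximate ~60 keV coefficients in 1/m (assumed values).
    """
    return hu / 1000.0 * (mu_water - mu_air) + mu_water
```

By construction, HU = 0 maps to μ_water and HU = −1000 maps to μ_air.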

Projection data generation.
To simulate the measurements based on the virtual ground truth images, the main step is to apply the forward operator, which is the ray transform (Radon transform in 2D) for CT. For this task we utilise the Operator Discretization Library 35 (ODL) with the 'astra_cpu' backend 36 .
Remark. We choose 'astra_cpu' over the usually favoured 'astra_cuda' because of small inaccuracies observed in the sinograms when using 'astra_cuda', occurring at specific angles and detector positions that depend on the detector length l. The used version is astra-toolbox==1.8.3 on Python 3.6. The tested CUDA version is 9.1 combined with cudatoolkit==8.0.
In order to avoid "committing the inverse crime" 37 , which, in our scenario, would be to use the same discrete model both for simulation and reconstruction, we use a higher resolution for the simulation. Otherwise, good performance of reconstructors at the specific resolution of this dataset (362 px × 362 px) could also stem from the properties of the specific discretised problem, rather than from good inversion of the analytical model. We use bilinear interpolation for the upscaling of the virtual ground truth from 362 px × 362 px to 1000 px × 1000 px.

The non-normalised, upscaled image is projected by the ray transform. Based on this projection, Ax, the measured photon counts Ñ_1 are sampled according to Eq. (7). The sampling in some cases yields photon counts of zero, which we then replace by photon counts of 0.1. Hereby strictly positive values are ensured, which is a prerequisite for the log-transform in the next step (cf. Wang et al. 38 ). The negative logarithm of the photon count quotient max(0.1, Ñ_1)/N_0 is taken, resulting in the post-log measurements y^δ according to Eq. (7) (up to the 0.1 photon count approximation). Finally, y^δ is divided by μ_max to match the normalised ground truth images. A summary of all steps can be found in Fig. 4 (Data generation algorithm).
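The simulation steps above can be summarised in a compact sketch. This is illustrative only: forward_op stands in for the ODL RayTransform used in the reference script, and the bilinear upscaling is assumed to happen before this function is called.

```python
import numpy as np

def generate_observation(upscaled_image, forward_op, N0=4096, mu_max=1.0, rng=None):
    """Simulate one post-log, normalised observation from a (non-normalised,
    upscaled) ground truth image, following the described pipeline.

    forward_op: callable implementing the higher-resolution ray transform
    (in the reference script, an ODL RayTransform with the 'astra_cpu' backend).
    """
    rng = np.random.default_rng(rng)
    proj = forward_op(upscaled_image)                 # Ax
    counts = rng.poisson(N0 * np.exp(-proj)).astype(float)
    counts = np.maximum(counts, 0.1)                  # replace zero counts by 0.1
    y = -np.log(counts / N0)                          # post-log measurements
    return y / mu_max                                 # match normalised images
```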
Remark. Although the linear model obtained by the log-transform is easier to study, in some cases pre-log models are more accurate. See Fu et al. 29 for a detailed comparison. For applying a pre-log method, the stored observation data ỹ = y^δ/μ_max must be back-transformed by

    Ñ_1 = N_0 exp(−μ_max ỹ).

To create physically consistent data pairs, the ground truth images should then be multiplied by μ_max , too.
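The back-transformation to pre-log photon counts can be implemented directly (a sketch; the function name is ours):

```python
import numpy as np

def post_log_to_counts(y_stored, N0=4096, mu_max=1.0):
    """Recover approximate photon counts from stored normalised post-log data.

    Inverts y_stored = -log(N1 / N0) / mu_max, i.e. N1 = N0 * exp(-mu_max * y_stored).
    """
    return N0 * np.exp(-mu_max * np.asarray(y_stored, dtype=float))
```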
Remark. Note that the minimum photon count of 0.1 can be adapted subsequently. This is most easily done by filtering out the highest observation values and replacing them with −log(ε 0 /4096)/μ max , where ε 0 is the new minimum photon count.

Data Records
The LoDoPaB-CT dataset is published as open access on Zenodo (https://zenodo.org) in two repositories. The main data repository 39 (https://doi.org/10.5281/zenodo.3384092) has a size of around 55 GB and contains observations and ground truth data of the train, validation and test set. For each subset, represented by *, the following files are included:
• CSV files patient_ids_rand_*.csv include randomised patient IDs of the samples. The patient IDs of the train, validation and test parts are integers in the ranges 0-631, 632-691 and 692-751, respectively. The ID of each sample is stored in a single row.
• Zip archives ground_truth_*.zip contain HDF5 40 files of the ground truth reconstructions.
• Zip archives observation_*.zip contain HDF5 files of the simulated low-dose measurements.
• Each HDF5 file contains one HDF5 dataset named data, which provides several samples (128, except for the last file in each zip archive). For example, the n-th training sample pair is stored in the HDF5 files observation_train_%03d.hdf5 and ground_truth_train_%03d.hdf5, where the placeholder %03d is floor(n/128). Within these HDF5 files, the observation or ground truth is stored at entry (n mod 128) of the HDF5 dataset data.
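The described indexing scheme can be expressed as a small helper. This is a sketch following the naming description above; reading the actual arrays would additionally require the h5py package.

```python
def sample_location(n, prefix='observation_train', samples_per_file=128):
    """Locate the n-th sample: return the HDF5 filename (following the %03d
    naming scheme) and the index within its 'data' dataset."""
    file_idx, within = divmod(n, samples_per_file)
    return f'{prefix}_{file_idx:03d}.hdf5', within
```

For instance, sample 130 of the training set lives at entry 2 of observation_train_001.hdf5.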
The second repository 41 for the challenge data (https://doi.org/10.5281/zenodo.3874937) consists of a single zip archive: • observation_challenge.zip contains HDF5 files of the simulated low-dose measurements.
The structure inside the HDF5 files is the same as in the main repository.

Technical Validation
Ground truth & data selection. Creating high-quality ground truth images for tomographic image reconstruction is a challenging and time-consuming task. In computed tomography, one option is to cut open the object after the scan or use 3D printing 42 , whereby the digital template of the object is the reference. In general, this also involves high radiation doses and many scanning angles. This combination makes it even harder to generate ground truth images for medical applications. For low-dose CT reconstruction models, the primary goal is to match the normal-dose reconstruction quality of methods currently in use. Therefore, normal-dose reconstructions from classical methods, e.g. filtered back-projection, are an adequate choice as ground truth. This simplifies the process considerably.
The ground truth CT reconstructions of LoDoPaB-CT are taken from the established and well-documented LIDC/IDRI database. An independent visual inspection of one 2D slice per scan was performed by three of the authors. Figure 3 shows three examples of such slices. A five-star rating system was used to evaluate the image quality and remove noisy ground truth data, like the first slice in Fig. 3. Scans with artefacts, e.g. from photon starvation due to dense material (cf. Figure 3 (right)), were in general not removed, as the artefacts only affect a few slices of the whole scan. The slice in the middle of Fig. 3 represents an ideal ground truth. The following procedure was then used to exclude scans based on their rating: 1. Centring of the ratings from each evaluator around the value 3. 2. Calculation of the mean rating and the variance for each looked at 2D slice. 3. For a variance <1, the mean was used as the rating score. Otherwise, the scan is evaluated by all three authors together. 4. All scans with a rating ≤2 are excluded from the dataset.
These excluded scans are listed at key "series_excluded_manual_low_q_filter" in file series_list.json in the technical reference repository 26 .

Reference reconstructions & quantitative results.
To validate the usability of the proposed dataset for machine learning approaches, we provide reference reconstructions and quantitative results for the standard filtered back-projection (FBP) and a learned post-processing method (FBP + U-Net). FBP is a widely used analytical reconstruction technique (cf. Buzug 28 for an introduction). If the measurements are noisy (due to the low dose), FBP reconstructions tend to include streaking artefacts. A typical approach to overcome this problem is to apply some post-processing such as denoising. Recent works 3,4,8 have successfully used convolutional neural networks, such as the U-Net 43 . The idea is to train a neural network to create clean reconstructions out of the noisy FBP results.
In this initial study, for the FBP, we used the Hann filter with a frequency scaling of 0.641. We selected these parameters based on the performance over the first 100 samples of the validation dataset. For the post-processing approach (FBP + U-Net), we used a U-Net-like architecture with 5 scales. We trained it on the proposed dataset by minimising the mean squared error loss with the Adam algorithm 44 for a maximum of 250 epochs with batch size 32. Additionally, we used an initial learning rate of 10^−3, decayed using cosine annealing until 10^−4. The model with the highest mean peak signal-to-noise ratio (PSNR) on the validation set was selected from the models obtained during training. Sample reconstructions are shown in Fig. 5. Table 1 reports the obtained results in terms of the peak signal-to-noise ratio (PSNR) and structural similarity 45 (SSIM) metrics (cf. "Evaluation practice" in the next section for a detailed explanation). As can be observed, the post-processing approach, which was trained using the proposed dataset, outperforms the classical FBP reconstructions by a margin of 5 dB. This demonstrates that the dataset indeed contains valuable data ready to be used for training machine learning methods to obtain CT reconstructions of higher quality than those of standard methods.

Usage Notes
Download & easy access. The whole LoDoPaB-CT dataset 39,41 can be downloaded directly from the Zenodo website. However, we recommend the Python library DIVα 46 (https://github.com/jleuschn/dival) for easy access to the dataset. The library includes specific functionalities for the interaction with the provided dataset.
Remark. Access to the dataset on Zenodo might be restricted or slow in some regions of the world. In this case please contact one of the corresponding authors to get an alternative download option.
DIVα is also available through the package index PyPI (https://pypi.org/project/dival). With the library, the dataset is automatically downloaded, checked for corruption and ready for use within two lines of Python code:

    from dival import get_standard_dataset
    dataset = get_standard_dataset('lodopab')
Remark. When loading the dataset using DIVα, an ODL 35 RayTransform implementing the forward operator is created. This requires a backend, the default being 'astra_cuda', which requires both the astra toolbox 36 and CUDA to be available. If either is unavailable, a different backend ('astra_cpu' or 'skimage') must be selected by the keyword argument impl.
In addition, DIVα offers multiple options to work with the LoDoPaB-CT dataset: • Access the train, validation and test subset and draw a specific number of samples.
• Sort the data by the patient ids.
• Use the pre-log or post-log data (cf. projection data generation in the "Methods" section).
• Evaluate the reconstruction performance.
• Compare with pre-trained standard reconstruction models.

Evaluation practice. Since ground truth data is provided in the dataset, we recommend using so-called full-reference methods for the evaluation. The peak signal-to-noise ratio (PSNR) and the structural similarity 45 (SSIM) are two standard image quality metrics often used in CT applications 42,47 . While the PSNR calculates pixel-wise intensity comparisons between ground truth and reconstruction, the SSIM captures structural distortions.
Peak signal-to-noise ratio. The PSNR expresses the ratio between the maximum possible image intensity and the distorting noise, measured by the mean squared error (MSE),

    PSNR(x̂, x) = 10 · log₁₀(L²/MSE(x̂, x)),   MSE(x̂, x) = (1/n) Σ_{i=1}^{n} (x̂_i − x_i)²,

where L is the maximum possible image value.

Structural similarity. The SSIM compares luminance, contrast and structure of reconstruction x̂ and ground truth x in local windows,

    SSIM(x̂, x) = (1/J) Σ_{j=1}^{J} [(2 μ̂_j μ_j + C_1)(2 Σ_j + C_2)] / [(μ̂_j² + μ_j² + C_1)(σ̂_j² + σ_j² + C_2)],

where μ̂_j and μ_j are the average pixel intensities, σ̂_j² and σ_j² the variances and Σ_j the covariance of x̂ and x at the j-th local window. Constants C_1 = (K_1 L)² and C_2 = (K_2 L)² stabilise the division. Following Wang et al. 45 , we choose K_1 = 0.01 and K_2 = 0.03 for the technical validation in this paper. The window size is 7 × 7 and L = max(x) − min(x).
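The PSNR, as parameterised above, can be computed with a few lines of NumPy (a sketch; for the SSIM, an implementation such as skimage.metrics.structural_similarity with win_size=7 and a matching data_range can be used instead of writing one by hand):

```python
import numpy as np

def psnr(reconstruction, ground_truth):
    """PSNR with data range L = max(x) - min(x) of the ground truth,
    as used for the technical validation."""
    gt = np.asarray(ground_truth, dtype=float)
    mse = np.mean((np.asarray(reconstruction, dtype=float) - gt) ** 2)
    if mse == 0:
        return float('inf')
    data_range = gt.max() - gt.min()
    return 10.0 * np.log10(data_range ** 2 / mse)
```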
Test & challenge set. The test data is the advised subset for offline model evaluation. To guarantee a fair comparison, the data should be in no way involved in the training process or hyperparameter selection of the model. We recommend using the whole test set and the above-mentioned parameter choices for PSNR and SSIM. Deviations from this setting should be mentioned.
In addition, a challenge set without ground truth images is provided. We encourage users to submit their challenge reconstructions to the evaluation website (https://lodopab.grand-challenge.org/). All methods are assessed under the same conditions and with the same metrics. The performance can be directly compared with other methods on a public leaderboard. Therefore, we recommend to report performance measures on the challenge set for publications that use the LoDoPaB-CT dataset without modifications, in addition to any evaluations on the test set. In accordance with the Biomedical Image Analysis (BIAS) guidelines 48 , more information about the challenge can be found on the aforementioned website.
Further usage. Scan scenarios. The provided measurements and simulation scripts can easily be modified to cover different scan scenarios:
• Limited-angle and sparse-angle problems can be created by loading a subset of the projection data; e.g. a sparser setup with 200 angles was already used by Baguer et al. 25 .
• Super-resolution experiments can be mimicked by artificially binning the projection data into larger pixels.
• To study lower or higher photon counts, the dataset can be re-simulated with a different value of N_0 (e.g. using resimulate_observations.py 26 and changing the value of PHOTONS_PER_PIXEL).
The provided reconstructions can still be used as ground truth for all listed scenarios.
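A sparse-angle subset, as in the first scenario above, can be obtained by slicing the sinogram along the angle axis. This is a sketch; the assumed full number of 1000 angles and the (angles, detector bins) axis order should be checked against the dataset documentation.

```python
import numpy as np

def sparse_angle_subset(observation, num_angles=200, full_num_angles=1000):
    """Select an equidistant subset of projection angles from a sinogram of
    shape (full_num_angles, num_detector_bins)."""
    idx = np.linspace(0, full_num_angles - 1, num_angles).astype(int)
    return np.asarray(observation)[idx]
```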
Transfer learning. Transfer learning is a popular approach to boost the performance of machine learning models on smaller datasets. The idea is to first train the model on a different, comprehensive data collection. Afterwards, the determined parameters are used as an initial guess for fine-tuning the model on the smaller one. In general, the goal is to learn to process low-level features, e.g. edges in images, from the comprehensive dataset. The adaption to specific high-level features is then performed on the smaller dataset. For imaging applications, the ImageNet database 49 , with over 14 million natural images, is frequently used in this role. The applications range from image classification 50 to other domains like audio data 51 . Transfer learning has also been successfully applied to CT reconstruction tasks. This includes training on different scan scenarios 52,53 , e.g. a different number of angles, as well as first training on 2D data and continuing on 3D data 54 . He et al. 55 simulated parallel beam measurements on some of the natural images contained in ImageNet. Subsequently, the training was continued on CT images from the Mayo Clinic 10 . LoDoPaB-CT, or parts of the dataset, can be used in similar roles for transfer learning. Additionally, the ground truth data from real thoracic CT scans may be advantageous for similar CT reconstruction tasks compared to random natural images from ImageNet 56 .
Nonetheless, we advise users to check the applicability for their specific use case and reconstruction model. Re-simulation or other changes to the LoDoPaB-CT dataset might be needed, especially for target datasets with different scan geometries. Additionally, simulated data cannot capture all aspects of real-world measurements and can therefore cause reconstruction errors. For a comprehensive study on the benefits and challenges of transfer learning for medical imaging, we refer the reader to the publication by Raghu et al. 56 .
Remark. An example of a simulation script with a fan beam geometry on the ground truth data can be found in the DIVα 46 library: dival/examples/ct_simulate_fan_beam_from_lodopab_ground_truth.py.
Limits of the dataset. The LoDoPaB-CT dataset is designed for a methodological comparison of CT reconstruction methods in a simulated low-dose parallel beam setting. The focus is on how a model deals with the challenges that arise from low photon count measurements to match the quality of normal-dose images. Of course, this represents only one aspect of many for the application in real-world scenarios. Therefore, results achieved on LoDoPaB-CT might not completely reflect the performance on real medical data. The following limits of the dataset should be considered when evaluating and comparing results:
• The simulation uses the Radon transform and Poisson noise. Real measurements can be influenced by additional physical effects, like scattering.
• Modern CT machines use advanced scanning geometries, like helical fan beam or cone beam. Specific reconstruction challenges can arise compared to parallel beam measurements (cf. Buzug 28 ).
• In general, the goal is to reconstruct a whole 3D subject and not just a single 2D slice. Reconstruction methods might benefit from additional spatial information. On the other hand, requirements on memory and compute power can be higher for methods that reconstruct 3D volumes directly.