The cutaneous microvasculature plays a pivotal role in skin physiology, regulating essential functions such as temperature control1, nutrient delivery2, and immune responses3. Its disruption is implicated in various dermatologic and systemic conditions, from psoriasis4 to diabetes mellitus5. Understanding the cutaneous microvasculature is crucial for elucidating the pathogenesis of these conditions and developing effective therapeutic strategies6. Optical coherence tomography angiography (OCTA) has emerged as a highly useful imaging modality in these applications, providing non-invasive, high-resolution visualization of microvascular structures7. By discerning subtle abnormalities, OCTA holds promise for advancing our understanding of cutaneous microvasculature dynamics and its implications in health and disease.

OCTA analyses temporal variations in the properties of backscattered light to detect blood flow, enabling visualisation of the tissue microvasculature8. Speckle patterns emerge from the interference of backscattered light from random scatterers in imaging systems such as optical coherence tomography (OCT)9. Moving scatterers, such as blood cells in microvessels, alter the scattering properties of the medium and thereby change the speckle pattern observed in OCT imaging. Consequently, the temporal dynamics of speckle patterns in OCT can be used to differentiate blood vessels from surrounding static tissues10. To analyse the variation of speckle with time, multiple OCT B-scan images are acquired at the same location. As speckle in static tissue is temporally invariant, the degree of change in speckle at each location is indicative of displacement within the tissue, and hence of blood flow. Techniques such as speckle variance11 and correlation12 analysis quantify these temporal changes. However, high noise levels and low signal-to-noise ratio (SNR) in OCTA data can significantly reduce accuracy in the detection of blood flow13,14. Increasing the number of B-scan acquisitions at each location improves SNR15 but also extends image acquisition time. Long image acquisition time increases the likelihood of motion artefacts16, particularly from cardiac or respiratory motion. A balance must therefore be struck between the number of repeated B-scans and the acquisition time to achieve optimal image quality and vessel delineation.

Variations in speckle are innately stochastic and noisy. In traditional speckle variance and decorrelation calculations, estimates of the change in speckle intensity are calculated using a sliding filter over the data, so that OCT pixels within a small, fixed-size region of the co-located B-scans contribute to the estimate. The focus on the sliding filter often fails to make optimal use of local information from neighbouring pixels for vessel detection. In contrast, deep learning effectively exploits the continuity of microvessel structures, using pixel adjacency to enhance detection accuracy17. Additionally, deep learning offers automatic feature learning and improved generalisation of pixel relationships, making it promising for OCTA reconstruction. Several studies have demonstrated that deep learning approaches present advantages over different OCTA statistical reconstruction methods including split-spectrum amplitude-decorrelation angiography, power intensity differential, speckle correlation methods18, and split-spectrum amplitude and phase-gradient angiography18,19,20. For example, research has shown that employing only two B-scans with deep learning yields better results than using four B-scans with traditional speckle decorrelation techniques18. Deep learning methods also have the potential to yield enhanced resolution, creating high-resolution angiograms from undersampled data21,22. This highlights the capability of deep learning to significantly improve both the quality and resolution of OCTA images, enhancing the overall efficiency of the reconstruction process.

While previous studies have shown the potential of deep learning-based models to enhance processing speed and improve vessel contrast in OCTA images, both traditional methods based on statistical analysis and previously reported deep learning-based methods typically focus solely on extracting blood flow information from individual 2D B-scans. These scans, which are essentially cross-sectional images (X–Z) perpendicular to the slow axis (Y) of OCT scanning, are acquired at different times at the same slow-axis position. However, restricting calculations to within a single block of co-located B-scans fails to fully exploit the contextual information from neighbouring scans along the adjacent slow axis positions. Blood vessels are fundamentally three-dimensional structures that extend beyond the narrow confines of individual B-scans, spanning multiple B-scans. These 2D methods overlook crucial neighbourhood information that could enhance the precision of image reconstruction. This leads to a lack of spatial context, especially along the slow axis. Consequently, despite their ability to incorporate temporal information within the 2D B-scan plane, these approaches may not fully utilise all available 3D spatial data, potentially compromising the accuracy of OCTA image reconstruction.

Initial attempts to incorporate this richer spatial context, such as employing Pseudo-3D neural networks23, only partially utilised 3D information, limiting their ability to fully exploit the vessel’s structure. Pseudo-3D methods, exemplified by NI-P3D-U23, integrate parallel 2D spatial and temporal convolutions within units such as the Pseudo-3D unit, a computationally lean approach to capturing spatial and temporal dependencies. While this approach has been shown to enhance the model’s capability to encode information across different dimensions, it may still introduce challenges such as oversmoothing of features, and it does not allow the neural network to fully learn 3D relationships. In contrast, 3D neural network (NN) models like 3D Unet incorporate true 3D convolutions, potentially capturing more intricate spatial features and dependencies within volumetric OCTA data. We hypothesise that a true 3D network architecture has greater potential to preserve fine details and boundaries in the reconstructed OCTA images. Additionally, 3D NN models naturally handle anisotropic data and efficiently utilise spatial context along all three dimensions, contributing to more accurate segmentation and reconstruction of OCTA images.

In our study, we develop a novel approach that fully leverages the 3D spatial features of OCTA data using 3D convolutional neural networks (CNNs). Through a comparative study on human skin datasets, we evaluated the performance of two different 3D CNN architectures: 3D Unet (3DU) and 3D Unet with squeeze excitation (SE) block (3DUS), in comparison to a 2D Unet (2DU). We compare the speckle decorrelation images generated by these neural networks, highlighting differences in their estimates of speckle decorrelation. This provides a more direct comparison than from the resulting OCTA images, which involve additional image processing that may obscure differences in the results. Our findings reveal that both 3DU and 3DUS architectures significantly outperformed the 2DU architecture, demonstrating improvements in accuracy. Specifically, the utilisation of 3D CNNs facilitated better preservation of fine details and boundaries in the reconstructed OCTA images. Furthermore, we observed that incorporating the SE block into the 3DUS architecture contributed to a more stable training process and reduced overfitting, akin to reports seen in other 2D neural networks24. However, the improvement in image accuracy from adding the SE block in 3D Unet was less pronounced than adding the SE block to the 2D Unet. This discrepancy can be attributed to the improvement in accuracy that already arises from progressing from a 2D Unet to a 3D Unet architecture, leaving less scope for further improvement from the SE block.

In conclusion, our study highlights the effectiveness of 3D CNN architectures in improving OCTA image reconstruction, specifically in capturing spatial details for microvasculature analysis. By exploiting the continuous 3D structure inherent in blood vessels, our method enhances image quality and accuracy while minimising the necessary number of co-located B-scans, thus potentially reducing image acquisition time, and increasing clinical applicability. This advancement shows promising potential for enhancing clinical practices and advancing the utilisation of OCTA to understand cutaneous microvascular dynamics.

Materials and methods

Comparative analysis of OCTA reconstruction techniques

A common computational approach in OCTA is to quantify the change in speckle intensity at M B-scan blocks acquired at adjacent OCT slow-axis positions, each block comprising N co-located B-scans, as illustrated in Fig. 1a. This technique is a fundamental component of a wide range of OCTA algorithms. While traditional techniques such as speckle decorrelation facilitate direct angiogram reconstruction from consecutive B-scans, deep learning-based methodologies rely on paired input-label data for training.

Figure 1

OCTA pipeline with deep learning: (a) Preparing instance-label pairs from B-scan blocks. (b, c) Training the neural network with the prepared pairs. This demonstrates the extraction of blood flow information from OCT signals, analysing M B-scan blocks with N consecutive OCT B-scans each to generate angiograms representing vascular structures. In (b), it emphasises the utilisation of paired input-label data of a single B-scan block and the corresponding labels. Conversely, our approach, as shown in (c), integrates neighbouring B-scan block information, thereby enhancing reconstruction accuracy.

The deep learning-based OCTA pipeline, inspired by the framework proposed by Liu et al.18, treats OCTA reconstruction as an end-to-end image translation task comprising three phases: training data preparation, model learning, and OCTA prediction. The training dataset consists of instance-label pairs \(\{ (I^{m} ,L^{m} )\}_{m = 1}^{M}\), where each pair comprises a multi-channel OCT structural image \(I \in R^{N \times H \times W}\) and its associated ground truth image \(L\). Here, each 'channel' within the OCT structural image is a B-scan, that is, one of the cross-sectional images acquired sequentially in time at a given location. The index m denotes the image pair, and N, H, and W denote the numbers of input channels, rows, and columns in the image, respectively. However, in this method, upon which previous deep learning methods were based, the label at each slow-axis location (\(y_{i}\)) was associated only with the B-scans acquired at the same slow-axis location (Fig. 1b). The neighbourhood B-scan information at \(y_{i-1}\) and \(y_{i+1}\) was not utilised.

Our approach incorporates neighbourhood information by utilising B-scan blocks from both the previous (\(y_{i-1}\)) and next (\(y_{i+1}\)) locations (Fig. 1c). Thus, the input consists of three B-scan blocks, each comprising N co-located B-scans, \(I \in R^{3 \times N \times H \times W}\). As shown in Fig. 1c, the leading dimension of size 3 indexes the slow-axis position and is treated as the channel, while the N B-scans within each B-scan block (channel) form the depth of the input, making the input spatially 3D. This approach enables the model to capture contextual information from neighbouring locations.
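The construction of this input can be illustrated with a minimal NumPy sketch. The function name `stack_neighbouring_blocks`, the array layout (M, N, H, W), and the choice to reject blocks at the edge of the volume are assumptions made for illustration; the handling of edge blocks is not specified in the text.

```python
import numpy as np

def stack_neighbouring_blocks(volume: np.ndarray, i: int) -> np.ndarray:
    """Return the blocks at slow-axis positions i-1, i, i+1.

    volume is assumed to have shape (M, N, H, W): M B-scan blocks,
    each holding N co-located B-scans of H x W pixels.
    The result has shape (3, N, H, W): channel x depth x rows x columns.
    """
    n_blocks = volume.shape[0]
    if not 1 <= i <= n_blocks - 2:
        # Assumption: edge blocks without two neighbours are skipped.
        raise IndexError("block i needs a neighbour on both sides")
    return volume[i - 1 : i + 2]

# Toy volume: 5 blocks, N = 2 B-scans per block, 8 x 6 pixels each.
M, N, H, W = 5, 2, 8, 6
volume = np.arange(M * N * H * W, dtype=np.float32).reshape(M, N, H, W)
sample = stack_neighbouring_blocks(volume, 2)
print(sample.shape)
```

In this layout the middle channel (`sample[1]`) holds the block whose label is being predicted, and the outer channels hold its neighbours along the slow axis.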

Experimental protocol and OCTA dataset preparation

The study utilised a spectral domain optical coherence tomography (SD-OCT) system (Telesto III, Thorlabs GmbH, Germany). This system was equipped with a laser source emitting at a central wavelength of 1300 nm, providing an axial resolution of 5 μm in tissue (assuming a refractive index of 1.43). Scans were conducted on the forearm and dorsum of the foot of human subjects using a custom spacer. This spacer was equipped with a heating element capable of locally elevating the skin temperature to 42 °C. Previous research has established that such heating induces vasodilation of the microvasculature, enhancing blood flow rate25. Two types of scans were acquired: baseline scans at room temperature (B) showing basal microvascular flow, and heated scans after 30 min of localised skin heating (H) showing fully vasodilated microvasculature.

Five healthy participants (one female and four males, age range 21–63 years) were enrolled with approval from The University of Western Australia’s Human Research Ethics Committee, adhering to the Declaration of Helsinki. Informed consent was obtained from all participants. Data acquisition utilised two field-of-view settings: 5 mm × 5 mm × 2.5 mm and 3.2 mm × 3.2 mm × 2.5 mm, with corresponding pixel dimensions of 1000 × 500 × 1024 and 1000 × 320 × 1024, respectively. OCT A-scans were obtained at a rate of 76 kHz. To enhance the SNR, 10 consecutive B-scans were acquired at each slow-axis position (N = 10) to establish the ground truth. Speckle decorrelation analysis was performed utilising a sliding window technique with a 6 × 6 (x × z) filter across each B-scan block, using all 10 co-located B-scans in each block to generate label data. Different methods can be used to quantify speckle decorrelation, including Pearson correlation12, L1-norm, L2-norm26 and motion contrast function13. In this study, we use the L1-norm, as described in our earlier work27,28. This involves calculating the average of the absolute values of intensity differences within the window. Training and validation datasets included baseline (B) and heated (H) samples, resulting in 3780 cross-sectional image pairs for training and 820 for validation. Preprocessing involved creating overlapped patches of input B-scans to enhance spatial information capture and model performance. Speckle decorrelation values were normalised to lie within the range [0,1].
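As a rough illustration of the label generation described above, the following NumPy sketch averages the absolute intensity differences between consecutive co-located B-scans within a 6 × 6 sliding window and then rescales the result to [0, 1]. The pairing of consecutive B-scans, the 'valid' window handling, and the min-max normalisation are assumptions; the exact formulation is given in the cited earlier work27,28 and may differ.

```python
import numpy as np

def box_mean(img: np.ndarray, win_z: int, win_x: int) -> np.ndarray:
    """Mean over every fully covered win_z x win_x window (integral image)."""
    c = np.cumsum(np.cumsum(img, axis=0), axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))  # zero row/column so window sums are differences
    s = (c[win_z:, win_x:] - c[:-win_z, win_x:]
         - c[win_z:, :-win_x] + c[:-win_z, :-win_x])
    return s / (win_z * win_x)

def l1_decorrelation(block: np.ndarray, win: int = 6) -> np.ndarray:
    """block: (N, H, W) intensities of N co-located B-scans."""
    # Absolute differences between consecutive B-scans: (N - 1, H, W).
    diffs = np.abs(np.diff(block.astype(np.float64), axis=0))
    # Average over B-scan pairs, then over the sliding window.
    d = box_mean(diffs.mean(axis=0), win, win)
    span = d.max() - d.min()
    # Assumed min-max normalisation to [0, 1].
    return (d - d.min()) / span if span > 0 else np.zeros_like(d)

# Toy block: static speckle everywhere, fluctuating intensity in a "vessel".
rng = np.random.default_rng(0)
static = rng.random((32, 32))
block = np.repeat(static[None], 10, axis=0)          # N = 10 identical B-scans
block[:, 10:20, 10:20] += rng.random((10, 10, 10))   # time-varying region
decorr = l1_decorrelation(block)
print(decorr.shape, decorr[0, 0], decorr[12, 12])
```

Static regions give values near zero because their speckle does not change between B-scans, whereas the fluctuating region gives high values.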

Model learning and neural networks’ structures

The learning set was divided into training and validation sets. Three encoder-decoder neural networks (NNs) were assessed in our model implementation: the 2D-Unet (2DU) (Fig. 2a), 3D-Unet (3DU) (Fig. 2b), and 3D-Unet with SE block (3DUS) (Fig. 2c). We chose the Unet encoder-decoder structure due to its established success in OCTA tasks, particularly OCT to OCTA translation19,29,30. The 2DU served as a baseline against which to measure the impact of incorporating 3D convolutions in the 3DU and its extended version, 3DUS. Unlike the 2DU, both 3DU and 3DUS utilise 3D convolutions to extract three-dimensional features from input data. The addition of the SE block was motivated by its effectiveness in improving image segmentation and classification tasks by explicitly modelling channel interdependencies, enabling enhanced feature recalibration24. Unlike the conventional speckle decorrelation algorithm, which weights local information within a sliding window equally, the NN architectures learn how to weight local information. During training, we minimised the mean squared error (MSE) loss between the generated speckle decorrelation map and the label image.
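The channel recalibration performed by an SE block can be sketched as a NumPy forward pass on a 3D feature map. This is a generic squeeze-and-excitation computation, not the authors' implementation: the channel count, the reduction ratio, and the random weights below are assumptions chosen only to make the example run.

```python
import numpy as np

def se_block_3d(x, w1, b1, w2, b2):
    """Squeeze-and-excitation on a feature map x of shape (C, D, H, W).

    Returns the recalibrated feature map and the per-channel weights.
    """
    squeezed = x.mean(axis=(1, 2, 3))                     # squeeze: (C,)
    hidden = np.maximum(w1 @ squeezed + b1, 0.0)          # ReLU bottleneck: (C // r,)
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))   # sigmoid gate: (C,)
    return x * weights[:, None, None, None], weights

# Hypothetical sizes: C = 8 channels, reduction ratio r = 4.
rng = np.random.default_rng(0)
C, r = 8, 4
x = rng.random((C, 4, 16, 16))
w1, b1 = rng.normal(size=(C // r, C)), np.zeros(C // r)
w2, b2 = rng.normal(size=(C, C // r)), np.zeros(C)
y, channel_weights = se_block_3d(x, w1, b1, w2, b2)
print(y.shape, channel_weights.round(3))
```

Each channel is multiplied by a single learned weight between 0 and 1, which is the quantity inspected when SE weights are compared across channels.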

Figure 2

Architectures of the NNs used: (a) 2D Unet (2DU), (b) 3D Unet (3DU), and (c) 3D Unet with SE Block (3DUS).

The training process involved forward propagation, loss function computation, and backpropagation, with model performance assessed using MSE on the validation set. We employed a batch size of 2 and trained for 20 epochs with an initial learning rate of 5e-4 using the Adam optimiser with a weight decay of 1e-5. To adjust the learning rate during training based on the monitored validation loss, we utilised a learning rate scheduler that reduced the learning rate by a factor of 0.1 if no improvement was observed for two consecutive epochs. This adaptive strategy aimed to improve convergence and stability. The 2D networks were run on NVIDIA GeForce RTX 3080 Ti GPUs using PyTorch, while the 3D networks utilised the Phoenix High Performance Computing (HPC) facilities at the University of Adelaide. For the 3D networks, the system included NVIDIA A100 GPUs (4 units), 72 CPU cores with hyperthreading disabled, and 515,000 MiB of available RAM.
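The learning rate schedule described above can be sketched in plain Python, independent of any framework. The function `plateau_schedule` and the validation losses in the example are hypothetical. Library schedulers may count "patience" with a different convention, so the parameter would need checking against the relevant documentation before reproducing this behaviour.

```python
def plateau_schedule(val_losses, lr=5e-4, factor=0.1, patience=2):
    """Return the learning rate in force after each epoch.

    The rate is multiplied by `factor` once the validation loss has failed
    to improve on its best value for `patience` consecutive epochs.
    """
    best = float("inf")
    stale = 0            # consecutive epochs without improvement
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor
                stale = 0
        history.append(lr)
    return history

# Hypothetical validation losses over seven epochs.
losses = [1.0, 0.8, 0.9, 0.85, 0.7, 0.75, 0.72]
for epoch, rate in enumerate(plateau_schedule(losses), start=1):
    print(f"epoch {epoch}: lr = {rate:.1e}")
```

In this example the rate drops after epochs 4 and 7, each of which is the second consecutive epoch without a new best validation loss.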

OCTA prediction and evaluation metrics

After preparing the datasets and training the models, we deployed the trained encoder-decoder NNs to predict OCTA images. OCT structural images from an unseen test dataset, each comprising three channels (the three adjacent B-scan blocks, each consisting of N B-scans), were fed through the networks to generate predicted OCTA images showing a measure of speckle decorrelation at each location. The test datasets were intentionally kept separate and unseen during the training phase to ensure an unbiased assessment of the model's generalisation capability.

Before evaluating the accuracy of the OCTA predictions using quantitative metrics such as the structural similarity index measure (SSIM)31 and peak signal-to-noise ratio (PSNR), several preprocessing steps were applied. Firstly, to focus the evaluation on the high-signal tissue region, typically close to the skin surface, the images were cropped to include only the area from the skin's surface down to approximately 750 µm (150 pixels) below the surface. Beyond this depth in the highly scattering skin, the SNR of the OCT data was found to degrade rapidly due to multiple scattering. Subsequently, both the NN predictions and the ground truth labels were normalised and scaled to the grayscale range [0, 255] (L = 255, where L represents the dynamic range of the pixel values) to facilitate fair comparison across different NNs and N input values. The PSNR was then computed from the MSE between these grayscale images.
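A minimal sketch of this PSNR computation is given below, using the standard definition PSNR = 10 log10(L² / MSE) with L = 255. The per-image min-max rescaling in `to_grayscale` is an assumption; the text states only that predictions and labels were normalised and scaled to the grayscale range.

```python
import numpy as np

def to_grayscale(img: np.ndarray, L: float = 255.0) -> np.ndarray:
    """Rescale an image to [0, L] (assumed per-image min-max scaling)."""
    img = img.astype(np.float64)
    span = img.max() - img.min()
    if span == 0:
        return np.zeros_like(img)
    return (img - img.min()) / span * L

def psnr(pred: np.ndarray, label: np.ndarray, L: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two rescaled images."""
    mse = np.mean((to_grayscale(pred, L) - to_grayscale(label, L)) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(L**2 / mse))

# Toy example: a ramp image and two copies with one perturbed pixel.
label = np.linspace(0.0, 1.0, 100).reshape(10, 10)
small_err = label.copy()
small_err[5, 5] += 0.01
large_err = label.copy()
large_err[5, 5] += 0.1
print(psnr(small_err, label), psnr(large_err, label))
```

A larger deviation from the label increases the MSE and therefore lowers the PSNR.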

For SSIM calculation, empirical parameters were set to the conventional values32,33 as follows: k1 = 0.01 and k2 = 0.03, where k1 and k2 are constants used to stabilise the division with a weak denominator. Additionally, a Gaussian window of size 11 × 11 was selected, and SSIM values were calculated within these patches. The SSIM values were then averaged over these patches for each pair of B-scan block input and label.
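The SSIM calculation with these parameters can be sketched in NumPy as follows. The Gaussian standard deviation of 1.5 pixels is an assumption (it is the value conventionally paired with an 11 × 11 window but is not stated here), as is the use of only fully covered window positions.

```python
import numpy as np

def gaussian_kernel(size: int = 11, sigma: float = 1.5) -> np.ndarray:
    """Normalised 1D Gaussian; applied along both axes it gives a size x size window."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax**2) / (2.0 * sigma**2))
    return g / g.sum()

def gaussian_filter_valid(img: np.ndarray, g: np.ndarray) -> np.ndarray:
    """Separable Gaussian filtering, keeping only fully covered pixels."""
    rows = np.apply_along_axis(lambda r: np.convolve(r, g, mode="valid"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, g, mode="valid"), 0, rows)

def ssim(x, y, L=255.0, k1=0.01, k2=0.03):
    """Return (mean SSIM, SSIM map) for two grayscale images in [0, L]."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    g = gaussian_kernel()
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2   # stabilising constants
    mu_x, mu_y = gaussian_filter_valid(x, g), gaussian_filter_valid(y, g)
    var_x = gaussian_filter_valid(x * x, g) - mu_x**2
    var_y = gaussian_filter_valid(y * y, g) - mu_y**2
    cov = gaussian_filter_valid(x * y, g) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)
    )
    return float(ssim_map.mean()), ssim_map

# Toy example: an image compared with itself and with a noisy copy.
rng = np.random.default_rng(1)
image = rng.random((32, 40)) * 255
noisy = np.clip(image + rng.normal(0, 40, image.shape), 0, 255)
print(ssim(image, image)[0], ssim(image, noisy)[0])
```

The SSIM map returned alongside the mean value is the per-location quantity that is averaged to give a single score for each prediction-label pair.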

Results and discussion

In this section, we investigate the effectiveness of using 3D Unet models with 3D convolutions, compared to 2D Unet models, to establish whether this can enhance the utilisation of local information and exploit the three-dimensional nature of features. Additionally, we explore how adjusting the number of input OCT B-scans within each B-scan block (notated by N) impacts the performance of the three neural networks: 2DU, 3DU, and 3DUS. Our analysis is based on validation datasets that include both baseline (unheated) and heated datasets.

Figure 3 shows the neural networks' loss curves at N = 2 and N = 4 during both the training and validation phases. Note that for the 2DU model the number of input channels equals N, whilst the 3D Unet models always have three channels and N instead sets the depth dimension (as depicted in Fig. 1).

Figure 3

The performance of 2DU, 3DU, and 3DUS NNs for N = 2 and N = 4 is depicted for training (a) and validation (b) datasets from epochs 1–20. Overfitting and convergence issues are observed particularly for the 2D NN.

In Fig. 3a, the training loss across epochs 1 to 20 is consistently lower for N = 4 than for N = 2. This indicates that the networks learn better with a larger number of input B-scans, implying a positive relationship between the input data volume and the networks' training performance. We also observe that using a 3D architecture led to a reduction in training loss compared to the 2D neural network. Furthermore, incorporating the SE Block into the 3D Unet architecture (3DUS) resulted in a further incremental improvement in the training loss.

In Fig. 3b, we examine the validation loss over the same epochs for both N = 2 and N = 4. We notice a distinct pattern for the 2D Unet model: it shows significant oscillations and struggles to converge smoothly. Comparing the training loss of the 2DU in Fig. 3a with the validation loss of 2DU in Fig. 3b, we find notable differences between them. This discrepancy suggests that the 2D Unet model may be overfitting, raising doubts about its ability to generalise to new, unseen data. In contrast, the 3D models exhibit fewer oscillations and smoother convergence. The difference between the validation and training losses is also smaller, indicating better generalisation capability. Additionally, when we add the SE block, the validation loss decreases further, and the model's training becomes more stable.

Overall, these observations suggest that transitioning from 2D to 3D models, along with incorporating the SE block, can improve the stability of training and enhance the model's ability to generalise to new data.

In this section, we conduct a comparative analysis between the three NNs: 2DU, 3DU, and 3DUS. Our assessment uses standard metrics PSNR and SSIM, which are computed between the outputs of the NNs and the ground truth label data. The evaluation encompasses test datasets comprising varying numbers of input OCT B-scans (N), including both baseline and heated datasets. The results are presented in Fig. 4a-d, where Fig. 4a and b depict SSIM values for the baseline and heated datasets, respectively, and Fig. 4c and d illustrate PSNR values for the same datasets.

Figure 4

Comparison of the performance of the 2D Unet (2DU, green circles), 3D Unet (3DU, blue triangles), and 3D Unet with SE block (3DUS, red triangles) using standard metrics (PSNR and SSIM) against ground truth labels. Evaluation is conducted across varying numbers of input OCT B-scans (N) from baseline and heated test datasets, with SSIM values presented in (a and b) and PSNR values in (c and d).

Our results demonstrate a consistent improvement in PSNR and SSIM values with increasing N for the heated sample (Fig. 4b and d). Similarly, the PSNR and SSIM values for the baseline test sample generally improved with increasing N for the three neural network architectures, except for the 2DU, where we observed slight fluctuations in the metrics (Fig. 4a and c). The baseline sample was also less sensitive to changes in N than the heated sample, which we attribute to its lower blood flow. These findings suggest that for baseline scans, where flow is limited, 2 B-scans per block (N = 2) with a 3D NN architecture gives superior performance to N = 4 with a 2D NN architecture. This improvement in performance may be due to multiple reasons, including the inclusion of 3D spatial context or improved data aggregation through the use of a 3D convolutional layer. Both of these mechanisms of improvement are enabled by the use of a 3D Unet that is better able to exploit the inherently 3D structure of blood vessels.

Moreover, we investigated integrating the SE block with 3D Unet. By incorporating the SE block into the 3D Unet, we aimed to dynamically recalibrate features, thereby enhancing stability and accuracy. These results indicate that combining 3D Unet with the SE block in imaging techniques provides an additional incremental improvement over simpler 3D architecture.

Figure 5 illustrates the output weights of the SE block, normalised to the maximum weight, for the 3DUS with 3 channels, each with a depth of 4 OCT images. We note that the middle channel (2nd channel), corresponding to the middle B-scan block (\(y_{i}\)), exhibits higher weights than the neighbouring B-scan blocks (1st and 3rd channels). This is to be expected, as the middle B-scan block contains the B-scans used for calculating the label using the traditional (non-neural network) speckle decorrelation method. Nevertheless, the adjacent blocks still carry substantial weight, consistent with the improved performance of both 3D NN architectures.

Figure 5

Output weights of the SE block normalised for the 3DUS model with three channels, each with a depth of four OCT images. The middle channel (2nd channel) exhibits higher weights, indicating its importance in generating the speckle decorrelation map. Neighbouring B-scan blocks also show significant influence in extracting useful information for prediction.

In this section, we explore the implications of transitioning from a 2D neural network (2DU) to a 3D neural network (3DU) on the generation of speckle decorrelation images in OCTA reconstruction. Figure 6 visually compares the reconstructed decorrelation images, with Fig. 6a depicting the first OCT B-scan from a series of ten consecutive B-scans obtained from the heated sample along the slow axis. Note that the images generated by the neural networks are speckle decorrelation images, without applying post-processing steps such as filtering and smoothing to reconstruct the typical OCTA images. By comparing the calculated speckle decorrelation images rather than OCTA images, we are able to more accurately identify subtle differences in the results.

Figure 6

(a) Initial OCT B-scan at an arbitrary slow axis location with minimal signal attenuation outlined by dashed pink lines. The Superficial and Deeper regions used for calculations shown in Fig. 8 are indicated by the blue and green areas. (b) Ground truth speckle decorrelation map. (c) Enlarged view of the region highlighted by a green rectangle in Fig. 6b. (d, e) Speckle decorrelation predictions by the 2DU and 3DU, respectively, for the highlighted region for N = 2. (f, g) Squared errors illustrating discrepancies between the ground truth in Fig. 6b and predictions from the 2DU and 3DU, as shown in Fig. 6d and e, respectively. (h, i) SSIM maps comparing the fidelity of 2DU and 3DU predictions to the label in Fig. 6b, highlighting regions with better structural similarity. Purple rectangle underscores the vascular region where the 3DU achieves greater accuracy compared to the 2DU.

A glass window was placed on top of the skin during imaging and a thin layer of glycerol was applied between the tissue and glass window. This has been found to reduce the occurrence of imaging artefacts caused by irregularities on the skin surface34. The glass window was angled at 4° to the imaging light beam to reduce parasitic reflections from the air-glass interface. The glycerol also reduces the refractive index mismatch at the tissue surface, allowing the OCT light to better couple into the skin and improving image quality.

The superficial 750 µm of tissue in Fig. 6a, delineated by dashed pink lines, represents the region with minimal OCT signal attenuation, which was used for the subsequent PSNR and SSIM metric computations. Regions below exhibited poor SNR due to depth-related signal attenuation, while regions above were outside of the tissue, and both were thus excluded from analysis. Figure 6b illustrates the speckle decorrelation map, serving as the ground truth, with bright regions indicating high speckle decorrelation and dark regions indicating low speckle decorrelation. The region highlighted by a green rectangle is magnified in Fig. 6c, enabling a more detailed view of the microvascular network. The speckle decorrelation images produced by the 2DU and 3DU for the magnified region for N = 2 are depicted in Fig. 6d and e, respectively. The squared error of the prediction with respect to the label is visualised in Fig. 6f and g for the 2DU and 3DU, respectively, where higher prediction error is shown in red. Notably, the 3DU prediction exhibits less error (fewer red regions) than the 2DU prediction, suggesting improved accuracy.

To comprehensively compare the 2DU and 3DU, SSIM maps of the NN predictions against the label were computed and are presented in Fig. 6h and i, respectively. A higher SSIM value, represented graphically by a whiter map, indicates better agreement between the predictions and the label. The SSIM map corresponding to the 3DU reveals larger white regions and greater brightness compared to that of the 2DU, indicating higher accuracy and fidelity in reconstruction. For clarity, an area containing a vessel is highlighted by purple rectangular boxes in the magnified views. Comparing the squared error (Fig. 6f and g) and SSIM (Fig. 6h and i) values in these highlighted regions reveals that the 3D neural network achieves greater accuracy in predicting blood vessels compared to the 2D neural network.

In summary, the transition from a 2D to a 3D neural network architecture enhances OCTA reconstruction accuracy. The incorporation of 3D convolutions in the 3DU leads to a reduction in prediction error, as indicated by the lower MSE values, and a corresponding improvement in image quality, demonstrated by the higher SSIM values. Higher MSE values indicate increased prediction error, signifying greater deviation of predicted images from ground truth images in terms of pixel values. Conversely, higher SSIM values suggest a closer alignment between predictions and labels, reflecting enhanced reconstruction precision and fidelity in the 3DU. Therefore, the observed improvements in both MSE and SSIM metrics indicate improved performance of the 3DU architecture in accurately reconstructing OCTA images.

In this section, we analyse how tissue depth affects the predictions of the NN for N = 2. Figure 7a and b display the SSIM and PSNR metrics, respectively, across planes that are parallel to the surface of the tissue. These planes have been defined on the heated OCT sample shown in Fig. 6. The increased blood flow from local vasodilation in the heated samples provides more opportunity to see quantitative differences between the neural network architectures.

Figure 7

Analysis of NN predictions for N = 2 across tissue depths. This figure depicts (a) SSIM and (b) PSNR metrics along planes parallel to the tissue surface, from the surface to 150 pixels below.

These measurements span from the tissue's surface down to 750 µm (equivalent to 150 pixels) below the surface, covering the extent of the high SNR region selected for analysis.

We observe that in deeper regions of tissue, characterised by lower SNR values, the PSNR decreases compared to the more superficial tissue. This decline in PSNR is primarily attributed to increasing noise levels as multiple scattering becomes more prevalent in the OCT signal. Other studies have corroborated this finding, demonstrating that as noise is increased in the images, the PSNR decreases correspondingly35.

This underscores the sensitivity of PSNR to noise levels, particularly impacting image quality in deeper tissue layers. In contrast, SSIM remains relatively stable near the surface, indicating preserved structural similarity between predicted and ground truth images. Studies have confirmed that PSNR is more responsive to additive Gaussian noise compared to SSIM36. However, as we explore deeper layers, structural differences in the tissue become more pronounced, leading to a gradual decrease in SSIM values. This suggests that while PSNR primarily reflects noise impact, SSIM considers the structural similarity between images. The distinct behaviours of PSNR and SSIM emphasise the importance of considering depth-related variations when assessing NN performance. Comparing the neural networks, it is evident that the PSNR and SSIM values of the 3D NNs (3DU and 3DUS) are higher than those of the 2D NN across all depths. This indicates superior performance in terms of both image quality and structural similarity across various depths. As with our earlier results in Fig. 4, we find that the inclusion of the SE block with the 3D Unet provides an incremental improvement in results.

In Figs. 6 and 7, we have delineated and labelled two distinct regions to analyse the impact of depth on neural network performance. The superficial region, extending from 60 to 90 pixels beneath the surface, is outlined in blue, while the deeper region, ranging from 120 to 150 pixels below the surface, is marked in green. Figure 8 shows en face (x × y) views of the speckle decorrelation values across these depths, generated as maximum intensity projections (MIPs). This approach allows the NNs' performance to be compared at depths with different noise levels.
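A maximum intensity projection over a depth slab can be sketched in a few lines. The example assumes a surface-flattened decorrelation volume indexed as (depth, x, y) and treats each slab as a half-open pixel range; whether the end pixel is included in the original analysis is not stated, so that choice is an assumption. The helper `slab_mip` and the random volume are hypothetical.

```python
import numpy as np


def slab_mip(volume, z_start, z_stop):
    """Maximum intensity projection over depths z_start <= z < z_stop.

    The volume is assumed to be indexed (z, x, y); the result is an
    en face (x, y) image.
    """
    volume = np.asarray(volume)
    if not 0 <= z_start < z_stop <= volume.shape[0]:
        raise ValueError("slab limits must lie inside the volume")
    return volume[z_start:z_stop].max(axis=0)


# Synthetic stand-in for a speckle decorrelation volume.
rng = np.random.default_rng(1)
decorrelation = rng.random((150, 64, 64))

superficial_mip = slab_mip(decorrelation, 60, 90)   # region outlined in blue
deep_mip = slab_mip(decorrelation, 120, 150)        # region outlined in green
```

Applying the same function to the label and to each network's prediction yields en face images that can be compared region by region.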

Figure 8

Maximum intensity projection (MIP) image of both superficial and deeper regions of the label (a, e) alongside predictions from 2DU, 3DU, and 3DUS models (b–d, f–h). Blue rectangles in the superficial region's MIP indicate areas magnified in the second row, while green rectangles in the deeper region's MIP correspond to magnified views in the fourth row. Blue, purple and green circles highlight differences, demonstrating the superior performance of 3D NNs over the 2D NN.

The first and third rows of Fig. 8 display en face (x × y) views of speckle decorrelation values in the superficial and deep regions, respectively. In each of the four subfigures in these rows, a zoomed inset is marked by a blue rectangle (superficial region) or a green rectangle (deep region). The magnified views of these insets are presented in the second and fourth rows, enabling a more detailed examination of the neural networks' performance in capturing the features of the microvascular network. The images reveal that the neural networks smooth the images, as is evident in the more uniform intensities. However, this smoothing, while reducing noise37,38, has also resulted in a loss of small features18. The degradation of high-frequency features can be attributed to the inherent smoothing nature of the local convolution operation that is a core component of CNNs. CNNs average spatial information over local neighbourhoods, effectively applying a low-pass spatial filter to the data. Our results suggest that the use of a 3D architecture allows this spatial filter to adapt more effectively to the data, better preserving these detailed features.
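The low-pass behaviour of local averaging can be illustrated with a fixed box kernel. This is only an intuition aid: the kernels in a trained CNN are learned and are not box filters, and the signals, kernel width and helper names below are hypothetical choices made for the example.

```python
import numpy as np


def box_filter_1d(signal, width):
    """Average each sample with its neighbours using a box kernel of odd width."""
    kernel = np.ones(width) / width
    return np.convolve(signal, kernel, mode="same")


def retained_amplitude(signal, width):
    """Fraction of the signal's amplitude that survives the box filter."""
    filtered = box_filter_1d(signal, width)
    core = slice(width, -width)  # ignore samples affected by the edges
    return filtered[core].std() / signal[core].std()


n = 256
x = np.arange(n)
broad = np.sin(2 * np.pi * 2 * x / n)    # slowly varying, like a large vessel
fine = np.sin(2 * np.pi * 40 * x / n)    # rapidly varying, like fine detail

width = 5
broad_kept = retained_amplitude(broad, width)
fine_kept = retained_amplitude(fine, width)
```

With these settings the slowly varying signal passes almost unchanged, while the rapidly varying one is strongly attenuated, which is the mechanism by which averaging removes both noise and small features.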

We also observe that in the superficial region, owing to lower noise levels, the neural network results are closer to the label. This corroborates the findings presented in Fig. 7, where the PSNR and SSIM values for the superficial region (outlined in blue) are higher than those for the deeper region (outlined in green). Additionally, in the deeper regions, where signal intensities are lower, the 3D neural networks aligned more closely with the label and exhibited fewer artificial smoothing effects. Specifically, in the areas outlined by purple, blue, and green circles in the deeper regions, the predictions of the 3D U-Net architectures represent the true speckle decorrelation values more accurately than those of the conventional 2DU. Furthermore, comparing the results of the 3DUS with the 3DU reinforces our earlier finding that incorporating the SE block slightly reduces smoothing errors and yields predictions that are incrementally closer to the label.

The method presented here incorporated both temporal information (repeated, co-located B-scans) and spatial information (adjacent B-scans). Earlier work by our team demonstrated that OCTA images may be extracted from OCT volumes in which the B-scans are not co-located, but instead closely spaced (1 µm spacing)39. Recent work by Le et al.40 demonstrated that a neural network trained on repeated, co-located B-scans is able to generate OCTA images when presented with adjacent, non-co-located B-scans. Part of the innovation in that work was to perform image registration between adjacent B-scans, as each vessel may change position between B-scan locations, either because of the vessel shape or because of movement artefacts. That work highlighted the potential gains of incorporating additional corrections to account for physiologically plausible changes and provided an avenue for future improvements in the use of a 3D CNN.

Conclusion

In conclusion, our study presents a novel approach utilising 3D convolutional neural networks to address the limitations of traditional speckle decorrelation computation and prior deep learning image reconstruction methods in optical coherence tomography angiography. Through a comparative analysis on human skin datasets, we demonstrated the superiority of 3D CNN architectures, particularly 3D Unet (3DU) and 3D Unet with a squeeze-and-excitation (SE) block (3DUS), over a 2D Unet (2DU) architecture. Both 3DU and 3DUS architectures outperformed 2DU, showing measurable improvements in accuracy and in the preservation of fine details and boundaries in reconstructed OCTA images.

Our findings underscore the importance of considering the 3D spatial context in OCTA image reconstruction. Furthermore, incorporating the SE block into the 3DUS architecture contributed to a more stable training process and reduced overfitting, further improving performance.

Overall, our study highlights the effectiveness of 3D CNN architectures in enhancing OCTA image reconstruction, offering promising potential for improving clinical practice and advancing our understanding of cutaneous microvascular dynamics, which are crucial for cardiovascular health and thermoregulation. By leveraging 3D spatial features, our method not only enhances image quality and accuracy but also reduces the number of B-scans required, thus increasing its clinical applicability.