## Introduction

The fetal brain undergoes dramatic morphological and architectural changes within a short timeframe. Accurate understanding of key milestones in fetal brain maturation is critical for assessing the range of normal development and long-term cognitive outcomes1. Previous studies have established an approximate spatiotemporal timetable of healthy fetal brain development, outlining the progressive gyrification of the cerebral cortex starting in the mid-second trimester2,3,4,5. Depending on severity, deviations from this pattern have been associated with developmental delays, psychomotor retardation, and failure to thrive6. The link between gestational age and cortical folding lays the foundation for neuroimaging-derived age predictions.

A growing body of neuroscience research has managed to leverage multiple imaging modalities to accurately predict the “brain age” of individuals using machine learning7,8,9. These algorithms learn the relationship between neuroimaging features and corresponding ages, after which they are tested on unseen data. Assuming model accuracy, discrepancies between estimated brain age and actual chronological age might suggest developmental brain pathology10. However, most studies to date have focused primarily on degenerative diseases and trauma in adults11,12,13,14. Fetal brain-based age estimation remains a major research gap and holds profound implications for obstetric prenatal care, delivery planning, and postnatal outcomes9,15,16.

The current method of choice for evaluating fetal brain maturity involves initial ultrasonography (US) of the cerebral cortex17. However, US can be severely limited by technical challenges and patient factors including maternal obesity, suboptimal fetal positioning, and oligohydramnios18. In addition, US-guided gestational dating in the second and third trimesters can err by up to 2 and 4 weeks, respectively19. In utero MRI has emerged as an important adjunct to US, offering detailed resolution of cortical gyration and myelination20. Nevertheless, rapid and ongoing neurodevelopmental changes, low signal-to-noise ratio, poor tissue contrast, and geometric distortion of the small fetal brain embedded within maternal structures pose obstacles to fetal neuroimaging. Fetal motion is also random, spontaneous, and possible in all planes, rendering even fast single-shot sequences challenging21,22. Furthermore, fetal brain MRI protocols, imaging platforms, and operator experience differ widely across institutions, leading to inconsistency in image quality and interpretation23.

Deep learning algorithms offer a powerful means to solve complex tasks such as fetal age estimation from highly variable imaging data12,24,25,26. Recent efforts have employed deep learning techniques on fetal brain MRI to infer gestational age, achieving moderate to high prediction accuracies27,28. However, these studies do not demonstrate a large or diverse enough sample to claim sufficient robustness or scalability28,29. The performance of some of these convolutional neural networks (CNNs) also depends on manual brain segmentation, which can be time-intensive, poorly generalizable, and sensitive to artifacts, particularly in fetal imaging30. To address these problems, we proposed a self-attention framework to improve brain localization, along with the use of input images in multiple planes to maximize image diversity. We developed and tested several fully automated CNN architectures on a large heterogeneous single-center fetal MRI dataset. Finally, we tested the accuracy of age prediction when applied to data from several other centers of excellence in fetal imaging.

## Results

### Stanford cohort

A total of 741 T2-weighted MRI scans corresponding to unique patients (median gestational age 30.6 weeks, range 19–39 weeks) were included. Coefficient of determination (R2) and mean absolute error (MAE) for each model architecture tested are presented in Table 1. For each MRI plane, diminishing performance was seen with more than 3 input slices. Between the two age prediction approaches, averaging the outputs from the global branch and attention-guided local branch generated higher R2 scores and smaller MAE compared with predictions based on global images alone. The highest performing single-plane model was the attention-guided, 3-slice, coronal-view model with an R2 of 0.924 and corresponding MAE of 7.9 days.

Integrating information from the three planes achieved a notable improvement in model regression performance. A visualization of model regression performance is shown in Fig. 1. The concatenated multi-plane network produced the most accurate gestational age predictions out of all models tested, with the 3-slice architecture slightly outperforming the 1-slice model (R2 = 0.945 vs. 0.935; MAE = 6.7 vs. 7.3 days). The agreement between prediction and ground truth for this model was substantial based on Lin’s concordance correlation coefficient (ρc = 0.970; 95% CI 0.961–0.978). The modified Bland–Altman plot shows slight age overestimation up to about 34 weeks, after which the model progressively underestimates gestational age across quantile curves.

### External sites

The attention-guided, multi-plane ResNet-50 models trained on Stanford data were tested on external data obtained from four centers of excellence: Children’s Hospital of Los Angeles (CHLA), Cincinnati Children’s Hospital Medical Center (CCHMC), St. Joseph Hospital and Medical Center (SJH), and Tepecik Training and Research Hospital (TTRH). Without transfer learning, the 1-slice and 3-slice models achieved R2 of 0.690–0.861 and 0.523–0.857 and MAE of 9.2–16.0 days and 10.3–21.0 days, respectively. As shown in Table 2, both models demonstrated notable improvement after fine-tuning (ΔMAE = − 0.7 to − 4.1 days and − 0.5 to − 4.6 days). Combining all datasets, the 1-slice model achieved a higher Lin’s concordance correlation coefficient than the 3-slice model, but the difference was not significant (ρc = 0.920 [0.903–0.934] vs. 0.895 [0.874–0.913]). The most generalizable models were the fine-tuned 1-slice model for CHLA, SJH, and TTRH and the 3-slice model for CCHMC, with R2 of 0.81–0.90, MAE of 8.4–12.9 days, and moderate ρc of 0.90–0.94.

## Discussion

In this study, we present an end-to-end, automated deep learning architecture that accurately predicts gestational age from developmentally normal fetal brain MRI. Our highest-scoring model performed at R2 of 0.945 on the Stanford test set, comparable or superior to published child, adolescent, and adult brain age prediction CNNs8,10,24. Previous works in fetal brain-based age analysis using MRI have primarily been limited to the development of spatiotemporal atlases for comparative age estimation and morphological segmentation31,32,33. Importantly, these methods help characterize fetal brain development and normal variability within the population9. However, most studies are restricted to a relatively small database, narrow age range, or isolated anatomical region (e.g., cortex, ventricles, hippocampus)31,34,35,36. These limitations reduce the generalizability of age-specific templates and reveal an important gap in our understanding of normal fetal brain maturation.

Variability in imaging quality presents another significant challenge for assessing fetal development. Challenges to interpretation include the rapidly changing neurological features in utero as well as the technical complexity of imaging17,21. Fetal MRI is notoriously complicated by the low signal from small fetal organs and relatively noisy background due to spontaneous fetal motion and maternal soft tissues (see Supplementary Fig. S1)37,38. One study showed that a deep learning segmentation model achieves high Dice overlap scores (96.5%) on clean datasets but low performance on images with motion artifact or abnormal fetal orientation (78.8%)30. This discrepancy highlights the importance of leveraging heterogeneous datasets to train and fine-tune deep learning networks. Accordingly, we reviewed all normal fetal MRIs at Stanford from 2004 to 2017 and excluded images only if severe imaging artifacts rendered them nondiagnostic. Our database of 741 images thereby enabled us to capture broad within-institution imaging variability and outnumbers datasets previously used to develop spatiotemporal atlases9,31,32,33,39.

More recent deep learning methods have utilized attention guidance in conjunction with object segmentation to improve noise resiliency40,41. Shi et al.28 built an attention-based deep residual network based on 659 pre-segmented fetal brains, achieving R2 of 0.92 and MAE of 0.77 weeks. Their use of attention activation maps emphasized global and regional features, such as cerebral volume and sulcal contours, within pre-processed segmentations to enhance prediction accuracy. However, this staged deep learning approach relies on the careful delineation of fetal brain masks, a time-intensive process that the authors report as taking 30–40 min per sample. Since age regression depends on accurate object masking, external generalizability may be limited, as any fine-tuning would require manual segmentation by a trained researcher with domain knowledge. In contrast, we employ the attention mechanism to automatically focus on the fetal brain itself, achieving a higher signal-to-noise ratio by excluding unrelated features such as maternal organs and other fetal body parts and by reducing non-uniform MR intensity. Furthermore, both attention-guided masking and age regression are trained simultaneously and recursively, obviating the need for extensive pre-processing and fine-tuning. Our best-performing model was thereby computationally efficient and scalable, completing its regression task within 5 min on a GPU.

The real-world utility of any deep learning model largely depends on its generalization performance. For fetal MRI in particular, standard imaging protocols, quality of imaging, sequences used, and operator experience differ widely across institutions23. Performance losses incurred when transferring models from one institution to another have become a major concern in the machine learning field. In this study, we test multi-center generalizability of our automated deep learning network using a large external database spanning four centers of excellence, two countries, and a wide array of imaging platforms, scanner hardware, and acquisition parameters (Table 3). There were visible differences in image appearance when comparing datasets across different sites due to factors such as resolution, contrast, and signal-to-noise ratio (see Supplementary Fig. S2). Accordingly, our Stanford-trained multi-plane models yielded varying degrees of performance reduction on the external datasets. However, fine-tuning the model with just 20% of the external data enabled the network to adapt to the new cohort, highlighting its potential applicability across institutions and imaging platforms. Meaningful improvements in R2 score, MAE, and age concordance were achieved across institutions after fine-tuning and may continue to be observed using larger validation datasets.

Fetal MRI not only offers insight into prenatal development, but can also guide laboratory work-up, therapeutic interventions, counseling, and delivery planning23. At present, the reported date of last menstrual period and first-trimester US measurements are “gold standard” methods for determining gestational age19. However, inaccurate recall of the last menstrual period, confounding factors (e.g., irregular spotting or ovulation), and US variability in the second and third trimesters have propelled the need for alternative gestational dating approaches42. In our study, fetal brain MRI scans interpreted as normal based on expert consensus were used to develop a convolutional neural network that was highly predictive of gestational age, offering a potential solution for age estimation in the second half of gestation. Our end-to-end approach to assessing the fetal brain also obviates the need for manual feature engineering or segmentation, enabling real-time interpretation. Moving forward, this model may serve as a backbone for evaluating gestational age as well as deviations from normal development, such as underdevelopment, malformation, and other congenital diseases6,9. Furthermore, emerging deep learning techniques in image reconstruction43 offer promise for developing population-based spatiotemporal atlases to better characterize age-based fetal neuroanatomy.

There are several limitations to this study. As a 2D CNN, age predictions are made based on single-slice inputs, potentially limiting the information available to the network. A 3D CNN incorporating multi-slice imaging features may improve model performance but would require a much larger dataset and risk greater background noise. Our approach to enhancing regression accuracy involves Gaussian weighting of the attention heatmap, optimized for images centered on the fetal brain. Extreme position and size variability thereby reduces the accuracy of attention-guided mask inference but not necessarily regression performance, as shown in Supplementary Fig. S3. This may be explained by the inclusion of both local and global branches, incorporating semantic features from the emphasized subregion as well as the entire image, respectively. A drawback of this approach is that the global branch can still introduce unwanted background noise even when the localization procedure performs optimally.

Notably, beyond 34 weeks, our model appears to underestimate gestational age. This trend can be partially attributed to dataset imbalance, with few fetal MRIs performed in the late third trimester, biasing predictions toward younger gestational ages. US and MR imaging studies also indicate that peak gyrification occurs between weeks 29 and 35 and that most of the primary and secondary sulci along with all notable gyri have formed by weeks 34–374,18,44. A decreasing gyrification rate approaching full term may also skew age estimates, as fetal brains appear more homogeneous as they near maturity. Future work can extend the training set to include fetal MRI at age extremes and explore emerging methods such as feature distribution smoothing for imbalanced data with continuous labels45. In terms of generalizability, our model may also benefit from the inclusion of external data in the original training set to reduce over-fitting. Finally, a machine learning model is only as reliable as the quality of its input data. Long-term clinical and developmental outcomes for our cohort are unavailable, so scans used to train and test our model are only “normal” from a neuroanatomical perspective.

## Conclusion

Deep learning has emerged as a powerful approach for interpreting complex image features. We present an attention-guided, multi-view deep learning network that analyzes MRI-based features of the normally developing fetal brain to accurately predict gestational age. We further demonstrate model performance on external sites and the utility of fine-tuning the model for enhanced generalizability. This study identifies opportunities for imaging-driven analytics of in utero human neural development with potential to enhance diagnostic precision in the second and third trimesters.

## Materials and methods

### Stanford data collection and cohort description

We retrospectively reviewed all 1927 fetal brain MRIs performed at Stanford Lucile Packard Children’s Hospital from 2004 to 2017, as described in Supplementary Table S1. 1.5 T and 3 T MRI data were acquired with an 8-channel head coil on Signa HDxt, Signa EXCITE, Optima MR450W, and Discovery MR750W scanners (GE Healthcare). We excluded 572 images containing cerebral malformations, ventriculomegaly, or other acquired or congenital brain lesions, as well as 422 nondiagnostic images with severe motion artifacts or noise preventing adequate interpretation. In total, we compiled a database of 933 fetal brain MRIs interpreted as developmentally normal by expert pediatric neuroradiologists. MRI interpretations were based on visual features and biometry measurements such as brain biparietal diameter and skull occipitofrontal diameter. Of these, 741 studies had single-shot fast spin-echo T2-weighted sequences in all three planes (axial, coronal, and sagittal). The single-shot images, originally in DICOM format, were compressed to JPG files for visualization. The image slices near the middle of the sequence were pre-processed and augmented as the input. Slices were randomly cropped to 224 × 224 and normalized using sample mean and standard deviation. These data were randomly split into training (70%), validation (10%), and test (20%) sets for model input.
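As a rough illustration of the preprocessing described above, the following sketch (assuming slices are NumPy arrays; the helper names are ours, not from the study) crops a slice to 224 × 224, normalizes it with its own sample statistics, and draws a random 70/10/20 split:

```python
import numpy as np

def random_crop(img, size=224, rng=None):
    """Randomly crop a 2D slice to size x size, as used for model input."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = img.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

def normalize(img, eps=1e-8):
    """Normalize a slice using its own sample mean and standard deviation."""
    return (img - img.mean()) / (img.std() + eps)

def split_indices(n, rng=None):
    """Random 70/10/20 train/validation/test split of n samples."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = rng.permutation(n)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

For the 741-study cohort, such a split yields 518 training, 74 validation, and 149 test samples.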

This study was approved by Stanford University’s Institutional Review Board (IRB). Data collection and analysis were performed in accordance with relevant guidelines and regulations. Written informed consent was obtained from all pregnant women or authorized representatives for imaging of fetuses prior to delivery (IRB protocol #42137).

### Model structure

The model architecture consists of two parallel branches, the global and local branches, as shown in Fig. 2. Both the global and local branches consist of deep residual neural networks that are optimized to predict gestational age based on fetal MRI. ResNet-50, a CNN pre-trained on more than a million images from the 2012 ImageNet database, was used as the backbone deep neural network for age regression46. For each stack of input image slices, we assumed the middle slice to contain the largest fetal brain area. We then tested the effect of the number of image slice inputs on model performance (e.g., 1, 3, 5), incorporating additional slices immediately adjacent to the middle slice. The first convolutional layer of the ResNet-50 model was parameterized to accommodate different numbers of image slices with their corresponding input channels. Pretrained model weights were then applied to subsequent layers of the network. Given input image(s) X, the global branch is first trained using the entire or ‘global’ X. Then, the region of interest is masked using an attention mechanism with Gaussian weighting and trained for age regression on the local branch. Learned features from both branches simultaneously optimize final age prediction. Independent models were trained on axial, coronal, and sagittal images to study the unique semantic features from different planes.
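The text states only that the first convolutional layer was re-parameterized to accept different numbers of slice channels. One common heuristic for this (our assumption, not necessarily the authors' exact method) is to average the pretrained RGB filters and replicate them across the new input channels, rescaling to preserve activation magnitude:

```python
import numpy as np

def adapt_first_conv(pretrained_w, n_slices):
    """Adapt pretrained conv1 weights of shape (out, 3, kh, kw) to n_slices
    input channels by averaging over the RGB filters and replicating them.
    This is a common heuristic; the paper does not specify the exact scheme."""
    mean_w = pretrained_w.mean(axis=1, keepdims=True)  # (out, 1, kh, kw)
    new_w = np.repeat(mean_w, n_slices, axis=1)        # (out, n_slices, kh, kw)
    # Rescale so the expected activation magnitude is preserved.
    return new_w * (3.0 / n_slices)
```

With this rescaling the total weight mass is unchanged, so the downstream pretrained layers see inputs of a familiar scale.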

We compared two approaches for predicting gestational age ypred: global branch predictions (i.e., entire image) without the attention-guided local branch, versus averaged age predictions from both the global branch and local branch (i.e., masked region of interest). The true gestational age ytrue or ‘ground truth’ was determined via the standard-of-care approach of estimating the date of delivery based on an early obstetric ultrasound in the first trimester19. Gestational ages at time of US were recorded directly from the reports, and differences in MRI and US dates were added to obtain ytrue for each patient. In the training phase, the model is optimized by stochastic gradient descent with backpropagation to minimize the mean squared error (MSE) loss between true and predicted ages, $$\left\| {y_{true} - y_{pred} } \right\|_{2}^{2}$$47.
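The ground-truth computation described above amounts to simple date arithmetic: the first-trimester US establishes gestational age at a known date, and the interval between the US and the MRI is added. A minimal sketch (the function name and example dates are ours):

```python
from datetime import date

def gestational_age_at_mri(ga_at_us_weeks, us_date, mri_date):
    """Ground-truth gestational age at the MRI: the GA recorded at the
    first-trimester US plus the elapsed time to the MRI, in weeks."""
    elapsed_days = (mri_date - us_date).days
    return ga_at_us_weeks + elapsed_days / 7.0

# Hypothetical example: GA of 12.0 weeks at a US on 2015-03-01,
# MRI performed 133 days later on 2015-07-12.
ga = gestational_age_at_mri(12.0, date(2015, 3, 1), date(2015, 7, 12))  # 31.0
```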

Computational analysis of fetal MR imaging is extremely challenging due to the random position and rotation of fetal brains across patients. Additionally, noise unrelated to the fetus (such as the maternal placenta and organs) may negatively affect predictive performance. These considerations motivated the use of attention-guided mask inference, which provides spatially variant maps that highlight regions of interest and contribute to accurate object recognition48.

As previously described in Guan et al.49 and Zhou et al.50, the attention heatmap is extracted from the last convolutional layer in the global branch. Given an initial input image X representing the whole image slice, $$f_{k} \left( {x,y} \right)$$ represents the activation of spatial location $$\left( {x,y} \right)$$ in the kth channel of the output of the last convolutional layer, where $$k \in \left\{ {1, \ldots ,K} \right\}$$ and K is the total number of feature map channels ($$K = 512$$ in ResNet-18, $$K = 2048$$ in ResNet-50). The attention heatmap values $$H_{g}$$ are computed by taking the maximum absolute activation across channels:

$$H_{g} \left( {x,y} \right) = \mathop {\max }\limits_{k} \left| {f_{k} \left( {x,y} \right)} \right|, \quad k \in \left\{ {1, \ldots ,K} \right\}$$

After up-sampling $$H_{g}$$ to match the resolution of the input images, we apply the truncated ReLU activation function to normalize the heatmap $$H_{g}$$ to the range [0, 1], where larger values represent a higher probability of detecting fetal brain tissue. High-value areas are subsequently given more attention by the prediction model. Furthermore, with the prior knowledge that the fetal brain usually lies near the center of the image, we multiply the heatmap by a 2D Gaussian mask to re-weight it. The resulting heatmaps highlight the region of interest (i.e., fetal brain). Examples of heatmaps are shown in Fig. 3.
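The heatmap pipeline described above (channel-wise absolute maximum, truncated-ReLU normalization to [0, 1], and Gaussian center re-weighting) can be sketched in NumPy as follows. The Gaussian width and the truncation threshold are our assumptions; the paper does not specify them:

```python
import numpy as np

def attention_heatmap(feature_maps):
    """H_g(x, y) = max_k |f_k(x, y)| over the K feature-map channels.
    feature_maps has shape (K, H, W)."""
    return np.abs(feature_maps).max(axis=0)

def truncated_relu_normalize(h, tau=None):
    """Clip at a threshold, then scale to [0, 1]; one plausible reading of
    the 'truncated ReLU' normalization in the text."""
    if tau is None:
        tau = h.max()
    return np.clip(h, 0.0, tau) / (tau + 1e-8)

def gaussian_mask(shape, sigma_frac=0.25):
    """Centered 2D Gaussian prior encoding the assumption that the fetal
    brain tends to lie near the image center (sigma_frac is assumed)."""
    h, w = shape
    y = np.arange(h) - (h - 1) / 2.0
    x = np.arange(w) - (w - 1) / 2.0
    yy, xx = np.meshgrid(y, x, indexing="ij")
    sy, sx = sigma_frac * h, sigma_frac * w
    return np.exp(-(yy**2 / (2 * sy**2) + xx**2 / (2 * sx**2)))
```

In use, `truncated_relu_normalize(attention_heatmap(f)) * gaussian_mask(f.shape[1:])` (after up-sampling to the input resolution) yields the re-weighted heatmap applied to the input image.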

Heatmap weights are multiplied with the input image to obtain a masked region of the fetal brain, suppressing background noise in the original scan. The re-weighted image is then fed to the local branch for age prediction based on regional features. Since we automatically extract the heatmap from the global branch and the normalization operations are differentiable, the entire model framework can be trained end to end for adaptive attention map weighting and brain age estimation.

### Multi-plane learning approach

A multi-plane learning approach was employed to capitalize on complementary information contained in different MRI dimensions. Separately from the single-plane architectures, we trained a multi-plane model by minimizing the total MSE loss involving axial, coronal, and sagittal planes. Network weights are thereby optimized based on features from all MRI views simultaneously. After convergence, prediction outputs from each plane are then averaged for a final estimation of gestational age.
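Under the setup described above, the multi-plane objective is simply the sum of per-plane MSE losses, and the final estimate averages the three per-plane outputs. A minimal sketch (function names are ours):

```python
import numpy as np

def multiplane_loss(preds_by_plane, y_true):
    """Total MSE loss summed over the axial, coronal, and sagittal
    branch predictions, each an array of per-sample GA estimates."""
    return sum(np.mean((p - y_true) ** 2) for p in preds_by_plane)

def multiplane_predict(preds_by_plane):
    """Final gestational age estimate: average of the per-plane outputs."""
    return np.mean(np.stack(preds_by_plane), axis=0)
```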

### Training and evaluation

All network architectures were implemented with the PyTorch framework51. We trained the models using the Adam optimizer with a learning rate of 1 × 10⁻⁴ and a batch size of 50 for 2000 iterations. Training was conducted on an NVIDIA TITAN Xp GPU. High-scoring models were defined as those with strong correlation and concordance between true and predicted gestational age. Correlative strength was evaluated for all models trained and tested on Stanford fetal imaging data using R2 and MAE. Concordance between predicted and true gestational ages was determined using Lin’s concordance correlation coefficient, with strength of agreement assessed by McBride’s criteria as follows: poor, < 0.90; moderate, 0.90–0.95; substantial, 0.95–0.99; almost perfect, > 0.9952,53. Statistical results were visually confirmed by local piecewise regression analysis using a window size of 15 points, 95% overlap between windows, and Gaussian smoothing54.
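The three evaluation metrics used throughout (MAE, R2, and Lin's concordance correlation coefficient) follow standard formulas and can each be computed in a few lines; the population form of Lin's ρc is shown:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the labels (days here)."""
    return np.mean(np.abs(y_true - y_pred))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def lins_ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient (population form),
    penalizing both poor correlation and systematic bias."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return 2.0 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)
```

Unlike Pearson's r, Lin's ρc drops below 1 for predictions that are perfectly correlated but shifted or scaled, which is why it is used here to grade agreement against McBride's criteria.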

### Validation with external sites

External MRI data were obtained from four additional centers of excellence: Children’s Hospital of Los Angeles, Cincinnati Children’s Hospital Medical Center, St. Joseph Hospital and Medical Center, and Tepecik Training and Research Hospital in İzmir, Turkey. MR imaging across sites varied widely in terms of scanning platform, sequence types, and technical settings, as shown in Table 3. To test generalizability, the attention-guided multi-plane model (i.e., the highest-scoring network tested on Stanford data) was used, and the 1-slice and 3-slice architectures were compared across external institutions. After deploying the same data curation methods used for Stanford data, the external datasets consisted of 156, 64, 25, and 189 fetal MRI samples for CHLA, CCHMC, SJH, and TTRH, respectively (Supplementary Fig. S2). The Stanford-trained model was first tested directly on these unseen external samples without any transfer learning. We then fine-tuned the model with 20% of each dataset using the Adam optimizer with a learning rate of 1 × 10⁻⁵ and a batch size of 5. For SJH, we used a learning rate of 1 × 10⁻⁶, as only 5 samples were available for fine-tuning. We employed early stopping at 5 epochs to avoid overfitting. Performance with and without fine-tuning was compared on the remaining 80% of each dataset.
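"Early stopping at 5 epochs" is stated without further detail; one plausible reading (our assumption) is a patience of 5 epochs without validation improvement, which a small helper can implement:

```python
class EarlyStopper:
    """Stop fine-tuning once validation loss has failed to improve for
    `patience` consecutive epochs. This patience-based reading of
    'early stopping at 5 epochs' is an assumption on our part."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a fine-tuning loop, `if stopper.step(val_loss): break` after each validation pass halts training before the small external splits are overfit.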