Introduction

The placenta is the first organ to form and functions as the fetal lung, gut, kidney, endocrine, and immune systems. As an active participant in gestation, it consumes as much oxygen at term as the entire fetus [1]. Placental pathology causes and reflects adverse events in pregnancy [2, 3]. Pathology in the placenta can have lifelong consequences for mothers and offspring, including increased risk of cardiovascular disease [4], bronchopulmonary dysplasia [5], cerebral palsy [6], colorectal carcinoma [7], and asthma [8]. Therefore, the examination of the placenta can yield considerable benefit. Yet, <20% of placentas are examined in the United States, and significant lesions are frequently unrecognized [9, 10].

Digital pathology has the potential to revolutionize our understanding of placental function and disease [11]. Routine diagnostic pathology relies on qualitative assessment and pattern recognition. Research studies on human placentas usually rely on these assessments or quantitative measurements of selected regions done by hand. A more quantitative, thorough examination may identify new biology and pathophysiology. The sheer volume of archived glass slides of placentas, ~120,000 at our institution alone, with ~500,000 cells in each whole-slide image (WSI), provides an enormous untapped reservoir of material for hypothesis development and testing.

In comparison, clinical examination captures only a fraction of the information from each slide, and the quality is dependent on the examiner. Despite the accessibility of placentas at the time of birth, that information is discarded in most cases. Once an AI system is operating, increasing the scale or adding new populations or diseases is simple. This could include placentas from low-resource or international settings, patients with specific sociodemographic factors, or patients with emerging diseases of pregnancy, like COVID-19.

Changes over time

Over the course of the second- and third-trimesters, the placental disc increases approximately tenfold in size. The most significant microscopic changes are within the terminal villi, with increased numbers of small villi with decreased cellularity, increased stromal density, migration of capillaries to below the syncytial membrane, and collection of syncytiotrophoblast nuclei into knots. These changes have the overall effect of minimizing the distance between maternal and fetal blood [12,13,14]. In analogy with the lung, this results in maximum surface area with minimum diffusion distance for oxygen and nutrients (Fig. 1). Determination of the appropriateness of villous maturation is a key step in assessing a placenta. This task is daunting, as it involves the integration of the factors mentioned above across multiple slides to form a single gestalt. Accordingly, interobserver variability is high [15,16,17].

Fig. 1: Changes in terminal villi over gestation.

In the early 3rd-trimester (1, 3), syncytiotrophoblast (ST) nuclei are evenly spaced. Capillaries (C) are distant from maternal blood, which bathes the villi. The stroma consists of loose extracellular matrix proteins with frequent macrophages and fibroblasts (brown and pink stars). At term (2, 4), the villi are smaller. Syncytiotrophoblast nuclei are gathered into knots (K), thinning the vasculosyncytial membrane. Capillaries are directly beneath the syncytiotrophoblast layer. Stroma is denser with lower cellularity.

Gestational age (GA) is the single most important factor in perinatal well-being. The probability of a newborn successfully transitioning from womb to nursery to home increases markedly with GA, and the probability of adverse outcomes, including hypoxic-ischemic encephalopathy, necrotizing enterocolitis, and bronchopulmonary dysplasia, markedly decreases [18]. Accurate identification of GA most commonly relies on sonographic measurements made in the first- or second-trimester [19,20,21]. These measurements may not be available in low-resource settings or when prenatal care is inadequate. Other methods, such as the recalled date of the last menstrual period or sonographic measurements made in the third-trimester, are less accurate.

The placenta and digital pathology

Compared to neoplasia, the placenta is relatively understudied by digital pathology. Studies using photomicrographs of single fields and manual annotation show the potential for scientific discovery using deep, image-based phenotyping of the placenta. Manual measurement of villous and vascular surface area has shown changes over pregnancy [12, 13]. Preeclampsia (PreE) has been associated with changes in villous count, area, diameter, capillary count, and degree of capillarization in the villous core [14]. Gestational diabetes has been associated with decreased villous vascular volume [22]. Abnormal villous maturation has a genetic expression signature—placentas with a diagnosis of accelerated maturation have gene expression more appropriate for placentas delivered 4.7 weeks later with normal maturation [23].

More recent studies support the feasibility of applying modern machine learning and digital pathology techniques to the placenta. Studies have shown the ability to segment villi from scanned slides and measure their stromal density and vessel numbers [24, 25]. Published algorithms exist for identifying cytotrophoblast, fibroblast, macrophage, syncytiotrophoblast, and vascular endothelial cells in the placenta [26].

Deep learning models employing convolutional neural networks (CNN) have shown impressive performance for identifying image content in multiple domains and tasks, including digital pathology [27,28,29,30,31,32]. In training, networks commonly learn to associate a single image or HPF to an outcome or finding of interest. Contrary to CNN’s implicit assumption of one image corresponding to one label, a single WSI contains thousands of HPF with considerable heterogeneity. Practicing pathologists must examine all HPF, attend to fields they consider representative, and aggregate their findings to produce a single diagnosis. The gap between algorithm development and practice reduces the clinical relevance of many AI studies including those in the broader medical imaging field. We propose an algorithm that learns the patient outcome from a collection or set of images in training. This helps to incorporate more regions from each WSI during the learning procedure.

The problem of aggregation extends beyond digital pathology and is present whenever a model receives multiple inputs. Practitioners must decide at which stage of the pipeline data are incorporated, how they are weighted, and the extent to which aggregation is trainable. In non-image tasks, data are routinely input as a single vector allowing complex trainable interactions. Conversely, ensemble strategies may aggregate results from multiple separately trained models without back propagation. Choices in aggregation strategy are liable to be suboptimal if practitioners are unaware that a choice is being made.

This study aims to develop a deep learning model that incorporates and predicts across whole slides and demonstrates the utility of that model in the estimation of GA in placenta—a low concordance task in notoriously heterogeneous tissue.

Materials, subjects, and methods

Patients and materials

Pathology reports from patients delivering 1/1/2010 to 10/31/2019 were retrieved from the laboratory information system (Cerner Build List Id: 2014.08.1.36). GA, clinical history, and diagnoses, including accelerated, delayed, and appropriate maturation, were extracted using regular expressions (6.2) and the Natural Language Toolkit (NLTK, version 3.3) on Python (version 3.6.9) as described [33, 34].

We identified cases with an obstetrically determined GA of 24–42 weeks with an original pathologic diagnosis of appropriate villous maturation, confirmed through a review by a practicing perinatal pathologist (JAG). This GA was considered the ground truth for each case.

Clinical examination of placentas at our institution includes 1 cassette of membranes, 1 of umbilical cord sections, 1 with three incisional biopsies of the placental disc’s maternal surface (basal plate plus villi), 2 cassettes of the representative non-lesional full-thickness placental disc, and additional cassettes containing any lesions. The maternal surface biopsies and full-thickness sections, which are selected from the inner 2/3 of the radius of the placental disc, were reviewed for possible scanning. We selected a slide with morphology consistent with clinically determined GA without mass-forming lesions or villous abnormalities. Given low counts in the earliest GA, we allowed cases with decidual or chorionic plate pathology (e.g., chorioamnionitis).

One slide per patient with villous tissue, either basal villous wedges or full-thickness placental disc, was selected and scanned at the institutional Pathology Core Facility using a Hamamatsu Nanozoomer 2.0 HT scanner at ×20 objective magnification. 154 slides were split randomly, stratified by GA, into training, validation, and test sets with proportions of ~70% (107 slides), ~15% (23 slides), and ~15% (24 slides), respectively. Because deliveries are not evenly distributed across the GA and maturation anomalies are more prevalent at earlier GA, the training, validation, and test sets are not precisely balanced at each GA. A list of cases and corresponding GA is presented in Supplementary Table 1.

Regions of terminal villi with villous maturation consistent with GA were box annotated by the pathologist. Stem villi, areas of fibrin deposition, and septae were avoided. On full-thickness sections, parabasal areas were preferentially annotated. In total, 1918 region annotations (at least 10 per slide) were made. Regions were extracted with OpenSlide (1.1.1) on Python (3.6.9) and were color normalized using the method from Macenko et al. [35]. Regions were tiled into 512 × 512 pixel high-power fields (HPF) at ×20 magnification level and shrunk to 256 × 256 (effective magnification ×10), for a total of 26,555 HPF (Supplementary Table 1). During training, HPF are augmented by random rotations and changes in brightness and contrast [36].
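The tiling and shrinking steps can be sketched as follows. This is a minimal illustration rather than the study's code: the function names are hypothetical, partial tiles at the region edge are assumed to be discarded, and 2 × 2 block averaging stands in for whatever resampling method was actually used, since neither detail is stated above.

```python
def tile_origins(width, height, tile=512):
    """Top-left (x, y) corners of non-overlapping tiles that fit fully inside a region.

    Assumption: tiles that would run past the region edge are dropped.
    """
    return [(x, y)
            for y in range(0, height - tile + 1, tile)
            for x in range(0, width - tile + 1, tile)]


def downsample_2x(pixels):
    """Halve a square grayscale tile (even side length) by averaging 2 x 2 blocks.

    Assumption: block averaging is used here only as a simple stand-in.
    """
    n = len(pixels)
    return [[(pixels[r][c] + pixels[r][c + 1]
              + pixels[r + 1][c] + pixels[r + 1][c + 1]) / 4.0
             for c in range(0, n, 2)]
            for r in range(0, n, 2)]
```

Applied to a 512 × 512 tile, `downsample_2x` returns a 256 × 256 tile, which corresponds to halving the magnification from ×20 to an effective ×10.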

Baseline model

HPF are input into a feature extraction CNN based on VGG19 [30], with trainable weights initialized from a model pre-trained on ImageNet [37] in Keras (Tensorflow 2.3.0). The network is modified by replacing the fully connected layers in the original VGG19 architecture with a single fully connected layer of size 1024 with ReLU activation function and a dropout with a rate of 0.5. The extracted feature map is submitted to the representation learning sub-network, which consists of sequential fully connected layers of size 1024 and 256 with ReLU activation functions and a dropout with a rate of 0.5 after the first fully connected layer, and one linear node at the end to produce a single value, the estimated gestational age (EGA). The mean squared error loss between EGA and clinically determined GA (as the ground truth) is used to train the model. The baseline model was trained for 2000 epochs. To aggregate across a WSI for inference, the median EGA for all HPF is determined post hoc.
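The post hoc aggregation step of the baseline can be illustrated with a short sketch. The function name and the per-HPF values below are hypothetical and chosen only to show the behavior of the median.

```python
from statistics import mean, median


def slide_level_ega(hpf_estimates):
    """Collapse per-HPF estimates (in weeks) into one slide-level estimate via the median."""
    if not hpf_estimates:
        raise ValueError("need at least one HPF estimate")
    return median(hpf_estimates)


# Hypothetical per-HPF estimates for one slide; one field is far off.
estimates = [38.1, 37.6, 38.4, 29.0, 37.9]
slide_ega = slide_level_ega(estimates)   # median, robust to the outlying field
slide_mean = mean(estimates)             # mean, pulled toward the outlying field
```

The median is insensitive to a minority of divergent fields, but it is fixed rather than learned: every HPF counts equally, and nothing about the aggregation is adjusted during training.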

GestAltNet–input–glimpsing

In the baseline model, training explicitly links the clinical outcome to a single HPF. We propose an alternative network for estimating GA, GestAltNet (Figs. 2 and 3). GestAltNet learns in aggregate from a collection of images and relates the clinical outcome to a set of HPF during training. While the baseline model trains using a single HPF as input, GestAltNet uses a glimpse as input in training. Each glimpse consists of 16 randomly selected HPF from a single WSI, generally representing multiple regions. Glimpses are examined in batches of 64 and consumption of all batches represents one epoch. HPF and glimpses are resampled as needed to maintain glimpse and batch sizes. HPF are randomly assigned to glimpses at initialization and after every 50 epochs (chosen based on the performance in the validation set).
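Glimpse and batch formation can be sketched as below. This is an illustration under stated assumptions, not the study's implementation: the function names are hypothetical, and a short final glimpse or batch is assumed to be topped up by drawing at random from the same pool, which is one reading of "resampled as needed".

```python
import random


def make_glimpses(hpf_ids, glimpse_size=16, rng=None):
    """Shuffle one slide's HPF and cut them into glimpses of fixed size.

    Assumption: a short final glimpse is padded by resampling HPF from the same slide,
    so an HPF may occasionally appear twice in that glimpse.
    """
    rng = random.Random() if rng is None else rng
    ids = list(hpf_ids)
    if not ids:
        raise ValueError("slide has no HPF")
    rng.shuffle(ids)
    glimpses = [ids[i:i + glimpse_size] for i in range(0, len(ids), glimpse_size)]
    short = glimpse_size - len(glimpses[-1])
    if short:
        glimpses[-1] = glimpses[-1] + [rng.choice(ids) for _ in range(short)]
    return glimpses


def make_batches(glimpses, batch_size=64, rng=None):
    """Shuffle glimpses from all slides and group them into batches of fixed size.

    Assumption: a short final batch is padded by resampling glimpses.
    """
    rng = random.Random() if rng is None else rng
    pool = list(glimpses)
    if not pool:
        raise ValueError("no glimpses to batch")
    rng.shuffle(pool)
    batches = [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]
    short = batch_size - len(batches[-1])
    if short:
        batches[-1] = batches[-1] + [rng.choice(pool) for _ in range(short)]
    return batches
```

Calling `make_glimpses` again for every slide reproduces the periodic reassignment described above, so that over training each HPF is seen alongside different companions from the same WSI.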

Fig. 2: Glimpse and batch formation: Scanned whole-slide images are annotated and ROIs are extracted (left panel).

ROIs are tiled into HPFs (second panel, black lines). HPFs are randomly sampled without replacement across all ROIs of each patient to form glimpses (third panel; HPF shading indicates the corresponding glimpse). Glimpses are constant size (16) except the last glimpse (purple oval), which takes the remainder. Glimpses from one patient are distributed across batches (fourth panel; gray ovals are glimpses from other patients).

Fig. 3: Model pipeline: Glimpses are submitted as a batch to a convolutional neural network (purple shaded area).

Intermediate outputs (red boxes) are input to an attention sub-network. Feature maps (f1–fn) are weighted by their attention (a1–an) and aggregated via weighted averaging (oval). The representation learning sub-network estimates the gestational age (GA) based on the aggregated feature map f. The mean squared error, $(\widehat{GA} - GA)^2$ averaged over a batch of 64 glimpses, is used in backpropagation. The whole learning procedure is done in an end-to-end manner.

GestAltNet–pipeline–attention and aggregation

As in the baseline model, images are input into a VGG19 derived network. The intermediate output of VGG19 at block3, consisting of 256 3 × 3 kernels (Fig. 3, red squares), is input to the attention sub-network. This sub-network is a feedforward neural network with two fully connected layers of size 256, 256 with ReLU activation functions, a dropout with a rate of 0.5 after the first fully connected layer [38], and one linear node at the end. The linear node results in a single scalar value for each HPF in the glimpse, representing its attention. To limit extreme values, attentions are transformed using softmax.

A single aggregate feature map (f in Fig. 3) is obtained through weighted averaging over the feature maps of the 16 HPF within the glimpse, where weights are the corresponding HPF attentions. The aggregate feature map is submitted to the representation learning sub-network as in the baseline network to compute EGA. During training, mean squared error between EGA and clinically determined GA (ground truth) is used as the loss function, and backpropagation is performed end-to-end across the entire network. GestAltNet was trained for 500 epochs. For the whole-slide inference, the median EGA, computed across glimpses, was determined.
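The attention weighting and aggregation can be sketched as below. The sketch assumes that each HPF's feature map has been flattened to a plain list of numbers and that the attention sub-network has already produced one raw score per HPF; the function names are illustrative and the network itself is not reproduced.

```python
import math


def softmax(scores):
    """Convert raw attention scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(feature_vectors, scores):
    """Attention-weighted average of per-HPF feature vectors within one glimpse."""
    weights = softmax(scores)
    dim = len(feature_vectors[0])
    return [sum(w * f[d] for w, f in zip(weights, feature_vectors))
            for d in range(dim)]
```

With equal scores the result reduces to the plain mean of the feature vectors, and as one score grows the aggregate approaches that HPF's features alone. Because the softmax weights sum to 1, the weighted sum is a weighted average, and every step is differentiable, which is what allows the attention sub-network to be trained end-to-end with the rest of the model.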

Metrics

To assess the overall accuracy, we measured the coefficient of determination (r2) and the absolute error in weeks. For test and unannotated slides, EGA was calibrated using the linear regression of EGA vs. GA for validation regions and whole slides, respectively. We considered an absolute error of >3 weeks as clinically significant for three reasons: (1) accelerated villous maturation has been diagnosed based on an apparent GA of ≥37 weeks with a chronologic GA of ≤34 weeks, i.e., a difference of 3 weeks [39]; (2) a gene expression study showed that accelerated villous maturation equates to 4.7 weeks ahead of, and delayed maturation to 1.5 weeks behind, normal gestation (average 3.1 weeks) [23]; and (3) using the placental weight reference of Pinar et al. [40], a placenta of average weight at one GA is considered large or small for gestational age (LGA, SGA) 3–5 weeks earlier or later. For example, a placenta with the mean weight for 24 weeks, 189 grams, is considered LGA at 21 weeks (expected 114–172 grams) and SGA at 27 weeks (expected 192–305 grams).
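The calibration and metrics can be sketched as follows. Two points are assumptions rather than details given above: the calibration is taken to mean fitting EGA = slope × GA + intercept on validation data and inverting that line for new estimates, and r2 is computed as the squared Pearson correlation. Function names are illustrative.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def calibrate(ega, slope, intercept):
    """Invert the validation fit (assumed form: EGA = slope * GA + intercept)."""
    return (ega - intercept) / slope


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def mean_absolute_error(estimates, truths):
    """Mean of |estimate - truth|, in the same units as the inputs (weeks)."""
    return sum(abs(e - t) for e, t in zip(estimates, truths)) / len(truths)
```

In this reading, `fit_line` would be run once on validation GA and EGA, and the resulting slope and intercept applied through `calibrate` to test or unannotated-slide estimates before the absolute error is computed.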

Attention and whole-slide estimation of GA

For whole-slide level inference, we used 36 new slides that were neither previously annotated nor part of the training, validation, or test sets. The non-tissue area of the WSI was masked out by first applying Gaussian smoothing to the slide’s grayscale thumbnail, and then applying Otsu’s image binarization method to the thumbnail [41]. Attention was determined and GA was estimated on a per-HPF basis for all HPF. To determine appropriate attention thresholds for the selection of representative HPF in WSI level inference, we examined the per-HPF attention and accuracy over the non-overlapping HPF inside the tissue area of the WSI in our validation set. We set the lower threshold at the median attention of HPF with absolute errors of ≤3 weeks and the upper threshold at the 99th percentile of attention for HPF with absolute errors of ≤3 weeks in the validation set.
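The threshold selection can be sketched as below. The nearest-rank definition of the percentile and the inclusive bounds in the selection step are assumptions, as the text does not specify either, and the function names are illustrative.

```python
import math
from statistics import median


def percentile_nearest_rank(values, p):
    """p-th percentile by the nearest-rank method (an assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]


def attention_thresholds(attentions, abs_errors, max_error=3.0):
    """Lower and upper attention thresholds from validation HPF with error <= max_error weeks."""
    accurate = [a for a, e in zip(attentions, abs_errors) if e <= max_error]
    if not accurate:
        raise ValueError("no HPF within the error tolerance")
    return median(accurate), percentile_nearest_rank(accurate, 99)


def select_representative(attentions, lower, upper):
    """Indices of HPF whose attention lies within [lower, upper] (inclusive bounds assumed)."""
    return [i for i, a in enumerate(attentions) if lower <= a <= upper]
```

Note that the thresholds are derived only from accurate validation HPF but are then applied on attention alone, so an HPF on a new slide is kept or discarded without reference to its own estimate.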

For generating heat maps, 87.5% overlapping HPF were extracted, and attention and EGA values were produced on a per-HPF basis. Attention was colored with minimum and maximum values scaled based on variation in the validation set. EGA was colored as H&E (appearing pink at low power) for absolute error ≤3 weeks, red if >3 weeks high and blue if >3 weeks low.
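The overlap and the color rule can be made concrete with a short sketch. It assumes the heat-map HPF are the same 512-pixel tiles used elsewhere in the pipeline, which is not stated explicitly, and the function names and color labels are illustrative.

```python
def stride_for_overlap(tile, overlap):
    """Step in pixels between adjacent tiles for a given fractional overlap."""
    stride = tile * (1.0 - overlap)
    if stride < 1 or stride != int(stride):
        raise ValueError("overlap does not give a whole-pixel stride")
    return int(stride)


def ega_color(ega, ga, tolerance=3.0):
    """Heat-map color for one HPF estimate relative to the true GA (weeks)."""
    error = ega - ga
    if error > tolerance:
        return "red"    # estimate is more than 3 weeks too high
    if error < -tolerance:
        return "blue"   # estimate is more than 3 weeks too low
    return "H&E"        # within tolerance: leave the original stain visible
```

For a 512-pixel tile, 87.5% overlap corresponds to a stride of 64 pixels, so a pixel in the interior of the tissue is covered by up to 8 × 8 = 64 tiles.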

This study was approved by the institutional review board (STU00211333). WSI are available upon execution of a data use agreement.

Results

Interobserver variability

29,943 placentas were examined over 9.5 years by eight pathologists. Given a GA determined by clinical parameters, pathologists diagnose whether maturation is appropriate, accelerated, or delayed for the stated GA. Overall, 17,806 (60%) placentas were diagnosed with appropriate maturation, 5108 (17%) with accelerated maturation and 1024 (3.4%) with delayed maturation (Fig. 4). 6005 placentas (20%) received multiple diagnoses, for example, “appropriate for GA with regionally delayed maturation,” or had no description of maturation, which may occur when maturation is obscured by other findings like chorangiosis or post-mortem changes. The percentage of cases diagnosed as normal varied from 51 to 77%, as accelerated from 8.2 to 27%, and as delayed from 0.2 to 13%. Assuming a random distribution of placentas among pathologists, this represents significant interobserver variability.

Fig. 4: Interobserver variability in clinical diagnoses.

Despite well-defined patterns of maturation, pathologists are inconsistent in their diagnoses of whether the villous maturation is normal (green), accelerated (red), or delayed (yellow) for the stated gestational age. Each column represents one pathologist.

Deep learning model performance

In the test set, the GestAltNet and baseline models showed r2 of 0.9444 and 0.9220, respectively (Fig. 5a–b). After calibration, the mean absolute error (MAE) was 1.0847 weeks for the GestAltNet model and 1.4505 weeks for the baseline model. An error of >3 weeks is significant in evaluating GA. By this standard, both the GestAltNet and baseline models adequately estimated GA in 24/24 test cases (Fig. 5).

Fig. 5: Test results.

a In the test set, the baseline model shows an r2 of 0.9220 with a mean absolute error (MAE) of 1.4505 weeks. b GestAltNet shows an r2 of 0.9444 with an MAE of 1.0847 weeks.

Attention and estimation of GA across whole slides

The GestAltNet technique simulates a pathologist’s cognitive process of incorporating information across multiple regions of interest. However, it still relies on hand-annotated regions of interest selected to include representative, high-quality areas of tissue. To explore variation across tissue and emulate the pathologist’s attention and gestalt formation process across the whole slide, we obtained attention and EGA across 36 WSI that were unannotated and not part of the existing training, validation, or test sets. This resulted in an r2 of 0.8859 and an MAE of 1.3671 weeks. The model-estimated GA was within 3 weeks of the actual GA in 35/36 (97.22%) cases (Fig. 6). To illustrate and further examine how WSI attention and prediction relate, we generated whole-slide attention and predictions for one WSI using overlapping HPF (Fig. 7). Perhaps surprisingly, given that we did not train our model to discriminate between different regions of the placenta, terminal villi showed the highest attention, while stem villi, basal plate, and chorionic plate showed lower attention. GA estimation was variable within the villous region; however, the most accurate areas tended to be away from large stem villi or other masses. Some non-villous areas, including chorionic vessels, are attended to with divergent and inaccurate predictions.

Fig. 6: WSI level test results on non-annotated set: In this set of not previously seen slides, the model estimates GA with an r2 of 0.8859 and an MAE of 1.3671 weeks.

35 of 36 cases were called correctly within ±3 weeks (red lines).

Fig. 7: Example whole-slide attention (top left, detail—middle row) and prediction (top right, detail—bottom row).

Terminal villi are primarily high attention (yellow, regions 1, 5, and 6). Basal plate (left side of WSI and region 2), stem villi (region 3, intermixed with villous areas) and chorionic plate (right side of WSI and region 4) are generally low attention (purple). Estimated gestational age shows variegation with accurate areas (region 1) intermixed with areas with inaccurate low (blue, region 2) and high (red, region 3) estimates. Areas with low attention are disregarded (grayscale). The model is not explicitly trained to recognize tissue types and shows erroneous high attention to some areas. For example, one chorionic plate vessel (region 4) is part high- and part low-attention. The attended part of the vessel wall gives an estimate that misses low. Intravascular blood is attended and misses high.

Discussion

GA is the most significant factor in neonatal well-being. However, practicing pathologists rely on GA derived from other factors and show considerable inter-rater variability even in identifying whether the villous appearance is appropriate for the stated GA. We show that GA can be predicted with extraordinary accuracy from the beginning of viability (24 weeks) to post-term (42 weeks) using a deep learning approach. In practice, pathologists examine several regions across multiple whole slides, looking for different features that are either concordant or discordant with the chronological GA.

Developing a model for this task requires a solution to what we call “The Problem of Aggregation.” Our solution is to analyze multiple HPF in a glimpse. Aggregation occurs at the feature map stage. Feature maps are weighted based on the attention generated by an independent multilayer perceptron. The model takes the form of a single end-to-end network in which all sub-networks are trainable. We show that the integration of image features at an early stage with weighting and end-to-end trainability provides superior accuracy compared to post hoc averaging used in the baseline model. The improvement is highlighted by the stress test of calculating EGA without the regularization provided by human annotation.

One of the characteristics of deep learning algorithms that has made them so successful in digital pathology is their end-to-end learning approach. These adaptive algorithms learn to predict labels directly from pixel values in contrast to prior approaches that seek to incorporate a-priori knowledge in algorithm design. The unbiased end-to-end learning method is often credited as enabling deep learning models to learn latent predictive features in histology that may not be appreciated by human pathologists, but at the cost of algorithm interpretability.

End-to-end learning becomes practically difficult when labels correspond to an entire slide or a large region rather than a high-power field, due to the scale of data corresponding to a single label and the limitations of computer hardware used to train deep learning algorithms. In this scenario, end-to-end learning requires that the mechanism for aggregating over multiple fields be incorporated into the learning model and be adaptive. In applications like tumor detection, a single positive field gives the whole-slide label, and these have been addressed using approaches like multiple instance learning. Other applications may be more compositional, requiring the interpretation and weighting of several tissue patterns, or learning to perform a weighted averaging over regions of the slide.

This paper provides a solution involving exhaustive random sampling of HPF representing a single case with the differential weighting of HPF by attention. This strategy is broadly applicable to any scenario when large amounts of data are consumed for each sample. However, it is particularly relevant for image analysis, where the interpretation of one portion of the image depends on context from other portions. For example, a pedestrian waving to another pedestrian on the other side of a street is more likely to enter the street than one waving to a departing car. In pathology, injured liver adjacent to a liver tumor represents mass effect, not cirrhosis. GestAltNet assigns attention weights on a per-HPF basis. This reflects the variability in information content between HPF, even within human-annotated ROI. Within-image attention, for example Grad-CAM, has been proposed to address the problem of interpretability in AI [42]. Theoretically, our attention could be used in a similar fashion, analogous to the use of dotting pens in pathology practice to annotate key areas for diagnosis. Within-image attention has been criticized for focusing on edges or complex structures and using similar patterns of attention to explain correct and incorrect answers [43]. It is not clear that a by-HPF system, such as GestAltNet, is immune from this problem, and the observation that it assigns similar attention to correct, miss-high, and miss-low regions (Fig. 7) is concerning.

Our choice of a single end-to-end network is also appealing in that it reflects human cognition, and all operations are potentially trainable. This mimics human thought patterns of aggregating impressions rather than diagnoses. Features may also be a more worthy area of focus as they are representations of biological phenomena, while HPF are arbitrary grids imposed by computer memory limitations.

Other authors have addressed the aggregation problem in the placenta with success. Clymer et al. use the multiple-resolution pyramid of images found in scanned slide files to identify vessels within placental membranes followed by clustering to produce a slide-level diagnosis as either containing healthy or pathologic maternal vessels [44]. However, this study did not use end-to-end training.

The future

This is among the first studies using machine learning in placental pathology and demonstrates the potential of this field. The extremely high accuracy in detecting normal morphology across gestation will allow the classification of many abnormalities, some currently unknown or with too low interobserver reliability to be useful.

In high-resource settings, GA is usually determined by first-trimester ultrasound. The system demonstrated is unlikely to replace this method but could be useful in cases where the dating of the pregnancy is unclear, or there is a discrepancy between the stated and apparent GA. In low or middle-income settings, photomicrographs of relevant areas taken using a smartphone and adapter could be used in lieu of WSIs [45]. In this use-case of human-machine cooperation, the small size of captured images means that a cloud-based network could provide estimated GA in real-time.

Accelerated and delayed villous maturation are among the most commonly reported placental findings in large data sets [33]. Nonetheless, they show poor inter-rater reliability, decreasing the significance of these findings. AI could be used in a quality assurance/improvement paradigm to improve interobserver variability in practice and is likely useful in identifying maturation abnormalities.

Our solutions to the problem of aggregation, as used in GestAltNet, will have applications far beyond the placenta. Intratumoral heterogeneity complicates neoplasia classification and is a marker for adverse outcomes [46,47,48]. In other non-neoplastic diseases, such as idiopathic pulmonary fibrosis, heterogeneity itself may be a criterion [49]. Beyond digital pathology, attention and aggregation within large and complex images remain fundamental challenges of image analysis.

Limitations

From a generalizability standpoint, the most significant limitations of this work are the use of a single site with consistent protocols and a single pathologist reviewer. Further work is necessary to develop and demonstrate generalizability across institutions and practitioners. Our demonstration of interobserver variability is limited in that pathologists are not reviewing the same placenta, but rather placentas submitted more or less randomly from the same population. The remainder of this work suggests that human-machine collaboration to overcome this variability will be more productive than perseverating on the precise degree of heterogeneity.

Conclusion

In conclusion, we report the machine learning-based estimation of GA from scanned histologic slides of the placenta. This demonstrates the tractability of this system and may be useful in diagnostic, quality, and research settings. We present a novel aggregation and attention model to manage and utilize the vast quantity of data present in whole slides.