Introduction

Mammalian embryo polarization is the process by which all individual cells of the embryo establish an apical domain on the cell–cell contact-free surface. In the mouse embryo, this process occurs at the late 8-cell stage, on the third day of development after fertilization, (Fig. 1a) and in humans on the fourth day at the 8–16 cell stage1,2,3,4,5,6,7. The apical domain is composed of the PAR complex and ERM proteins (Ezrin, Radixin, Moesin), enclosed by an actomyosin ring2,8,9,10,11,12. The cells which inherit this apical domain after division will become specified as trophectoderm (TE), which ultimately forms the placenta. In contrast, those cells that do not inherit the apical domain will form the inner cell mass (ICM), which will give rise to all fetal tissues and the yolk sac4,5,6. Thus, embryo polarization provides the first critical bifurcation of cell fates in the mammalian embryo, and establishment of cell lineages in the blastocyst, which is crucial for implantation and a successful pregnancy. In agreement with this, preventing cell polarization of the mouse and human embryo prevents its successful development3,4,11,13.

Figure 1
figure 1

Method to track and annotate polarity. (a) Overview of mouse pre-implantation development, from the zygote stage at embryonic day 0 to the late blastocyst stage at embryonic day 4.5. At the late 8-cell stage, polarization takes place, as each blastomere gains a defined apical-basal axis of polarity indicated by the presence of an apical domain (red). (b) Data preprocessing of dual-channel 3D mouse embryo videos, each of which is a 5D tensor with the dimension of x, y, z, c (channel), and t (time). First, each video was split into a fluorescence (Ezrin-RFP) and DIC channel, visualized in red and gray respectively. Then, each channel was compressed along the z dimension by different algorithms. The maximum intensity z-projection algorithm was applied for the fluorescence channel and DTCWT based AIF algorithm for the DIC channel to get the frame sequences. (c) Expert annotation on fluorescence frame sequences, where the time point of polarity onset is pinpointed. In the time sequence, the onset of polarization was defined as the frame in which the blastomere had a clear polarity ring or cap (closed) which took up at least 1/3 of the visible surface, or 1/3 of the cell surface curve if displayed side-on. Frames before this point were defined as before-onset, whilst frames including and after this point are defined as after-onset. (d) Supervised learning of a single DCNN model. The DIC frame sequences paired with the class labels from fluorescence annotation were permuted and used as the input and target of the supervised learning. Transfer learning from pre-trained weights on ImageNet database and data augmentation are utilized in the training of all DCNN models. Scale bar = 30 μm.

Given the importance of polarization, an ability to detect this developmental feature non-invasively would be beneficial, for example, for the screening of viable human embryos for implantation. However, all current methods for detecting polarization are invasive as they rely on modifying embryos to express fluorescently tagged proteins that mark the apical domains14,15. Such fluorescent tagging of human embryos meant for implantation is impermissible, which currently prevents clinical embryologists from utilizing polarization to evaluate the quality of human embryos for transfer to mothers-to-be.

Tracking polarization without the use of fluorescence could be solved using deep learning, which is able to discern salient features that may be unintuitive for humans16,17. Indeed, deep learning has been recently used successfully to automate detection of an embryo’s morphological features and applied on single time-point images to assess implantation potential of human embryos16,17,18,19,20,21. These deep learning approaches either provide a means of accurately counting cell numbers with developmental time16,17; relate embryo morphology to subjective developmental criteria assessed by embryologists18,19; or provide independent assessment of morphological features19,20. One study has related preimplantation morphology with successful development of fetal heartbeat21. However, the morphological features being assessed by the deep learning algorithms used to date are generally not clear. In addition, these current approaches do not rely upon known critical developmental milestones in the molecular cell biology of preimplantation development. Here, we have used time lapse movies of fluorescent markers of polarization in the developing mouse embryo to train a deep learning system to recognize the polarization events in the corresponding bright field movie frames with a high degree of success. This is the first time that deep learning has been applied to recognize a specific molecular cell biological process in an embryo that is key for developmental success.

Results

Collection and annotation of embryo images

In order to develop our deep learning model for detecting the polarization status of live and unstained embryos, we first required a large dataset of DIC embryo frames for which the polarization is unambiguously evident. Generating this dataset required each DIC image to have a corresponding channel that uses fluorescently tagged proteins to indicate polarization for each embryo clearly. The polarization of a single blastomere in the embryo can be determined by the localization of apical proteins, which are enclosed by an actomyosin ring9,10,11,22. We built a large dataset composed of synchronized DIC and fluorescence channels of mouse embryos during polarization by collecting mouse embryo time-lapse recordings, each containing a fluorescence channel to indicate embryo polarization, and a DIC channel for model training and testing (Fig. 1b). For time-lapse recordings, embryos were injected at the 2-cell stage with synthetic mRNA for Ezrin tagged with red fluorescence protein (RFP), as previously23, and cultured in vitro to the 16-cell stage. We used Ezrin as a marker for blastomere polarization, as Ezrin localizes to the apical surface during the formation of an apical polarity domain12,24. Using the Ezrin-RFP fluorescent channel, we determined the time point at which the first blastomere of the embryo polarized for each time-lapse recording, indicated by formation of a clear apical polarity cap (Fig. 1c, Supplementary Fig. 1). Using this annotation, each DIC frame was labelled as either before or after the onset of polarization (Fig. 1d). In total, we produced a dataset containing 89 embryo time-lapse recordings of the 8-cell stage embryo during their polarization.

Compression of 3D embryo image sequences

In previous studies, a single slice image along the z axis was used for model input16,17,18,19,20,21 due to the use of existing deep learning models designed for a two-dimensional (2D) image input. However, a single z-slice image does not capture 3D embryo structural information. Analysis of a 3D image stack with deep learning requires a re-designed model architecture that dramatically increases the complexity and time required for model development25,26. Moreover, adapting existing pre-trained deep learning networks for 3D analysis through transfer learning27 would not be straightforward as these networks are predominantly designed for 2D image recognition tasks. To resolve this problem, we utilized a state-of-the-art all-in-focus (AIF) algorithm based on dual-tree complex wavelet transform (DTCWT)28 to compress the optically sectioned z stack of each DIC frame in our dataset. The result was a single 2D AIF DIC image capturing the majority of relevant but sparsely distributed 3D embryo information at each time point (Fig. 1b).

We found that AIF images based on DTCWT could reveal all blastomeres of a 3D embryo in a single 2D image (Supplementary Fig. 2). In contrast, the median z slice typically contained several blastomeres that were optically out of focus, resulting in lost information. AIF images also resembled standard images, allowing for straightforward transfer learning using open-source 2D image classification models pre-trained on ImageNet29 as initialization.

Model architecture

The dataset consisting of AIF DIC images paired with corresponding annotated polarization labels was randomly split into a training cohort of 70 embryos (1889 frames) and a testing cohort of 19 embryos (583 frames) (Fig. 2a). These were used as learning and evaluation datasets, respectively, for a single deep convolutional neural network (DCNN) binary classification model. For supervised learning of DCNN models, we retained only information about whether a frame was before or after onset and stripped away other time information (Figs. 1d, 2b). On individual testing frames, each DCNN model outputs whether or not polarization was detected as a vector containing two probabilities—one for each class (before or after onset, Fig. 2b). To mitigate over-fitting, we ensembled six DCNN models trained using different initializations and different optimizers but trained over the same number of epochs. The final polarization status prediction for a single input image is the class (before or after onset) having the highest average probability across all six contributing models. Overall, our model accuracy increased from an average of 82.6% for a single DCNN to 85.2% with ensemble learning.

Figure 2
figure 2

An ensemble deep learning approach to predict embryo polarization from DIC images. (a) Class distribution in the training/testing/whole dataset. (b) Ensemble learning on six DCNN models. The predicted probability vectors for two classes on a single testing frame by six DCNN models were averaged element-wisely and the class corresponding to the larger probability was used as the final predicted label. (c) Temporal smoothing on the predicted labels for each testing embryo’s DIC frame sequence. The majority voting based smoothing window slid over the chronologically ordered binary labels. The window length is 3 and we kept the label at both ends untouched. Finally, the time index of first after-onset prediction was taken as the final prediction of polarity onset time point. Scale bar = 20 μm.

Ensemble deep learning model outperforms human volunteers

We recruited six volunteers following the criteria outlined in the Methods section, to compare polarization detection accuracy against our model. We aimed to recruit human volunteers from a STEM background, who would be motivated to benefit from the technology in a clinical setting and who might compare favorably with our machine learning system. The volunteers were self-trained using the same annotated training dataset used by our model. They were then given the same AIF DIC testing dataset and asked to determine the polarization status for each test image (before or after onset).

The model we establish here yielded a classification sensitivity of 90.2% (95% confidence interval (CI): 86.1%–93.8%) and specificity of 81.1% (95% CI: 76.2%–85.4%) for single image inputs, with areas under the receiver operating characteristic curve of 0.893 (95% CI: 0.866–0.917) (Fig. 3a and Supplementary Table 1). Deep learning achieved both a higher true positive rate and lower false positive rate than the average human volunteer. Figure 3b shows the confusion matrix for predictions. Our model correctly classified 497 out of 583 frames, resulting in a classification accuracy of 85.2% (95% CI: 82.2%–88.2%). In comparison, the average human accuracy on the same testing frames was 61.1% (95% CI: 57.1%–65.0%) (Fig. 3b). The model outperformed humans on average (Fig. 3c, two-tailed z-test, p < 0.0001) as well as individually (Supplementary Fig. 3).

Figure 3
figure 3

Results of image classification task by the ensemble deep learning model and the average human. (a) The receiver operating characteristic (ROC) curve of the performance of the ensemble deep learning (DL) model on testing frames. The 95% confidence intervals (CIs) of the ROC curve are indicated by the orange shaded area. The orange solid star represents the performance of the ensemble DL model with the default probability threshold of 0.5 to binarize its output and the dark blue solid circle represents the performance of the average human (AH), which is an aggregate result of six human volunteers’ prediction. We applied majority voting to the six predictions on each testing frame to obtain the average human performance. If each prediction received three votes, we randomly assigned a prediction of before or after onset. (b) Confusion matrix of image classification on testing frames by the ensemble DL model with the binarization threshold of 0.5 and the average human. c Testing accuracy bar chart of the ensemble DL model and the average human compared with no skill (random predictions), where the error bars represent the 95% CI. The ensemble DL model significantly outperforms the average human, and the no skill predictions. *p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001, NS, not significant, two-sided z-test. All the 95% CIs are estimated by bootstrapping the testing dataset with 1000 replicates.

Understanding image features of interest to the model

We interrogated our model for embryo regions that most strongly affected the model’s predictions, using class activation maps (CAM)30. CAM relies on a heat map representation to highlight pixels that trigger a model to associate an image with a particular class (before or after onset). In Fig. 4, we have overlaid the CAM heat map with the input testing AIF DIC image. In each heat map, red pixels indicate regions of the embryo containing features that correlate positively with the predicted polarization class, while blue pixels indicate regions containing features that correlate negatively (i.e., correlate positively with the opposing class). To understand which regions of an embryo influence our model most, we evaluated each possible prediction outcome: true negative (TN) (Fig. 4a), false positive (FP) (Fig. 4b), false negative (FN) (Fig. 4c), and true positive (TP) (Fig. 4d). When the model classified image frames as after polarization, it appeared to use inter-blastomere angle as a cue (see Discussion). Misclassifications tended to result from mismatched polarity between individual blastomeres and the overall embryo, producing weak prediction probabilities for both classes near 50% while the model was forced to choose one class (Fig. 4c). Predictions in this probability range are more reasonably interpreted as not sure or cannot tell, but these were not options for the model.

Figure 4
figure 4

Visualization of the decision-making by the ensemble deep learning model. Heat maps obtained by the class activation mapping (CAM) technique highlight how the ensemble deep learning model attends the discriminative regions in the testing frame when giving the predicted class label. The red regions indicate positive focus of the model (in alignment with the predicted label) and the blue regions negative focus (in opposition to the predicted label). (ad) correspond to four cases in confusion matrix, true negatives (TN), false positives (FP), false negative (FN), and true positives (TP), respectively. In each subfigure, from left to right are the testing DIC image, its overlay with the focus heat map, and its corresponding fluorescence channel image. On top of the test DIC image is the predicted label of the ensemble DL model with its confidence (from 0 to 100%). On top of the fluorescence image is the annotated label by the expert. All the heat maps show that our DL model either attends to the individual blastomeres or the inter-blastomere angles. For example, TP heat map d focuses on the truly polarized blastomeres. At a certain time-point, some blastomeres have started polarization but the others have not, as shown in the FN case (c). This issue resulted in the DL model making a Type II error with low confidence in the case given. Scale bar = 20 μm.

Model outperforms compaction alone for discrimination

The use of inter-blastomere angle as a cue by our model to determine embryo polarization (Fig. 4a) was not surprising. Inter-blastomere angle is an indicator of embryo compaction31,32, a morphological change during development that typically precedes polarization (Fig. 5a). To assess the extent to which our deep learning model uses just compaction for its polarization prediction, we annotated each embryo’s AIF DIC frame sequence with the time point of compaction. We defined the time of compaction as the first frame at which the smallest inter-blastomere angle of the embryo is over 120 degrees, in agreement with previous research (Supplementary Fig. 4)32. To find the model’s predicted time point of polarization, we re-aligned embryo frames in their original time sequence and applied temporal smoothing on the predicted label sequence for each testing embryo based on majority voting to output a single time point for polarization (Fig. 2c).

Figure 5
figure 5

Comparative analysis of the ensemble deep learning model prediction and the compaction-based prediction for polarization. (a) Chronological order of compaction and polarization events during the 8-cell stage for a normal mouse embryo. (b) Correlation analysis between time points of DL model polarity prediction and compaction. The x and y coordinate are the predicted polarity onset time index of testing embryos (marked in blue solid balls) by the ensemble DL model and the annotated compaction time index, respectively. Their pairwise relationship shows a Pearson correlation coefficient (ρ) of 0.75. (c) Violin plot to visualize the time discrepancy between the annotated and the predicted polarity onset time index on 19 testing embryos by ensemble DL model and compaction proxy, overlaid with a slopegraph showing each testing embryo prediction time discrepancy in pair. From the kernel density estimate (blue shade) of violin plot and the connection line trends of slopegraph, we can tell that the prediction time discrepancy of DL model is significantly lower than the one of compaction proxy. The p-value is specified in the figure for *p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001, NS not significant, two-sided Wilcoxon matched-pairs signed-rank test.

The Pearson correlation coefficient between compaction time point and the model’s predicted time point of polarization onset was 0.75 across the 19 embryos used for testing (Fig. 5b), suggesting that whilst compaction is indeed a utilized cue, it is not the only factor used by the model. We evaluated whether our model was superior to using compaction alone as a proxy for polarization, by calculating the time discrepancies between annotated polarization time indexes (ground truth) and predicted time indexes by either our model or the compaction proxy. The model had significantly smaller time point prediction errors compared to the latter (two-tailed Wilcoxon matched-pairs signed-rank test, p < 0.05, Fig. 5c). That is, the model was superior to the use of compaction alone for predicting polarization and has likely managed to learn additional cues we do not yet understand.

Distinguishing exact polarization onset time

We wished to further extend our deep learning model to identify the exact point at which polarization occurs in time-sequence videos. To this end, we evaluated polarization onset time point predictions from the classification results of both the model and human volunteers, using a temporal smoothing method (Fig. 2c). Timestamp errors between predicted and annotated time points were calculated as was done previously for compaction time point analysis. Our model had significantly smaller timestamp prediction errors than the average human volunteer by pairwise comparison (two-tailed Wilcoxon matched-pairs signed-rank test, p < 0.01, Fig. 6; Supplementary Fig. 5a).

Figure 6
figure 6

taken from the six volunteers. AH with (w/) time information indicates that six human volunteers were given the chronologically ordered frames for each testing embryo. They directly estimated the polarity onset time point from these time sequences. Statistical analysis uses the ensemble DL model result as the reference to test their difference significance and the p-values are specified in the figure for *p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001, NS not significant, two-sided Wilcoxon matched-pairs signed-rank test.

Comparative analysis of the polarity onset time point prediction by the ensemble deep learning model, the average human and the compaction proxy. Violin plot of time discrepancy between the annotated and the predicted polarity onset time index of 19 testing embryos by ensemble DL model, average human (AH) without/with time information and compaction proxy. AH without (w/o) time information (info) means that six human volunteers were given the randomized testing frames without any time information. Their predicted labels were then chronologically ordered for each testing embryo and temporally smoothened in the same manner as the ensemble DL model predictions. The mean discrepancy was

We next wished to investigate whether smoothened results from our ensemble classification model could outperform even human volunteers who are given access to temporal information during testing that the model does not use. To this end, we provided each volunteer with the complete AIF DIC videos in frame-by-frame time sequence for each embryo and asked for their estimate of the polarization onset time point. Compared with the smoothened model classification results performed on individual unordered images, the average human timestamp discrepancy was significantly larger than that of our model (two-tailed Wilcoxon signed-rank test, p < 0.05, Fig. 6; Supplementary Fig. 5b). The model identified exact polarization time points more precisely than the human volunteers, even when the volunteers utilized temporally ordered full video frames that the model did not have access to during training.

Discussion

In this study, we show that an ensemble deep learning model can identify polarization in unstained embryo images from the DIC microscope with an accuracy surpassing that of humans by a wide margin. When classifying 583 test DIC 8-cell stage frames, our model yielded an accuracy of 85% [95% confidence interval (CI): 82.2%–88.2%] compared to corresponding average human accuracy of 61% [95% CI: 57.1%–65.0%].

It is important to note the difficulty of the polarization detection task using unstained embryo images, since to the naked human eye, unstained images do not have any clear features which allow identification of the cellular apical domain. This is reflected in our observed human accuracy of 61%, which represents a performance level barely higher than random chance. Expressed as odds, the odds of a human volunteer correctly differentiating polarization were 1.5—that is, humans were right 1.5 times for each time they were wrong. In contrast, our deep learning model was right 5.7 times for each time it was wrong.

Current embryo selection in IVF clinics relies on crude and qualitative expert inspection of live embryos under plain microscopy that equates to an educated guess. Deep learning is an unusually well-suited solution to providing a more accurate assessment of embryo health for IVF, since deep neural networks recognize subtle features that are difficult for humans to identify33,34,35. Prior research in this field16,17,20 limited itself only to features that are obvious on bright field or DIC imaging such as cell count and size, or to directly predict implantation potential without investigating underlying biological processes19. Our model can potentially enable embryo quality assessment using an important developmental milestone and thereby overcome some limitations of these prior deep learning studies. To our knowledge, there is currently no other known way to adequately evaluate the developmentally critical polarization milestone for embryo health screening prior to selection for implantation. By detecting an underlying developmental feature of the embryo using unstained embryo images, our study provides a platform for a potential future solution to improve IVF technology.

We investigated possible reasons for the successes and failures of our model using the CAM technique and concluded that inter-blastomere angle, an indicator of compaction, was one of the model’s cues for prediction. However, compaction alone was an inferior predictor of polarization compared to the model, suggesting that our model learned additional features informative of polarization that we currently do not understand. The intriguing implication is that more discriminative biology is apparent in simple unstained embryo images than we currently realize. Moreover, our deep learning model was able to identify the exact time point of polarization onset amongst temporally sequenced video frames better than all human volunteers, even with a severe disadvantage in data.

We were able to circumvent 3D image stack analysis through the use of a state-of-the-art all-in-focus algorithm28, which allowed for the efficient collapse of 3D optical data to 2D. Prior studies that apply deep learning to embryo development have used single z slice DIC images as input, which obscures important 3D blastomere features. Our DTCWT-based method of compressing multiple z slices into a single maximally informative 2D representation reduces data size, allowing a full range of published 2D neural network models to become accessible for analysis.

We considered several possibilities during the design of our model architecture. Many DCNN architectures have been published for image classification, including dense convolutional network (DenseNet)36, squeeze-and-excitation network (SENet)37 and residual neural network (ResNet)38. We used cross validation to select a final candidate architecture among them these networks and found that ResNet has the highest average validation accuracy and a low variance (Supplementary Fig. 6). This model became the basis for our individual DCNN modules, of which we combined six to form the final ensemble polarization classifier. Here the number six is to match the number of recruited human volunteers for fair comparison.

Our deep learning-based approach presents a significantly more accurate and less laborious way to evaluate mammalian embryo polarization compared to manual analysis. In future studies, this approach can be used to predict other features that indicate the quality of developing embryos, which can be measured by blastocyst implantation success rate. Furthermore, it could be useful for stainless tracking of polarization in live human embryos, allowing IVF clinics to assess embryo polarity status and its precise timing using non-invasive methods, and move away from empirical embryo grading to a system grounded in established developmental milestones of embryogenesis. In the future, in order to further enhance the generalization ability of our trained model, we can utilize more diverse data e.g., from different institutes and clinics.

In conclusion, we have developed a powerful non-invasive deep learning method to detect embryo polarization from images without the use of fluorescence, while surpassing human performance. This method has a great potential to provide the first example of detecting an underlying developmental feature of the mammalian embryo from unstained images, which could be an important step towards improving IVF technology from the rate of increase in cell number or assessment of morphological features independently of developmental events.

Methods

Assembling the embryo dataset

All mouse experimental data was obtained in accordance with the Animals (Scientific Procedures) Act 1986 Amendment Regulations 2012, under project license by the University of Cambridge Animal Welfare and Ethical Review Body (AWERB). Reporting of animal experiments follows ARRIVE guidelines. Embryos were collected at the 2-cell stage from F1 females (C57BI6xCBA) mated with F1 studs, following super ovulation of the female: injection of 7.5 IU of pregnant mares’ serum gonadotropin (PMSG; Intervet), followed by injection of 7.5 IU of human chorionic gonadotropin (HCG; Intervet) after 48 h and immediately prior to mating.

Embryos were microinjected with Ezrin–red fluorescent protein (RFP) mRNA as a polarity marker before imaging, in each blastomere at the 2-cell stage, as described previously23. Images were collected on confocal Leica SP5 or SP8 microscopes. The interval between each frame on the time (t) axis was 1200 s–2400 s for each embryo, and z frames were taken at 4 μm intervals on the z axis. Time-lapse recordings were converted into TIFF files for analysis and processed on Fiji software. Recordings that were incorrectly formatted, visually unclear, or which showed grossly defective embryos were excluded. From an initial 174 embryo recordings, 89 were used for deep learning and human testing (Supplementary Fig. 1). Only 8-cell stage frames were included in deep learning and analysis (defined as frames from the first frame where 8 distinct blastomeres are visible, to the frame immediately prior to the moment at which the final blastomere starts dividing). The DIC channel images were converted into an AIF DIC frame for each time point as described in Results, and the Ezrin-RFP channel images were converted into maximum intensity z projection frames, prior to annotation.

Embryo annotation (polarization and compaction)

Each embryo time-lapse recording was marked with a polarization onset time by a human expert annotator, corresponding to the first frame in which a polarized blastomere is clearly visible. This was achieved using the maximum intensity z projection Ezrin-RFP frame: the polarization onset frame is one in which the first apical ring or cap is completely and clearly formed (closed) on any blastomere, and which takes up greater than or equal to 1/3 of the surface of the cell as visible in the recording. All frames after and including this polarization onset point were classified as after-onset. All frames prior to this point were classified as before-onset. Compaction time was indicated when smallest inter-blastomere angle was greater than 120 degrees, as previously32. All frames after and including this point were considered compacted, and all frames prior to this point were considered uncompacted.

Ensemble deep learning framework

In this study, two types of effective machine learning techniques, DCNN and ensemble learning, were adopted and combined together for prediction of polarity onset. Multiple (6 here to match the number of human volunteers) DCNNs learnt on the training cohort and then their output predictions were averaged to predict the class label of each testing image. Specifically, the ResNet backbone was chosen as the main part of each DCNN model. A dense layer with two output nodes is added on top of the ResNet backbone. We used the pre-trained weights on ImageNet database as the initialization for each DCNN model. Three of them were trained with SGD optimizer and the other three were trained with Adam optimizer. All of them were trained for 40 epochs. At the end of 40 epochs, all the models converge to nearly 100% in terms of the training accuracy. Different training settings made the six trained CNNs a bit more diverse from each other, where the diversity among CNNs would improve the generalization ability of the ensemble model. To investigate ensemble learning, we first adopted the cross-validation (CV) technique to compare different CNN backbones, including DenseNet36, SENet37 and ResNet38. Based on the results of fivefold CV experiments (Supplementary Fig. 6), we found ResNet is the optimal choice in both prediction performance and computational load. Then, we tried different ensemble techniques and their results are summarized in Supplementary Table 2. From this table, we can see that ensemble on six ResNet models using all the training data by varying the optimization initials and then applying majority voting on the output labels achieved the best testing accuracy. However, its advantage over averaging the output probability is only 0.4%. It is quite marginal, and the latter is more intuitive and common considering the number of ensemble members is even. We further note that there is a limit on ensemble learning improvement over a single classifier, which is mainly bottlenecked by the classifier architecture rather than ensemble techniques.

Human trial

In order to evaluate the performance of our DL model, comparative trials on human volunteers to identify polarity onset were conducted as well. Six human volunteers (3 males, 3 females for gender equality) with a bachelor’s degree in a STEM subject but without prior experience of mouse embryo development studies were recruited from Caltech community, as representatives for competent STEM-trained but inexperienced volunteers who would benefit from the technology in a clinical setting. Volunteers were sent an email with clear instructions and a link to the training and testing data. Each was asked to learn on the training dataset first and then apply their learnt patterns to the testing images, to predict their polarity onset status by filling in an Excel table with predicted labels. After the test, they each returned their Excel file for evaluation.

All participants provided informed consent before taking part in our study. They consented to allow their data to be used in the final analysis and all individuals received reward for participation. The study was approved by Caltech Institutional Review Board.

Evaluation of model and human performance

Results from the testing data—for each of the model and human predictions—were processed as follows: In classification analysis, classified frames from the model/prediction were automatically sorted into one of four categories visible in the confusion matrix (polarized or non-polarized annotated true class, versus polarized or non-polarized predicted class). Cases in which the true class matched the predicted class were scored as an accurate prediction, and cases where the two classes did not match were scored as an inaccurate prediction. Population proportions of accurate results represent the proportion of accurate frames in the total population of frames. For time-smoothened data, the frames were first returned to time-order, after which the polarity onset point was determined by finding the point at which the prediction switched from an unpolarized majority to a polarized majority (as seen in Results). All frames after this polarity onset point were then classified as polarized, and all frames before this point were classified as unpolarized, therefore ‘smoothening’ out any anomalous predictions using time point information. For time point analysis, the polarity onset point (as determined from the smoothening process) was used. For each testing embryo time-lapse recording, the time discrepancy for the model/volunteer was calculated as the actual time difference (to the nearest second) between the predicted polarity onset frame and the annotated polarity onset frame, using the knowledge of the frame-to-frame time difference for each recording. Where no predicted onset frame was given within the allocated recording, for this analysis the frame immediately after the final frame of the time-lapse recording was used as the predicted onset of polarization. These time discrepancies for each embryo were used in pairwise comparisons.

CAM attention map generation

To identify focus areas of our ensemble model, we generated attention heat maps using the class activation mapping technique. To be specific, we multiplied each feature map passing through the global average pooling (GAP) layer of ResNet backbone with their corresponding weight connecting the GAP layer and the fully-connected layer. Then we added the weighted feature maps in an element-wise manner. Each weight tells us how much importance needs to be given to individual feature maps. The final weighted sum gives us a heat map of a particular class (in our case, the before/after polarity onset class), which indicates what pixels our model favors or dislikes to make the final prediction. The heat map size is the same as the one of feature maps. Therefore, to impose it on the input AIF DIC image, we scaled it to the size of the input image and finally got results shown in Fig. 3.

Statistical analysis

Image classification results were compared using a two-tailed z-test of two population proportions with significance classified for p-values as: *p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001 and not significant (NS). Time prediction discrepancies were compared using two-sided Wilcoxon matched-pairs signed-rank test since our testing data size is small and not guaranteed as normal. Significance was given for p-values as the same with the above. Further details are given with each result. Statistical analyses were performed using the statistics module in SciPy package with Python (https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html). All the 95% confidence intervals were estimated by bootstrapping the testing dataset with 1000 replicates.

Institutional review board statement

All mouse experimental data was obtained in accordance with the Animals (Scientific Procedures) Act 1986 Amendment Regulations 2012, under project license by the University of Cambridge Animal Welfare and Ethical Review Body (AWERB). Reporting of animal experiments follows ARRIVE guidelines. Embryos were collected at the 2-cell stage from F1 females (C57BI6xCBA) mated with F1 studs, following super ovulation of the female: injection of 7.5 IU of pregnant mares’ serum gonadotropin (PMSG; Intervet), followed by injection of 7.5 IU of human chorionic gonadotropin (HCG; Intervet) after 48 h and immediately prior to mating. All participants in the human trial provided informed consent before taking part in our study. They consented to allow their data to be used in the final analysis and all individuals received reward for participation. The human trial was approved by Caltech Institutional Review Board.