Development and validation of deep learning based embryo selection across multiple days of transfer

This work describes the development and validation of a fully automated deep learning model, iDAScore v2.0, for the evaluation of human embryos incubated for 2, 3, and 5 or more days. We trained and evaluated the model on an extensive and diverse dataset including 181,428 embryos from 22 IVF clinics across the world. For discriminating between transferred embryos with known outcome, we show areas under the receiver operating characteristic curve ranging from 0.621 to 0.707 depending on the day of transfer. Predictive performance increased over time and showed a strong correlation with morphokinetic parameters. The model’s performance is equivalent to the KIDScore D3 model on day 3 embryos while it significantly surpasses the performance of KIDScore D5 v3 on day 5+ embryos. This model provides an analysis of time-lapse sequences without the need for user input, and provides a reliable method for ranking embryos for their likelihood of implantation, at both cleavage and blastocyst stages. This greatly improves embryo grading consistency and saves time compared to traditional embryo evaluation methods.


Introduction
Prioritizing embryos for transfer and cryopreservation is a long-standing challenge in the field of in vitro fertilization (IVF) with both academic and commercial research dedicated to its resolution. When multiple good quality embryos are available, selection of the embryo with the highest likelihood of implantation will shorten time to pregnancy and ultimately live birth. Traditionally, embryo evaluation has been carried out by manual inspection of either static microscope images or time-lapse videos of developing embryos. Scoring systems based on morphological and morphokinetic annotations have been used to rank embryos within patient cohorts as well as to decide which embryos to discard and which to transfer and/or cryopreserve. In recent years, however, the use of artificial intelligence (AI) to evaluate embryos has shown promise in both automating the assessment and potentially surpassing the ranking performance of manual inspection [1].
Increasingly, the blastocyst stage has become the preferred development stage for transfer [2], and most AI models for embryo evaluation specifically address embryos cultured to day 5 or later [3,4,5,6]. However, blastocyst culture generally results in a lower number of embryos to choose from, and for patients with poor embryo development, cleavage-stage transfers may be preferred if there is a risk of a cancelled cycle [2]. Few studies exist that focus on both cleavage-stage and blastocyst transfers [7,8]. Erlich et al. [7] propose a combined model for handling day 3 and day 5 transfers, by predicting a score for each image in a time-lapse sequence. Scores from previous images in the sequence are then aggregated temporally. The authors claim that the method provides continuous scoring regardless of development stage and time, and that it outperforms the manual morphokinetic model, KIDScore D3 [9]. However, as they only evaluate on day 5 transfers, they ignore the possibility that different embryo characteristics may not be equally important for day 3 and day 5 transfers. Kan-Tor et al. [8] also propose a combined model for handling day 3 and day 5 transfers, by first predicting scores for non-overlapping temporal windows, and then aggregating scores from previous windows in the sequence using logistic regression. The authors show both discrimination and calibration results together with subgroup analyses on patient age and clinics for day 5 transfers. However, for day 3 transfers, only the overall discrimination performance is presented. Therefore, the calibration and generalization performance on day 3 embryos across subgroups such as patient age and clinics remains to be seen.
In addition to day of transfer, current AI models often deviate in how they approach automation. Some methods assume manual preselection by embryologists and can thus be categorized as semi-automated. These are methods that have only been trained on transferred embryos and therefore generally have not seen embryos of poor quality [4,5,6]. Other methods approach full automation by training on all embryos, regardless of whether they were transferred or not. These methods rely on other labels than pregnancy for the non-transferred embryos such as manual deselection by embryologists (discards), results of preimplantation genetic testing for aneuploidy (PGT-A), or morphokinetic and/or morphological annotations [7,10,3]. For an AI model to be both fully automated and superior in ranking performance on previously transferred embryos, both aspects need to be evaluated [1,11]. The performance of both transferred embryos with known implantation data (KID) and non-transferred embryos of different qualities and development stages needs to be evaluated in order to ensure general prospective use.

Figure 1. iDAScore v2.0. Two separate tracks handle day 2/3 and day 5+ embryos. The first track consists of two 3D convolutional neural networks (CNN) that predict implantation potential and direct cleavages, followed by separate calibration models for day 2 and 3. The second track consists of a 3D CNN that predicts implantation potential followed by a day 5+ calibration model. Finally, scores from both tracks are scaled linearly to the range 1.0-9.9.

In this study, we describe the development and validation of a fully automated AI model, iDAScore v2.0, for embryo evaluation on day 2, day 3 and day 5+ embryos. As in our previous work [3], the model is based on 3D convolutions that simultaneously identify both spatial (morphological) and temporal (morphokinetic) patterns in time-lapse image sequences. However, whereas our previous work only dealt with ranking performance, in this study, we also calibrate the model to obtain a linear relationship between model predictions and implantation rates. We train and evaluate our model on an extensive and diverse dataset including 181,428 embryos from 22 IVF clinics across the world. On independent test data, we present both discrimination and calibration performance for embryos transferred after 2, 3 and 5 or more days of incubation, individually, and compare with iDAScore v1 [3,12,13] and the manual morphokinetic models, KIDScore D3 [9] and KIDScore D5 v3 [14]. We also present discrimination performance for a range of subgroups including patient age, insemination method, transfer protocol, year of treatment, and fertility clinic. Finally, we perform temporal analyses on score developments from day 2 to 5 to illustrate improvements over time in discrimination performance, temporal changes in ranking, and relation to common morphokinetic parameters used for traditional embryo selection. To the best of our knowledge, our work presents the first AI-based model for ranking embryos from day 2 to day 5+, and is the first study to present calibration curves and subgroup analyses on transferred cleavage-stage embryos.

Study design
The study was a multi-center retrospective cohort study consisting of 249,635 embryos from 34,620 IVF treatments carried out across 22 clinics from 2011 to 2020. As the study focused on day 2, day 3 and day 5+ transfers, day 1 (n=1,243) and day 4 (n=182) embryos were excluded, corresponding to embryos incubated less than 36 hours post insemination (hpi) and embryos incubated between 84 hpi and 108 hpi. Furthermore, embryos without known clinical fate were excluded, as their clinical outcomes were unknown due to follow-up loss (n=3,192) or because they were still cryopreserved at the time of data collection and thus had pending outcomes (n=50,392). After data exclusion, 181,428 embryos remained, of which 33,687 were transferred embryos with known implantation data (KID) measured by the presence of a fetal heartbeat, and 147,741 were discarded by embryologists either due to arrested development, failed fertilization, aneuploidy, or other clinical deselection criteria. Finally, the dataset was split into training (85%) and testing (15%) on treatment level, ensuring that all embryos within a given treatment were either allocated to training or testing. While this split strategy allows cohort analyses on the test set, it also mitigates certain types of biases, as the AI model cannot benefit from overfitting to individual patients in the training set. A flow diagram illustrating patients, exclusion of data points (embryos), and division into training and test subsets is shown in Figure 2. Table 1 shows the specific number of discarded embryos and KID embryos with positive (KID+) and negative (KID-) outcomes for each day in the training and test sets. Table 6, Table 7 and Table 8 in the appendix contain further details on patient age, clinical procedures, and embryos for each clinic in the data subsets of day 2, day 3, and day 5+ embryos, respectively.
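A treatment-level split of this kind can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the record layout, the identifier names, and the random seed are assumptions, and the 15% fraction is applied to treatments, so the embryo-level fraction is only approximately 15%.

```python
import random
from collections import defaultdict


def split_by_treatment(embryos, test_fraction=0.15, seed=0):
    """Split embryos so that every treatment lands wholly in one subset.

    `embryos` is a list of (embryo_id, treatment_id) pairs. Both names are
    hypothetical placeholders for whatever identifiers a dataset provides.
    """
    by_treatment = defaultdict(list)
    for embryo_id, treatment_id in embryos:
        by_treatment[treatment_id].append(embryo_id)

    # Shuffle the treatments (not the embryos) and reserve a fraction for test.
    treatment_ids = sorted(by_treatment)
    random.Random(seed).shuffle(treatment_ids)
    n_test = round(len(treatment_ids) * test_fraction)
    test_treatments = set(treatment_ids[:n_test])

    train, test = [], []
    for treatment_id, ids in by_treatment.items():
        (test if treatment_id in test_treatments else train).extend(ids)
    return train, test, test_treatments


# Toy data: 100 treatments with 1-5 embryos each.
embryos = [(f"e{t}_{k}", f"t{t}") for t in range(100) for k in range(1 + t % 5)]
train_ids, test_ids, test_treatments = split_by_treatment(embryos)
```

Because the unit of randomization is the treatment, no patient cohort is shared between the two subsets, which is what makes cohort-level analyses on the test set possible.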

Image data
All embryos were cultured in EmbryoScope™, EmbryoScope™+, or EmbryoScope™ Flex incubators (Vitrolife A/S, Aarhus, Denmark). The incubators acquired time-lapse images during embryo development according to specific settings in each clinic. For EmbryoScope™ incubators, microscope images of 3-9 focal planes of size 500×500 pixels were acquired every 10-30 minutes. For EmbryoScope™+ or EmbryoScope™ Flex incubators, microscope images of 11 focal planes of size 800×800 pixels were acquired every 10 minutes.

Model development
To predict embryo implantation on day 2, 3 and 5+, a combined AI model consisting of several components was developed. Figure 1 shows a flowchart of the model. If an embryo is incubated more than 84 hpi, raw time-lapse images from 20-148 hpi are fed to a 3D convolutional neural network (CNN) that outputs a scalar between 0-1 (Day 5+ model). If, however, the embryo is incubated less than 84 hpi, images from 20-84 hpi are fed to two separate CNN models that evaluate overall implantation potential (Day 2/3 model) and presence of direct cleavages from one to three cells and from two to five cells (Direct cleavage model). The day 2/3 model outputs a scalar between 0-1, and the direct cleavage model outputs two scalars (one for each type of direct cleavage) between 0-1. A logistic regression model then combines the three outputs into a single scalar. Finally, outputs from either day 2/3 or day 5+ are calibrated individually for each day to obtain a linear relationship between scores and implantation rates. At this point, the scores are estimates of pregnancy probabilities representative of the average patient population (including various diagnostic profiles), as opposed to individualized probabilities for each patient. Therefore, to avoid the scores being mistaken for individualized probabilities, the calibrated scores are ultimately rescaled to the range 1.0-9.9, similar to the range used in our previous work [3] and by the manual morphokinetic model, KIDScore D5 v3 [14].
For more details on model architectures, training methodology including data sampling, preprocessing and augmentation strategies as well as individual results for the components, see Appendix A.
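The routing and rescaling steps described above can be sketched as below. This is an illustration under stated assumptions rather than the model's actual code: the function names are hypothetical, the handling of an incubation time of exactly 84 hpi is not specified in the text and is assigned to the day 2/3 track here, and the linear map is assumed to send a calibrated probability of 0 to 1.0 and of 1 to 9.9.

```python
def select_track(incubation_hpi):
    """Choose the model track from the incubation time in hours post insemination.

    More than 84 hpi goes to the day 5+ track; otherwise the day 2/3 track.
    Assigning exactly 84 hpi to the day 2/3 track is an assumption.
    """
    return "day5+" if incubation_hpi > 84 else "day2/3"


def rescale_score(calibrated_probability, low=1.0, high=9.9):
    """Linearly map a calibrated probability in [0, 1] onto [low, high].

    The endpoints of the mapping (0 -> 1.0, 1 -> 9.9) are assumed.
    """
    if not 0.0 <= calibrated_probability <= 1.0:
        raise ValueError("calibrated probability must lie in [0, 1]")
    return low + (high - low) * calibrated_probability


track = select_track(116)      # a day 5 embryo
score = rescale_score(0.40)    # 1.0 + 8.9 * 0.40
```

Because the rescaling is linear and increasing, it leaves the ranking of embryos within a cohort unchanged.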

Model validation
Internal validation was used to evaluate the predictive performance of the model on test data in terms of discrimination and calibration [15,1]. The area under the receiver operating characteristic curve (AUC) was used to quantify discrimination and reported with 95% confidence intervals using DeLong's algorithm [16]. Tests for significant differences in AUC were performed using either paired or unpaired two-tailed DeLong's test [16]. Bonferroni-adjusted p-values were used for reporting significant differences between subgroups. Calibration was assessed graphically using observed implantation rates in grouped observations of similar predictions (quantiles) and Loess smoothing [17].
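As an illustration of the two simplest quantities involved, the sketch below computes the AUC as the probability that a randomly chosen positive embryo outscores a randomly chosen negative one (ties counted as one half) and applies a Bonferroni adjustment to a list of p-values. The function names are hypothetical, and DeLong's confidence intervals and tests are not reproduced here.

```python
def auc(scores, labels):
    """AUC as P(score of a random positive > score of a random negative).

    Ties contribute 0.5. `labels` holds 1 for KID+ and 0 for the negative class.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


example_auc = auc([9.1, 7.4, 6.8, 3.2, 2.5], [1, 0, 1, 0, 0])
```

The pairwise formulation is quadratic in the number of embryos and is meant only to make the definition concrete; rank-based implementations are preferable for large test sets.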

Results
The combined discriminatory performance in terms of AUC for iDAScore v2.0 is presented in Table 2a along with intermediate results by each component from Figure 1. For each day (2, 3 and 5+), the table lists an AUC on all embryos (KID+ vs. KID- and discarded) and on KID embryos (KID+ vs. KID-). The AUCs on day 2, 3 and 5+ were 0.862, 0.873 and 0.954 for all embryos and 0.669, 0.621 and 0.708 for KID embryos. Table 2b provides a comparison with two manual scoring systems, KIDScore D3 [9] and KIDScore D5 v3 [14], as well as our previous work, iDAScore v1 [3], on embryos in the test set that had manual morphological and morphokinetic annotations required by KIDScore. iDAScore v1 was evaluated on iDAScore v2.0's test set. As this includes training samples from v1, the iDAScore v1 performance may be overestimated. On day 3, the AUCs of iDAScore v2.0 and KIDScore D3 on KID embryos were 0.608 and 0.610, with no significant differences according to a paired DeLong's test (p = 0.92). As such, iDAScore v2.0 seems to perform as well as KIDScore D3 in selecting embryos for transfer on day 3, but without requiring any manual annotations. On day 5+, however, the AUCs of iDAScore v2.0 and KIDScore D5 v3 on KID embryos were 0.694 and 0.644 and significantly different (p < 0.001). When comparing iDAScore v2.0 against the previous version, iDAScore v1, on KID embryos, AUCs were 0.694 and 0.672 and also significantly different (p = 0.047). This suggests that the increased amount of training data and slightly modified training strategies from v1 to v2.0 have improved the model performance significantly.
The calibration performance of iDAScore v2.0 is shown in Figure 3 for day 2, 3 and 5+, individually. In general, there is a good agreement between predicted probabilities and observed implantation rates. Comparing the three curves, we see that both the ranges of predictions and success rates increased from day 2 to 3 and from day 3 to 5+. That is, the best day 3 embryos had higher scores and a higher implantation rate than the best day 2 embryos. And on day 5+, we observe both the highest and lowest scores as well as the highest and lowest implantation rates. This suggests that with more information available on the blastocyst stage, the model can more confidently assign a probability of implantation, ranging from around 6% for the lowest scores up to 65% for the highest scores on day 5+. As these predictions were made based on time-lapse images alone, however, they represent average patient probabilities and not individualized patient probabilities. To predict probabilities on patient-level, additional characteristics such as patient demographics and clinical practice should be included in the calibration procedure and analysis [1,18]. However, these aspects are outside the scope of this work.

Table 2. Part (b) compares performance with KIDScore D3, KIDScore D5 and iDAScore v1 models on embryos that have manual annotations available as required by KIDScore.

Subgroup analysis
To investigate the generalization performance of iDAScore v2.0 across different patient demographics and clinical practices, subgroup analyses were performed on KID embryos for the following parameters: patient age (<30, 30-34, 35-39 and >39 years), insemination method (IVF and ICSI), transfer protocol (fresh and cryopreserved), treatment year (<2015, 2015-2016, 2017-2018, >2018), and clinics. The results are available in Table 9 in the appendix that lists the number of KID embryos and corresponding AUCs for each subgroup. Using unpaired DeLong's test, it was found that on day 5+, AUCs for the age group > 39 were significantly higher than all other age groups (p < 0.03). For transfer protocol, a significant difference was found between AUCs for fresh and cryopreserved transfers on day 5+ (p = 0.03). A significant difference was found between treatment years > 2018 and 2015-2016 on day 2 (p = 0.02). While this difference in theory could indicate temporal biases due to improvements in IVF treatments over time, it may also represent differences between clinics, as not all clinics contributed with data across all years. Differences between individual clinic AUCs were significant in multiple cases. On day 5+, this includes clinic 18 vs 20 (p = 0.02) and clinic 21 vs 1, 5, 10, 11, 16, 20 (p < 0.05). Clinic 21 thus performed significantly differently from most other clinics on day 5+ and had the lowest AUC of 0.516, indicating close to random discrimination performance. This may be due to a variety of factors. Most importantly, clinic 21 was the only clinic to perform PGT-A routinely, and thus only transferred euploid embryos. It is expected that this would lower the AUC of any selection algorithm that correlates with euploidy. When evaluating the performance for discriminating between euploid (n = 178) and aneuploid (n = 269) embryos from clinic 21, a considerably higher AUC of 0.68 for iDAScore v2.0 was evident, in line with the expectation.

Predictive performance over time
The predictive performance over time was assessed by evaluating the model at 12 hour intervals from 38-122 hpi. The performance was not assessed between 84-108 hpi as the training data does not include day 4 transfers. The model was evaluated on the day 5+ test set as this is the only set that allows evaluation at all points in time. Two different evaluation methods were used: the AUC for predictions at different times on the day 5+ test set in Figure 4a, and the rate at which the highest scoring embryo in each treatment is KID+ in Figure 4b.
There is a significant improvement in performance when going from predictions on cleavage stage transfers to predictions on blastocyst stage transfers. This is in agreement with previous reports [7,8]. There is a small improvement in predictive performance for later predictions on day 2 and day 5, while day 3 appears to have the same performance.
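The treatment-level metric in Figure 4b can be made concrete with a short sketch. The record layout is hypothetical, and details such as which treatments are eligible and how tied scores are resolved are not stated in the text; here every treatment present in the input is counted and the first embryo encountered wins a tie.

```python
def top_ranked_kid_positive_rate(records):
    """Fraction of treatments whose highest-scoring embryo is KID+.

    `records` is an iterable of (treatment_id, score, implanted) tuples for
    embryos with known implantation data. The layout is an assumption.
    """
    best = {}  # treatment_id -> (score, implanted) of the top-scoring embryo
    for treatment_id, score, implanted in records:
        if treatment_id not in best or score > best[treatment_id][0]:
            best[treatment_id] = (score, implanted)
    if not best:
        raise ValueError("no records supplied")
    return sum(1 for _, implanted in best.values() if implanted) / len(best)


records = [
    ("a", 9.1, True), ("a", 4.0, False),   # top embryo implanted
    ("b", 7.5, False), ("b", 6.0, True),   # top embryo did not implant
]
rate = top_ranked_kid_positive_rate(records)
```

Unlike the AUC, this rate only compares embryos within the same treatment, so it is not inflated by differences in prognosis between patients.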

The most recent prediction contains all information
Since iDAScore v2.0 is based on the entire video of embryo development up to a given point in time, we expect the most recent prediction to be the most informative and earlier predictions to provide no additional information. This is however not a guarantee with deep learning and thus requires validation.
Given two predictions at times $t_a$ and $t_b$, if the prediction at $t_a$ contains no additional information compared to the prediction at $t_b$ with regard to predicting the implantation likelihood, then the prediction at $t_a$ and the implantation likelihood are conditionally independent given the prediction at $t_b$.
We tested whether earlier predictions contain any additional information by using a kernel-based conditional independence test [19] on the day 5+ test set of embryos with KID. We use the median incubation times and four hours prior for each day. With p-values below 0.05 we reject the null hypothesis that a prediction at $t_a$ and the implantation likelihood are conditionally independent given a prediction at $t_b$. The p-values are adjusted using Bonferroni correction.
The p-values of the conditional independence test for predictions are shown in Table 3. The p-values above the diagonal show whether an earlier prediction provides additional information given a later prediction and the p-values below the diagonal show whether a later prediction provides additional information given an earlier prediction. As the p-values above the diagonal are all above 0.05, we cannot reject the null hypothesis, suggesting that they do not provide any additional information. Conversely, in nearly all cases additional information was gained by getting a later prediction except for at (68 h, 64 h) and (44 h, 40 h) where the extra four hours did not add significantly new information.

Table 3. Conditional independence test with the null hypothesis that a prediction at $t_a$ and the implantation likelihood are conditionally independent given a prediction at $t_b$. A p-value below 0.05 means that the prediction at $t_a$ provides additional information compared to the prediction at $t_b$. The p-values are adjusted using Bonferroni correction.

Correlation with morphokinetics
The biological explainability of iDAScore v2.0 was evaluated by estimating the average implantation rate for groups of embryos with similar morphokinetic parameters and comparing them with the predicted implantation rate. The morphokinetic parameters are $t_{PNf}$, $t_2$, $t_4$, $t_8$, $t_3 - t_{PNf}$, and $t_5 - t_3$. We estimated the average implantation rate by grouping embryos with a morphokinetic parameter in five- or ten-hour intervals. For example, to estimate the implantation rate of embryos with a $t_2$ of 30 hpi, we compute the mean implantation for embryos with a $t_2$ between 27.5 hpi and 32.5 hpi and compute its 95% confidence interval along with the mean prediction after 44, 68, and 116 hpi. Only those embryos with an annotation of each parameter were included in the analysis. Many clinics select embryos for transfer based on morphology and morphokinetics, which results in a limited amount of data with known implantation outside the range of normal development. Therefore, we included discarded embryos in this analysis under the assumption that all discarded embryos would not have implanted.
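The windowed estimate described above can be sketched as follows. The function name and data layout are hypothetical, and the text does not state how the 95% confidence interval was computed; a normal approximation to the binomial proportion is used here purely as an assumption.

```python
import math


def windowed_implantation_rate(values, implanted, center, width=5.0):
    """Implantation rate among embryos whose parameter lies in center +/- width/2.

    Returns (rate, ci_low, ci_high, n), or None if the window is empty.
    The interval is a normal-approximation 95% CI, which is an assumption.
    """
    low, high = center - width / 2, center + width / 2
    outcomes = [int(y) for v, y in zip(values, implanted) if low <= v <= high]
    n = len(outcomes)
    if n == 0:
        return None
    rate = sum(outcomes) / n
    half_width = 1.96 * math.sqrt(rate * (1 - rate) / n)
    return rate, max(0.0, rate - half_width), min(1.0, rate + half_width), n


# Toy t2 annotations (hpi) with outcomes; discarded embryos would enter as 0.
t2 = [26.0, 28.0, 30.0, 31.0, 32.5, 35.0]
outcome = [1, 1, 0, 1, 0, 1]
rate, ci_low, ci_high, n = windowed_implantation_rate(t2, outcome, center=30.0)
```

The mean model prediction for the same window would be computed over the same subset of embryos, so that observed and predicted rates are directly comparable.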
There is a bias in which clinics annotated the above morphokinetic parameters. Therefore, the model output was re-calibrated using only data with annotations to isolate the response to the morphokinetic parameters. The comparison between the implantation rate for embryos with similar morphokinetic parameters and the prediction is shown in Figure 5.
The predicted implantation rate is often within the confidence interval of the actual implantation rate for different timings of the morphokinetic events. The 116 hpi predictions are closer to the actual implantation rate than the predictions at 68 hpi and 44 hpi. For some predictions at 44 hpi, embryos with a $t_5 - t_3$ of 20 hours have a $t_5$ that is later than 44 hpi, and thus the model has no chance of even seeing $t_5$. For the other morphokinetic parameters there is a slight overestimation of the implantation rate at the extremes, which is more pronounced for predictions at 44 hpi and 68 hpi than at 116 hpi. Overall, the changes in the predicted implantation rates follow the trend for the actual implantation rates. This suggests the model has either learned to recognise the morphokinetic parameters or some features that heavily correlate with the morphokinetic parameters.

Discussion
Embryo selection is the task of prioritizing, among all available embryos from a patient, the order of transfer, and which to cryopreserve or discard. When automating and potentially improving this task using AI, it is important to be aware of potential biases introduced. If training and evaluation are carried out solely on transferred embryos, the dataset will be biased towards good quality embryos, and the model may not generalize to embryos of poor quality. In practice, this means that a manual preselection by embryologists of which embryos are good enough to transfer is implicitly assumed. To avoid this issue, we included discarded embryos in the training set and balanced their contribution by oversampling the transferred embryos, just as in our previous work [3]. Similar approaches have been proposed by others by including non-transferred embryos during training using pseudo soft labeling [7], or by adding aneuploid embryos determined with PGT-A testing to the negative class [10].
Another source of bias can occur when assuming the selection criteria are independent of the day of transfer. That is, cleavage-stage embryo characteristics may not have the same importance or interactions in predicting outcomes for day 2 and day 5 transfers. Therefore, training or evaluating cleavage-stage models on outcomes from blastocyst transfers as described by Erlich et al. [7] may bias the results. In our preliminary experiments, we observed that separate AI models for cleavage-stage embryos and blastocysts resulted in higher performance than a single combined model. This suggests that the optimal cleavage-stage characteristics for selecting an embryo to transfer on day 2 or 3 may not actually be the same as for selecting (2-3 days ahead of time) which embryo to transfer on day 5. Therefore, it is essential to evaluate AI model predictions based on actual transfer day as was presented in our validation. As an exception, we presented temporal score analyses on day 5+ embryos, as these were the only embryos that could be evaluated across the entire development period from 20-148 hpi. Here, day 5+ outcomes were assumed also to be representative of cleavage-stage outcomes, which is a limitation of the analyses.
To address generalization performance across various potential confounders, subgroup comparisons of AUCs for the variables age, insemination method, transfer protocol, year of treatment, and IVF clinic were presented. Here, we found significant differences between the age group > 39 and all other age groups. A general trend of higher AUCs was observed with increased age. This observation aligns well with other reports that have shown AUCs to increase with age [6,20,7]. This is expected, since age is a strong independent predictor of pregnancy. However, including age as an input to the AI model is advised against, since increases in AUCs do not necessarily reflect improvements in ranking performance within single patient cohorts [21,1]. Erlich et al. [7] have speculated that differences in endometrial receptivity with age may influence label noise and thus subgroup AUCs. The trend may also be caused by general differences in embryo qualities between younger and older women. Younger women typically have transferred embryos of high quality, whereas older women more often have transfers with poor or medium quality embryos. The wider distribution of embryo qualities for older women thus results in higher AUCs. To eliminate such biases, we also evaluated embryo ranking for different age groups on treatment-level. For this, we calculated the rate at which the highest scoring embryo in each treatment was KID+. For the same age groups < 30, 30-34, 35-39 and > 39 years, the rates were 0.861, 0.836, 0.858, and 0.835, showing a slight decrease in performance with age, if anything. This suggests that the higher AUCs with increased age are not caused by a lack of model generalization but possibly by a bias in the distribution of embryo quality for transferred embryos for older women.
Our subgroup comparisons also revealed significant differences between transfer protocols, certain years of treatment, and certain clinics. While this may indicate generalization weaknesses, it may also originate from other biases in the dataset, such as different age distributions across clinics and years. It may be relevant to adjust for known confounders such as age when comparing other subgroups. Theoretically, this should help isolate variables and provide less biased subgroup performance evaluations.
There was a significant information gain from later culture day predictions with increases in performance on both AUC and the rate of KID+ scoring highest in each treatment. The most significant performance increase comes when going from predictions at the cleavage stage to the blastocyst stage, while the intraday performance is only slightly higher for day 2 and day 5+ and the same for day 3.
To address the biological explainability of the model, we compared the estimated implantation rate and the predicted implantation rate for embryos grouped by similarity for various morphokinetic parameters. Here we assume that all discarded embryos would never implant, which is not guaranteed to be correct, but since we calibrate using the same assumptions, it is unlikely to bias the comparison. It does however result in implantation rates that are significantly lower than those of transferred embryos, making the actual values uninteresting. In general we see a good concordance between the estimated and the predicted implantation rates except for very late $t_2$ and $t_4$ predictions along with predictions after 44 hpi and 68 hpi for embryos with a $t_5 - t_3$ of 0 hours and 5 hours. As this is not an issue for $t_3 - t_{PNf}$, it is likely a result of the performance difference between predicting direct cleavages from one to three cells and direct cleavages from two to five cells shown in Table 5 in the appendix. For embryos with a late $t_2$ and $t_4$, the implantation rate is overestimated but still gives lower implantation likelihood than for embryos with earlier $t_2$ and $t_4$, which suggests that it does not impact the ability to rank embryos.

A limitation of the present study is that it used internal validation to evaluate generalization performance both overall and across subgroups. To evaluate actual generalization performance to new clinics that have not taken part in the training process, external validation should be performed. In order to eliminate potential biases caused by retrospective evaluation, a prospective study should be used to reveal the actual performance in a clinical setting. Currently, an ongoing randomized controlled trial (The VISA Study, NCT04969822) is investigating how iDAScore v1 [3] performs compared to manual grading on day 5 embryos.

and KID embryos between the day 2/3 model and KIDScore D3. Therefore, a separate model was developed to detect direct cleavages along with a combination model to balance the direct cleavage scores and the original day 2/3 scores.
The direct cleavage model is trained on manually annotated embryos from the training set in Table 1b to predict the presence of direct cleavages. A direct cleavage from 1 to 3 cells (DC13) is defined as $t_3 - t_{PNf} < 7.6$ hours, where $t_3$ and $t_{PNf}$ denote the timing of cell division to 3 cells and pronuclear fading, respectively. A direct cleavage from 2 to 5 cells (DC25) is defined as $t_5 - t_3 < 5.0$ hours, where $t_5$ denotes the timing of cell division to 5 cells. The dataset for direct cleavages is described in Table 4 1a.
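The two definitions translate directly into a labeling rule over the morphokinetic annotations. The sketch below is illustrative only; the function and argument names are hypothetical, and all timings are taken to be in hours post insemination.

```python
def direct_cleavage_labels(t_pnf, t3, t5, dc13_threshold=7.6, dc25_threshold=5.0):
    """Label an embryo for DC13 and DC25 from annotated timings (hpi).

    DC13: t3 - tPNf < 7.6 hours (direct cleavage from 1 to 3 cells).
    DC25: t5 - t3   < 5.0 hours (direct cleavage from 2 to 5 cells).
    """
    dc13 = (t3 - t_pnf) < dc13_threshold
    dc25 = (t5 - t3) < dc25_threshold
    return dc13, dc25


# An embryo reaching 3 cells 6 h after pronuclear fading shows DC13 only.
labels_a = direct_cleavage_labels(t_pnf=24.0, t3=30.0, t5=48.0)
# An embryo going from 3 to 5 cells within 2.5 h shows DC25 only.
labels_b = direct_cleavage_labels(t_pnf=22.0, t3=36.0, t5=38.5)
```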
The architecture of the direct cleavage model is visualized in Figure 6. It consists of a MobileNetV2 [27] backbone applied to 1 frame/hour from 20-84 hpi, with shared weights across all frames. The output for each frame (8 × 8 × 1280) is then spatially averaged (1 × 1 × 1280) and temporally concatenated into a 64 × 1280 feature vector. This feature vector is passed into two fully convolutional networks, one for predicting DC13 and one for predicting DC25. Each network consists of 7 one-dimensional convolutional layers with kernel sizes [1,4,4,4,4,4,1] and output channels [128,128,64,32,32,32,1]. The last layer uses a sigmoid activation function, whereas all other layers use the rectified linear unit. Finally, the maximum value along the temporal output vector (49 × 1) represents the prediction score of either DC13 or DC25. The DC model is trained using the Adam optimizer [24] with $\beta_1$ and $\beta_2$ set to 0.9 and 0.999, respectively, with a one-cycle learning rate schedule [25] and an initial learning rate of 1e-4 and a maximum learning rate of 1e-3. We sample random frame sequences of 16 frames from all videos in the training dataset and label them as DC13 and/or DC25 if the respective direct cleavage is fully visible in the frames according to the embryologist annotations. The training data are sampled such that non direct cleavages, DC13 and DC25 represent 50%, 25% and 25% of a batch, respectively. We use a batch size of 24, 15% dropout, and a loss function consisting of the sum of two binary cross-entropy losses, one for each output. The model is trained with 52,200 batches.
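The temporal dimensions quoted above can be checked with a few lines of arithmetic. The sketch assumes stride-1 convolutions without padding or dilation, which the text does not state explicitly but which reproduces the reported 49 × 1 output from 64 input frames.

```python
def temporal_output_length(n_frames, kernel_sizes):
    """Output length of stacked 1D convolutions (stride 1, no padding assumed)."""
    length = n_frames
    for k in kernel_sizes:
        length = length - k + 1
    return length


def receptive_field(kernel_sizes):
    """Number of input frames that influence a single output element."""
    return 1 + sum(k - 1 for k in kernel_sizes)


KERNEL_SIZES = [1, 4, 4, 4, 4, 4, 1]

full_video = temporal_output_length(64, KERNEL_SIZES)      # 49
training_clip = temporal_output_length(16, KERNEL_SIZES)   # 1
frames_seen = receptive_field(KERNEL_SIZES)                # 16
```

Under this assumption each output element sees exactly 16 consecutive frames, so a 16-frame training sequence yields a single score, while a full 64-frame video yields 49 scores over which the maximum is taken.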
Table 1b describes the test set used to evaluate the direct cleavage model as well as two confusion matrices, showing classification performance for DC13 and DC25, individually. The model achieves an accuracy of 92% and an AUC of 0.95 for DC13, and an accuracy of 90% and an AUC of 0.88 for DC25.
For estimation of the logistic parameters, the KID outcome is the dependent variable, whereas model predictions are predictors that are first extracted for both the day 2/3 model ($y_{\text{Day 2/3}}$) and the direct cleavage model ($y_{\text{DC13}}$ and $y_{\text{DC25}}$). For calculating day 2/3 model scores, day 5+ image sequences are truncated by multiples of 24 hours to resemble day 2 or day 3 image sequences. The resulting model is given by:
The results of the combination model are available in Table 2a and Table 2b.

A.3 Calibration
To facilitate the use of iDAScore v2.0 for both ranking embryos within a cohort and predicting chances of pregnancy, we calibrate the score output to better match implantation rates [1]. We calibrate the model separately for each day (2, 3, and 5+) on single embryo transfers in the training set in Table 1a. The models are calibrated using Platt scaling [28], which is based on a logistic regression model. The calibration curves are shown in Figure 3. Since implantation probabilities depend not only on embryo characteristics but also on patient demographics and clinical practice, calibration performance may not generalize across different subgroups. Therefore, to avoid confusion between patient-wide and patient-specific probabilities, the calibrated scores are ultimately linearly scaled to the range 1.0-9.9, which is the final output of iDAScore v2.0.
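The calibration step can be sketched as follows. This is a minimal illustration under stated assumptions: the logistic fit uses plain gradient descent without the target smoothing of Platt's original procedure, the toy scores and outcomes are invented, and the linear map assumes that calibrated probabilities 0 and 1 correspond to 1.0 and 9.9, which the text does not specify.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, outcomes, lr=0.1, n_iter=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the mean log-loss.

    Simplification: no target smoothing, unlike Platt's original method.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, outcomes):
            err = sigmoid(a * s + b) - y  # derivative of log-loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def to_display_score(p, low=1.0, high=9.9):
    """Linearly map a calibrated probability in [0, 1] to [low, high].

    Assumption: the end points of the map are p = 0 and p = 1.
    """
    return low + (high - low) * p


# Invented toy data: raw model outputs and implantation outcomes (1 = KID+).
raw_scores = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0]
kid_outcomes = [0, 0, 1, 0, 1, 0, 1, 1]

a, b = fit_platt(raw_scores, kid_outcomes)
calibrated = [sigmoid(a * s + b) for s in raw_scores]
display = [to_display_score(p) for p in calibrated]
print([round(d, 1) for d in display])
```

Because both the logistic calibration (with a positive slope) and the linear rescaling are monotonic, the ranking of embryos within a cohort is unchanged by this step; only the scale of the output changes.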

Figure 3. Calibration curves linking predicted probabilities to actual success rates for day 2, 3 and 5+ single embryo transfers, respectively. The dotted line represents perfect calibration. Grouped observations (triangles) represent success rates for embryos grouped by similar predictions. Loess calibration (solid line) represents a smoothed estimate of observed success rates in relation to model predictions. The shaded area is the 95% confidence interval. The relative distributions of scores for positive and negative pregnancy outcomes are shown at the bottom of the graph.

Figure 4. Evaluation of predictive performance over time.

Figure 5. Evaluation of the biological explainability of iDAScore v2.0, comparing the estimated implantation likelihood for embryos with similar morphokinetic parameters (KID+ rate) with their predicted implantation rate at 116 hpi, 68 hpi, and 44 hpi. The evaluation uses the day 5+ test set with the assumption that deselected embryos have an implantation likelihood of 0%. The bars denote the 95% confidence interval. There is good concordance between the predicted implantation rate and the actual implantation rate, with slight overestimation at the extremes.

Figure 6. Architecture overview of the model that predicts whether there is a direct cleavage from one to three cells or from two to five cells.

Table 1. Datasets for training and testing the model.
Comparisons on subset of test set with annotations required by KIDScore D3 and D5.

Table 2. AUCs on the test set for the different model components across days of incubation. All denotes KID+ vs. KID- and discarded embryos, whereas KID denotes KID+ vs. KID- embryos. All AUCs are reported with 95% confidence intervals in brackets. (a) lists results on the full test set from Table
Rate at which top scoring embryo is KID+.

Table 4. Datasets used for developing the direct cleavage model and combination model. Both the (a) training data (80%) and (b) validation data (20%) are subsets of the original training dataset in Table

Table 5. Test data and test results for the direct cleavage model. The test data in (a) are a subset of the original test dataset in Table 1b. (b) and (c) show confusion matrices of thresholded model predictions for the DC13 and DC25 outputs, individually. Thresholds are chosen to maximize accuracy on the validation data in Table 4b.
To combine the scores of the day 2/3 model and the direct cleavage model, a multivariate logistic regression model is developed. Because direct cleavages are rare in the training set of day 2 and 3 KID embryos, we use day 5+ KID embryos from the training set in Table 1a to estimate the parameters of the logistic model. Transferred embryos on day 5+ generally include more direct cleavages than those transferred on day 2 or 3, since blastocyst presence outweighs cleavage-stage morphokinetic parameters such as DC13 and DC25 in most selection strategies.

Table 6. Distribution by clinic of day 2 embryos.

Table 7. Distribution by clinic of day 3 embryos.

Table 8. Distribution by clinic of day 5+ embryos.