A mobile-optimized artificial intelligence system for gestational age and fetal malpresentation assessment

Background: Fetal ultrasound is an important component of antenatal care, but a shortage of adequately trained healthcare workers has limited its adoption in low-to-middle-income countries. This study investigated the use of artificial intelligence for fetal ultrasound in under-resourced settings.

Methods: Blind sweep ultrasounds, consisting of six freehand ultrasound sweeps, were collected by sonographers in the USA and Zambia, and by novice operators in Zambia. We developed artificial intelligence (AI) models that used blind sweeps to predict gestational age (GA) and fetal malpresentation. AI GA estimates and standard fetal biometry estimates were compared to a previously established ground truth and evaluated for differences in absolute error. Fetal malpresentation (non-cephalic vs cephalic) was compared to sonographer assessment. On-device AI model run-times were benchmarked on Android mobile phones.

Results: Here we show that the GA estimation accuracy of the AI model is non-inferior to standard fetal biometry estimates (error difference −1.4 ± 4.5 days, 95% CI −1.8, −0.9, n = 406). Non-inferiority is maintained when blind sweeps are acquired by novice operators performing only two of six sweep motion types. Fetal malpresentation AUC-ROC is 0.977 (95% CI 0.949, 1.00, n = 613); sonographers and novices have similar AUC-ROCs. Software run-times on mobile phones for both diagnostic models are less than 3 s after completion of a sweep.

Conclusions: The gestational age model is non-inferior to the clinical standard, and the fetal malpresentation model has high AUC-ROCs across operators and devices. Our AI models are able to run on-device, without internet connectivity, and provide feedback scores to assist in upleveling the capabilities of lightly trained ultrasound operators in low-resource settings.

Q8: Line 452-453: "No intermediate fetal biometric measurements were required during training or generated during inference." What are the advantages of this? Fetal biometric measurements can provide additional information to support clinical evaluation. Also, during model training, compared with predicting GA directly, combining biometric parameter learning in a multitask manner has the potential to improve model training.
Q9: Line 467-469: "For each training set case, fetal malpresentation was specified as one of four possible values by a sonographer (cephalic, breech, transverse, oblique), and dichotomized to "cephalic" vs "non-cephalic"." a) Why are the numbers of categories for the position of the fetus different (four and two) in the description? b) If the AI model only needs to perform binary classification, why are "breech, transverse and oblique" annotations required during labeling? c) Besides, if four classification tags have already been labeled by the sonographer, why not consider a four-class classification model? d) Further, if learning a four-class classifier is more difficult than a binary model, the authors could consider combining the two types of classification models into one model, which is easy to implement and could help improve model performance while reducing learning difficulty.
Q10: Line 483-485: "For the GA model, each blind sweep was divided into multiple video sequences. For the fetal malpresentation model, video sequences corresponded to a single complete blind sweep." Why do the two models have different input sizes (multiple sub-videos vs the whole video)? As described, the feature extractor for both is MobileNetV2, so the model inputs could theoretically be the same.
Q11: Line 491-492: "The convolutional component uses the MobileNetV2 network as a feature extractor for each video frame." The authors only perform experiments based on MobileNetV2. However, as a study focusing on mobile-device models, the authors should consider further experiments with other lightweight architectures, including MobileNetV3, the ShuffleNet family, etc. These models are readily available in PyTorch/Torchvision (and in TensorFlow/Keras).
Q12: Line 537-539: Why are the input resolutions of the two models different? The motivation should be clearly introduced. Also, in lines 541-550, the motivation for the different settings of the two models should be provided.
Q13: Line 553-553: "The gestational age regression model uses the gestational age ground truth associated with the case as the training label for all video clips within the case." An accurate GA value should be calculated and evaluated on a specific plane (i.e., a standard plane). However, the authors treat the ground truths of multiple video clips as the same GA. Since not all planes in the video clips are suitable for GA measurement, this may seriously confuse model learning.
Q14: Line 562-574: Some unusual parameter settings need to be better explained, for example, the "keep probability of 0.863 for the gestational age model", "the gestational age model began with a learning rate of 4.58e-4 and ended with 4.58e-7 after 1 million training steps", etc.
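For reference, the quoted schedule (4.58e-4 decaying to 4.58e-7 over 1 million steps) is an overall factor-of-1000 decay, consistent with a standard exponential learning-rate schedule. A minimal sketch, assuming continuous exponential decay (the exact schedule shape is an assumption, not taken from the manuscript):

```python
def exponential_lr(step, lr_start=4.58e-4, lr_end=4.58e-7, total_steps=1_000_000):
    """Learning rate under exponential decay from lr_start to lr_end.

    The overall decay factor lr_end / lr_start is 1e-3, applied smoothly
    across total_steps training steps.
    """
    return lr_start * (lr_end / lr_start) ** (step / total_steps)
```

At step 0 this returns 4.58e-4 and at step 1,000,000 it returns 4.58e-7, matching the quoted endpoints.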
Q15: Line 588-589: How do the authors ensure that all divided clips have the same length to match the LSTM input? In other words, uniform division easily leads to inconsistent sub-video lengths; how is this handled?
Q16: Line 602-604: Is the fetal presentation model also equipped with an LSTM module? If yes, why can the whole sweep video be input to the model without strictly matching a fixed length, as the LSTM in the GA model requires? If not, the authors should describe the differences between the model architectures more clearly.
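For context on the fixed-length requirement raised in Q15, the simplest approach is to cut each sweep into non-overlapping fixed-length clips and drop the remainder frames. This is an illustrative sketch only; the function name and remainder handling are assumptions, not the authors' implementation:

```python
def split_into_clips(frames, clip_len):
    """Divide a sweep (a sequence of frames) into fixed-length clips.

    Trailing frames that do not fill a complete clip are dropped, so every
    clip has exactly clip_len frames (padding or overlapping windows are
    common alternatives).
    """
    n_clips = len(frames) // clip_len
    return [frames[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]

# A 10-frame sweep with 4-frame clips yields two clips; 2 frames are dropped.
clips = split_into_clips(list(range(10)), clip_len=4)
```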
Q17: Line 618-638: Some vital information is missing from the figures. For example, in part "C", the feedback score of each image should be provided in detail (e.g., by putting the values in the lower-left corner).
Q18: The authors could consider combining the two models into one. This paper focuses on mobile model construction; thus, fewer model parameters make deployment easier. The two models have almost the same architecture (Extended Data Figure 3), so there is no technical difficulty in merging them. If it works, both training and evaluation become simpler, and both GA and classification results can be shown on the phone simultaneously, which is faster and memory-saving.
Q19: Typos: The authors should double-check the whole text carefully. For example: Line 97: "an a prior criterion"; Line 112: "bind sweep" -> "blind sweep".

Reviewer #2 (Remarks to the Author): Part 1

This is a very well written and intriguing manuscript that essentially describes the hope for the future. The authors have done an excellent job of describing both the study parameters and the results (part 1). While this reviewer does have experience in the techniques described, including AI in imaging and imaging education in low-resource settings, the paper nicely provides evidence that the future is closer than many people think. The data nicely show the differences between experienced and inexperienced operators, and between high-end machines and more basic machines; however, to the credit of the study, these differences are not extreme.

Specific comments to the authors:
1. The authors could add a number of other innovative low-end scanners used by patients that provide almost immediate feedback as a safety net.
2. The authors have omitted a number of other references to the 6-step method that has shown value and could be compared to the current study.
3. Some commercially available low-end systems ($500) that have feedback should also be considered.
4. What was the reason for fewer sweeps in the low-end group? It certainly doesn't extend the study time to any significant degree.
5. Was any feedback given by the patients? If so, were there specific questions? And if so, what was revealed?
6. Please define ground truth for the readers.

With regards to reviewing the methods section from line 443, I am confused as to where this fits. Is it really part of the paper? A supplement? It reads well, is interesting, and appears valuable, but it seems out of place and doesn't flow well.

Response to Reviewer #1: Thank you for your detailed and thoughtful review. Please see our responses to your comments and questions: The authors proposed an AI-based system for fetal ultrasound analysis.
Specifically, the system can handle automatic GA and fetal malpresentation estimation in videos obtained under the "blind sweep" protocol. In experiments on the Fetal Age Machine Learning Initiative (FAMLI) and Novice User Study datasets, the proposed method achieves good GA regression performance and high AUC-ROC in the binary classification task. The models also show robustness across operators (experts and novices) and devices (standard and low-cost). The proposed method seems to have the potential to help low-to-middle-income countries (LMICs), which lack ultrasound devices for lifesaving treatment. However, I have the following major concerns: • Q1: The title of this paper is not very accurate. "AI system for fetal ultrasound" is a very big topic; however, this paper only covers two sub-topics: GA and fetal malpresentation estimation. ○ We have revised the title of the manuscript to: "Expanding global access to fetal ultrasound: A mobile-optimized AI system for gestational age and fetal malpresentation assessment" ○ A large number of different types of sweeps were performed during study data collection since it was designed before the AI model results were known. We limited our test set evaluations to 6 and 2 sweeps so that the procedure can be feasibly taught and performed by non-experts in LMIC settings. This is indicated in the "Simplified sweep evaluation" section: either 6 or 2 sweeps are sufficient for maintaining non-inferiority of the sweep procedure relative to the clinical standard, and MAE for both procedures is similar (Table 1). The AI model makes gestational age predictions on fixed-length video clips, and these predictions are aggregated together to form the final prediction. The number of sweeps per patient visit changes the number of predictions available for aggregation, but the underlying AI model does not change based on the number of sweeps performed. We added clarification to the Mobile Device Inference section, third paragraph.
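The aggregation step described above, in which the per-clip model stays fixed and only the number of predictions varies with the sweep count, might look like the following sketch. The unweighted mean and the name `aggregate_ga` are illustrative assumptions, not the manuscript's actual aggregation rule:

```python
def aggregate_ga(clip_predictions):
    """Combine per-clip gestational age estimates (in days) into one case estimate.

    The underlying clip-level model never changes; performing more sweeps
    simply contributes more clip predictions to this aggregation step.
    """
    if not clip_predictions:
        raise ValueError("at least one clip prediction is required")
    return sum(clip_predictions) / len(clip_predictions)

# Two sweeps produce fewer clips than six, but the same aggregation applies.
estimate = aggregate_ga([268.0, 271.5, 270.0, 269.5])
```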
• Q5: Line 307-318: The caption for "Top left" seems to be missing. ○ We have added the missing caption in Figure 2 of the revised manuscript. ○ Fetal biometry does not involve blind sweeps but rather deliberate measurement of expertly acquired images (HC, AC, FL, etc.). It is true that more raw information may be available in the sweeps, but this is a benefit rather than a drawback of using our method. Without AI, there was previously no way to process this information using human interpretation, and clinical standard estimates ignore this information. ○ Clinical standard estimates cannot use blind sweeps; they require images carefully selected by an expert sonographer and manual measurement of the size of anatomy in the images. We have clarified this in the Introduction.
○ Standard fetal biometry doesn't use sweep videos, but rather single images. An evaluation of AI on these images is an interesting topic for another paper but is out of scope here. It would require a different AI model that operates on single images rather than video. • d) See "Standard fetal biometry estimates MAE ± sd (days)": why does data collected by novices with a standard ultrasound device show better performance than that collected by sonographers (4.8 days vs. 5.2 days)? This seems a little unreasonable.
○ Standard fetal biometry estimates are always collected on full-size commercial machines by expert sonographers (the procedure cannot be performed by novices) even in the novice dataset. Only the blind sweeps are collected by novices. We have clarified this in the "Fetal Age Machine Learning Initiative (FAMLI) and Novice User Study Datasets" section, first paragraph, and the caption for Table 1. The difference in estimates the reviewer points out is within the standard deviation of random variability and not due to differences in the data collection procedure. ○ The purpose of our system is to remove the requirement of an expert sonographer from the equation. An expert sonographer is required to acquire fetal biometric measurements. ○ We mentioned this to emphasize the difference between the way our AI+blind sweep model works (end-to-end from pixels to predictions) versus the clinical standard, which produces intermediate measurements. There may be a potential use case for including fetal biometric measurements during training. However, we experimented with this and in our case it didn't help model performance.
• Q9: Line 467-469: "For each training set case, fetal malpresentation was specified as one of four possible values by a sonographer (cephalic, breech, transverse, oblique), and dichotomized to "cephalic" vs "non-cephalic"." • a) Why are the numbers of categories for the position of the fetus different (four and two) in the description?
○ From a clinical perspective, "cephalic" is considered normal while all noncephalic presentations are abnormal and require further attention. We've added a clarification in the Algorithm Development section, third paragraph. We envision our AI model serving as a triage tool for flagging abnormal cases for referral.

• b) If the AI model only needs to perform binary classification, why are "breech, transverse and oblique" annotations required during labeling?
○ These labels come from standard sonographic exams during the patient visit, and this is the clinical standard assessment. • c) Besides, if four classification tags have already been labeled by the sonographer, why not consider a four-class classification model?
○ Evaluation metrics for 2-class problems are better defined than multi-class metrics, and this 2-class dichotomization is more compatible with the clinical use case of triaging abnormal cases. Multi-class analogues of AUC and sensitivity/specificity are less commonly used and more difficult to interpret. • d) Further, if learning a four-class classifier is more difficult than a binary model, the authors could consider combining the two types of classification models into one model, which is easy to implement and could help improve model performance while reducing learning difficulty.
○ While this may be a useful suggestion for future work, it's not clear that including the additional label information will improve performance of the model when evaluated on the cephalic versus non-cephalic binary task, which we feel is the fundamental problem from a clinical perspective.
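The dichotomization and binary evaluation discussed in this exchange can be sketched as follows. The labels, scores, and function names are illustrative assumptions, and the AUC computation here is the standard Mann-Whitney formulation rather than the manuscript's evaluation code:

```python
def dichotomize(label):
    """Map the four sonographer labels to the binary clinical task:
    cephalic (normal) -> 0, any non-cephalic presentation -> 1."""
    return 0 if label == "cephalic" else 1

def auc_roc(labels, scores):
    """Binary AUC-ROC: the probability that a random positive outscores a
    random negative, counting ties as half (Mann-Whitney U / (n_pos * n_neg))."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

raw_labels = ["cephalic", "breech", "cephalic", "transverse", "oblique"]
y_true = [dichotomize(lbl) for lbl in raw_labels]  # [0, 1, 0, 1, 1]
model_scores = [0.10, 0.90, 0.30, 0.80, 0.60]      # hypothetical model outputs
```

A multi-class extension would require choosing among one-vs-rest or pairwise AUC variants, which is exactly the interpretability concern noted in the response.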
• Q10: Line 483-485: "For the GA model, each blind sweep was divided into multiple video sequences. For the fetal malpresentation model, video sequences corresponded to a single complete blind sweep." Why do the two models have different input sizes (multiple sub-videos vs the whole video)? As described, the feature extractor for both is MobileNetV2, so the model inputs could theoretically be the same.
○ The two problems are very different. Gestational age can reasonably be estimated from short video clips (even a single image), and these estimates can be aggregated together to improve performance. Fetal malpresentation cannot be determined by looking at a short sequence or individual images. Instead, judging the orientation of the fetus requires viewing the full video to assess spatial relationships between anatomical regions. We have added clarification in the Data Preprocessing section, fourth paragraph. • Q18: The authors could consider combining the two models into one. The two models have almost the same model architectures (Extended Data Figure 3), so there is no technical difficulty in merging them together. If it works, both training and evaluation become simpler, and both GA and classification results can be shown on the phone simultaneously, which is faster and memory-saving. ○ This could be considered for future work, but combining the two models is not simple given the different video pre-processing and sequence lengths used for the fetal presentation and GA models. See the explanation above on the differences between the two estimation problems. We demonstrated the feasibility of executing both models simultaneously on the same device without sacrificing real-time performance.
• Q19: Typos: The author should double-check the whole text carefully. For example: Line 97: "an a prior criterion". ○ This reads "an a priori criterion for non-inferiority" and is not a typo. "A priori" indicates that the criterion for deciding non-inferiority was determined before the analysis was performed. • Line 112: "bind sweep" -> "blind sweep" ○ Corrected in the revised manuscript (Mobile-device-optimized AI gestational age and fetal malpresentation estimation section, third paragraph).
Response to Reviewer #2: Thank you for your encouraging comments and thoughtful review. Please see our responses below: