Artificial intelligence assistance for fetal development: evaluation of an automated software for biometry measurements in the mid-trimester

Background This study presents CUPID, an automated measurement software based on artificial intelligence (AI), designed to evaluate nine fetal biometric parameters in the mid-trimester. Our primary objective was to compare the performance of CUPID with that of experienced senior and junior radiologists. Materials and methods This prospective cross-sectional study was conducted at Shenzhen University General Hospital between September 2022 and June 2023 and focused on mid-trimester fetuses. All ultrasound images of the six standard planes that enabled the evaluation of nine biometric measurements were included, and the performance of CUPID was evaluated through subjective and objective assessments. Results There were 642 fetuses with a mean (±SD) gestational age of 22 ± 2.82 weeks at enrollment. In the subjective quality assessment, out of 642 images representing nine biometric measurements, 617-635 images (90.65-96.11%) of CUPID caliper placements were judged to be accurately placed and required no adjustment. By contrast, for the junior radiologist, 447-691 images (69.63-92.06%) were judged to be accurately placed and required no adjustment. In the objective assessment, across all nine biometric parameters and estimated fetal weight (EFW), the intra-class correlation coefficients (ICC) (0.843-0.990) and Pearson correlation coefficients (PCC) (0.765-0.978) between the senior radiologist and CUPID indicated better reliability than the ICC (0.306-0.937) and PCC (0.566-0.947) between the senior and junior radiologists. Additionally, the mean absolute error (MAE), percentage error (PE), and average error in days of gestation were lower between the senior radiologist and CUPID than between the senior and junior radiologists. The specific differences were as follows: MAE (0.36-2.53 mm, 14.67 g) compared to (0.64-8.13 mm, 38.05 g), PE (0.94-9.38%) compared to (1.58-16.04%), and average error in days (3.99-7.92 days) compared to (4.35-11.06 days). In terms of measurement time, CUPID took only 0.05-0.07 s to measure each of the nine biometric parameters, whereas the senior and junior radiologists required 4.79-11.68 s and 4.95-13.44 s, respectively. Conclusions CUPID proved to be highly accurate and efficient software for automatically measuring fetal biometry, gestational age, and fetal weight, providing a precise and fast tool for assessing fetal growth and development.


Background
Accurate biometric measurements conducted on ultrasound images enable the evaluation of fetal normality, including the estimation of fetal size, gestational age (GA), and estimated fetal weight (EFW) [1][2][3][4]. These measurements are important for distinguishing between fetal size at a given timepoint and fetal growth [5,6]. Ultrasound measurements can also facilitate the identification of abnormalities. They can detect developmental abnormalities of individual organs, such as the central nervous system (CNS) [7,8] and the skeletal and limb systems [9], uneven development, and overall developmental abnormalities such as fetal growth restriction (FGR), small for gestational age (SGA), and large for gestational age (LGA) [6,10,11]. Comprehensive measurements can help in making informed decisions regarding the fetus, including potential interventions, intrauterine therapy, or even the option of pregnancy termination [12,13]. Because the accuracy of biometric measurements depends heavily on the operator's expertise [14], manual measurement suffers from poor consistency and potential diagnostic errors [15]. Moreover, manual measurement is time-consuming, and the repeated keystrokes it requires can cause repetitive stress injuries in operators [16], especially during refined mid-trimester measurements to assess fetal growth and development [17][18][19][20].
The main objective of this pilot study was to assess CUPID's performance and efficiency in measuring nine fetal biometric parameters compared with two radiologists with different levels of experience.

Data collection
A prospective cross-sectional study was conducted at Shenzhen University General Hospital between September 2022 and June 2023, in which 700 pregnant women in their mid-trimester were enrolled, as shown in Fig. 1. Only women with healthy singleton pregnancies and a certain fetal crown-rump length were included in the study. All examinations were performed by two senior radiologists with over 10 years of experience in obstetrics, using GE Voluson E8/E10 ultrasound machines (GE Healthcare, Zipf, Austria) equipped with C1-6 probes. The collected patient measurement data comprised six standard planes: transcranial, transthalamic, transcerebellar, abdominal circumference, femur, and humerus. We implemented quality control on the collected images: two expert radiologists with over 20 years of experience, who were not involved in data collection, evaluated the six standard planes following the ISUOG guidelines. Specific evaluation criteria included complete anatomy, appropriate size, and high image quality to ensure optimal imaging plane acquisition [2]. Each plane was rated as either standard or non-standard, and only images rated as standard planes by both experts were included in the study. After the quality control process, a total of 642 cases were enrolled. The Research Ethics Committee of Shenzhen University General Hospital approved the study, and informed consent was obtained from all women.

Study design
We designed our study based on a validation dataset obtained after a rigorous review process. This study involved conducting independent and comprehensive data collection within the setting of Shenzhen University General Hospital, without modifying the algorithm. Biometric measurements were taken twice, two weeks apart, by both the manual and automatic groups, following the standards defined by the ISUOG Practice Guidelines [2,6,21]. The manual group comprised two radiologists: an experienced senior obstetric radiologist (N. Liu, Senior) with over 10 years of expertise, and a junior radiologist (X. Han, Junior) with 3 years of experience in obstetrics. All manual group measurements were annotated manually by Senior and Junior using the Pair [28] annotation software package. In the automatic group, the corresponding images were input into the CUPID software, which automatically obtained the measurements and estimated the GA and fetal weight. The performance of the manual and automatic groups was compared on the following measurements: BPD, OFD, HC, LVW, TCD, PCFW, AC, FL, HL, GA and EFW. All examiners were blinded to the measurements obtained during ultrasound examination. The measurements of the Senior (N. Liu) were used as the gold standard against which the performance of the Junior (X. Han) and CUPID was compared.
The annotation process was conducted in three distinct phases. In phase 1, a subjective clinical assessment was performed once on each image by Senior to determine whether the caliper placement of CUPID and Junior was correct. Caliper position was classified as either a good fit or an adjustment required [11]. In phase 2, we performed objective assessments to compare the consistency and relative error between radiologists with different seniority levels and CUPID for the biometric measurements. These measurements included BPD, OFD, HC, LVW, TCD, PCFW, AC, FL, and HL. This evaluation aimed to determine whether CUPID met clinical practice standards. Examples of the manual and automatic measurement results for the nine biometric measurements and CUPID's product interface are shown in Figs. 2 and 3. In addition, the measured values of BPD, HC, TCD, AC, and FL were used to determine the gestational age and estimated fetal weight based on the Hadlock formula [23]. In phase 3, the time taken to measure each parameter was recorded for the manual and automatic groups, enabling a comprehensive evaluation of the efficiency of CUPID in fetal biometric estimation.
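The text cites the Hadlock formula [23] for EFW without stating which variant was used. As an illustrative sketch only, the widely quoted four-parameter Hadlock regression (BPD, HC, AC, and FL, all in cm) could be coded as shown here. The choice of variant, the function name, and the example measurements are assumptions and are not taken from the study.

```python
def hadlock_efw_grams(bpd_cm, hc_cm, ac_cm, fl_cm):
    """Estimated fetal weight (g) from the four-parameter Hadlock regression.

    Assumption: this is the commonly quoted BPD/HC/AC/FL variant; the study
    does not state which Hadlock equation CUPID applies. Inputs are in cm.
    """
    log10_efw = (
        1.3596
        - 0.00386 * ac_cm * fl_cm
        + 0.0064 * hc_cm
        + 0.00061 * bpd_cm * ac_cm
        + 0.0424 * ac_cm
        + 0.174 * fl_cm
    )
    return 10 ** log10_efw


# Hypothetical mid-trimester measurements (converted from mm to cm)
efw = hadlock_efw_grams(bpd_cm=5.5, hc_cm=20.0, ac_cm=17.5, fl_cm=3.9)
print(f"EFW: {efw:.0f} g")
```

Because the regression is expressed in log10 grams, small caliper errors in AC or FL are amplified in the EFW, which is one reason EFW is reported alongside the nine direct measurements.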

The development of commercial CUPID software
To develop CUPID into mature commercial software, we initially conducted tasks such as model training and performance evaluation. The dataset used for training CUPID's artificial intelligence algorithm was independently collected and established at different stages of software development. The algorithm employed by CUPID is based on a traditional convolutional neural network architecture. The dataset used to train CUPID encompasses 5000 cases, divided into training, validation, and test sets in a 7:1:2 ratio. After rigorous preliminary clinical trials conducted across multiple centers, we found that CUPID's performance in nine key measurement items is highly consistent with manual measurements by experts, with a consistency coefficient exceeding 0.9. To enhance the performance of CUPID, we opted to deploy the CUPID software on the Nvidia Clara AGX developer kit with RTX 6000. The kit was purpose-built for medical instruments that require advanced computation to support various real-time workloads, and it comes with a fully tested operating system and drivers. To improve the performance further, Nvidia TensorRT techniques were also adopted in the product settings. These techniques achieved a twofold speed-up compared to the ONNXRuntime architecture [29].
Fig. 1 Flowchart summarizing the study design
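The 7:1:2 division of the 5000 development cases can be illustrated with a short sketch. The paper does not say how cases were assigned to each subset, so the random shuffle at case level, the seed, and the function name below are assumptions.

```python
import random


def split_cases(case_ids, ratios=(7, 1, 2), seed=0):
    """Shuffle case identifiers and split them by the given ratio.

    Assumption: splitting is random and done per case; the source only
    reports the 7:1:2 proportions.
    """
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train, val, test = split_cases(range(5000))
print(len(train), len(val), len(test))  # 3500 500 1000
```

Splitting by case rather than by image keeps all planes from one fetus in the same subset, which avoids leakage between training and test data.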

Statistical analysis
This study utilized intra- and inter-class correlation coefficients (ICC), the Pearson correlation coefficient (PCC), and Bland-Altman (BA) plots to assess repeatability and reproducibility. The mean absolute error (MAE) and percentage error (PE) were also used to understand the variability associated with individual measurements. A paired Welch's two-sample test was used to analyze the statistical significance between the groups, and p < 0.05 was considered statistically significant. Statistical analyses were performed using SPSS version 26.0.
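The error and correlation metrics named above can be sketched in a few lines. The study used SPSS, so this is not the authors' code; in particular, the paper does not define PE, and the definition used here (mean absolute error relative to the reference value, in percent) is an assumption. The sample values are hypothetical.

```python
import math


def mean_absolute_error(reference, test):
    """Mean of |test - reference| over paired measurements."""
    return sum(abs(t - r) for r, t in zip(reference, test)) / len(reference)


def percentage_error(reference, test):
    """Assumed definition: mean of |test - reference| / reference, in percent."""
    return 100 * sum(abs(t - r) / r for r, t in zip(reference, test)) / len(reference)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical BPD values in mm: Senior as reference, CUPID as test
senior = [52.0, 55.5, 49.8, 60.2, 57.1]
cupid = [52.4, 55.1, 50.3, 59.8, 57.6]

mae = mean_absolute_error(senior, cupid)
pe = percentage_error(senior, cupid)
pcc = pearson_r(senior, cupid)
print(f"MAE = {mae:.2f} mm, PE = {pe:.2f}%, PCC = {pcc:.3f}")
```

MAE and PE capture absolute disagreement, whereas PCC only captures linear association, so a constant offset between observers would leave PCC high while raising MAE.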

Characteristics of the study
A total of 642 fetuses were recruited after consent was provided for the study. The mean gestational age of the fetuses was 22 ± 2.82 weeks (SD) (range: 18-24 weeks of gestation). Table 1 lists the characteristics of the pregnant women and fetuses. The maternal characteristics reported are maternal age, median cervical canal length, history of cesarean section, previous open abdominal surgery, and prior laparoscopic abdominal surgery. The fetal information comprises gestational week, placental position, and the deepest vertical pocket of amniotic fluid (DVP).

The intra-observer reproducibility
Intra-observer reproducibility was assessed for measuring 5778 biometric variables derived from 642 fetuses. As shown in Table 2, the ICC for Senior, Junior and CUPID were 0.974-0.999, 0.749-0.934 and 1.0, respectively. High ICC values demonstrated excellent agreement and consistency between repeated measurements, indicating strong intra-observer reproducibility.
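To make the ICC values concrete, the sketch below computes a two-way random-effects, absolute-agreement, single-measure ICC, often written ICC(2,1), from an n-by-k table of repeated measurements. The paper does not state which ICC form SPSS was configured to report, so this choice, the function name, and the sample data are assumptions.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a list of rows, one row per subject, one column per session.

    Assumption: two-way random effects, absolute agreement, single measure;
    the source does not specify the ICC model.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical repeated measurements in mm (two sessions per fetus)
human_repeat = [[50.0, 50.6], [55.0, 54.5], [60.0, 60.4], [52.0, 51.8]]
software_repeat = [[50.0, 50.0], [55.0, 55.0], [60.0, 60.0]]

print(f"human ICC = {icc_2_1(human_repeat):.3f}")
print(f"software ICC = {icc_2_1(software_repeat):.3f}")
```

When the two sessions are identical, as in the second example, the error and session mean squares vanish and the ICC equals 1, which is the behavior expected of a deterministic measurement tool run twice on the same image.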

Objective assessment
The objective measurement indicators assessed the inter-observer reproducibility of measurements for 5778 biometric variables derived from 642 fetuses. Table 4 presents the results of Senior, Junior and CUPID for the nine fetal biometric measurements and EFW; for most measurements, the CUPID results were closer to those of Senior than the Junior results were. The ICC, PCC, MAE, and PE values are presented in Table 5. The ICC between Senior and CUPID (0.843-0.990) reflected better reliability than that between Senior and Junior (0.306-0.937). The PCC showed the same pattern as the ICC, demonstrating that the CUPID results have a higher linear correlation with those of Senior. In addition, the MAE and PE results in Table 5 again demonstrate the reliability of CUPID. Figure 4 illustrates the correlation distribution map of the nine biometric measurements and EFW. The agreement distribution of all measurements is shown in Fig. 5, which shows that CUPID and Senior are more consistent, especially for the measurement items TCD, LVW, AC, FL, HL and EFW. The errors in days relative to the true gestational age for Senior, Junior and CUPID are compared in Table 6, and Fig. 6 illustrates the error curve for gestational age estimation. CUPID also showed superior performance in estimating gestational age compared to Junior. In conclusion, these results demonstrate a high consistency and correlation between the measurements obtained by CUPID and Senior.
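The agreement analysis behind a Bland-Altman plot reduces to the mean difference (bias) and the 95% limits of agreement, taken here as bias ± 1.96 standard deviations of the paired differences. This is a minimal sketch with hypothetical AC values, not the authors' SPSS procedure.

```python
import statistics


def bland_altman(reference, test):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    diffs = [t - r for r, t in zip(reference, test)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical AC values in mm: Senior as reference, CUPID as test
senior_ac = [170.0, 182.5, 165.2, 190.1]
cupid_ac = [171.0, 181.5, 166.2, 191.1]

bias, lower, upper = bland_altman(senior_ac, cupid_ac)
print(f"bias = {bias:.2f} mm, limits of agreement = [{lower:.2f}, {upper:.2f}] mm")
```

Narrower limits of agreement for one observer pair than for another indicate closer agreement, which is how the Senior-CUPID and Senior-Junior point clouds can be compared.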

Discussion
Fetal growth and development are essential aspects of antenatal care. Fetal ultrasound plays an important role in assessing these conditions through multiple biometric measurements, which rely on the expertise of the operator [6]. We successfully developed a novel artificial intelligence assistance software, CUPID, to automatically measure nine crucial fetal biometric variables obtained in the mid-trimester. In this comprehensive comparative study, we evaluated the caliper placement of CUPID and the accuracy of its measurements to determine its precision in fetal biometry on standard planes and its predictive ability for GA and EFW. The study analyzed images that had already undergone quality control. It included radiologists of different seniority levels, and CUPID consistently outperformed Junior while approaching Senior's performance levels across all biometric measurements. The CUPID system stands out in terms of efficiency, as it can measure all nine biometric parameters in less than 1 second. It is a highly efficient and advantageous option for saving measurement time.
Inconsistencies in doctors' skills can result in measurement and diagnostic errors, emphasizing the significance of quality control in fetal biometric data. Quality control and automated measurements conducted in the mid-trimester offer a valuable means of enhancing the dependability of various assessments [11,30]. We found that CUPID maintained a high degree of consistency with Senior in measurements, and when compared with true gestational age, the HC and AC results of CUPID were better than those of Senior. This may be because doctors employ ellipse fitting when performing HC or AC measurements [12]. In contrast, CUPID fits the complete boundary of the head or abdomen, which is closer to actual development. Similarly, we evaluated CUPID's measurement accuracy for subtle intracranial structures, which are indicators that need to be assessed during the mid-trimester. CUPID's performance was excellent because it could accurately identify the lateral ventricles and cerebellum, providing diagnostic assistance for potential CNS conditions. However, its performance in measuring PCFW was slightly inferior to that for the other indicators. The analysis revealed that PCFW has a high degree of structural variability, which makes it more challenging for AI to learn. Junior's performance on these subtle structures, in contrast, was extremely poor because Junior lacked experience in correctly identifying and measuring them. Junior consistently measured the cerebellum as smaller and the lateral ventricles as larger, which could potentially lead to false positives and unnecessary examinations during clinical diagnosis. Therefore, by comparing the consistency of radiologists with different years of experience and CUPID across all nine biometric measurements, we found that CUPID's performance is closer to that of Senior, and better in some measurement items. All CUPID measurements showed an error of less than 6 days compared to the true gestational age. According to the literature, a predictive error of ±10 days for gestational age in the mid-trimester is acceptable in most clinical settings [31]. It takes approximately 0.5 seconds to perform all nine fetal biometric measurements using CUPID. Therefore, CUPID is a reliable and reproducible automated software for clinical applications. The use of CUPID can reduce work-related musculoskeletal disorders resulting from repetitive movements.
Fig. 5 The Bland-Altman plot shows the agreement between Senior and CUPID, as well as between Senior and Junior, regarding the measurement of nine key fetal biometric parameters and EFW (red dots: Senior and CUPID, blue dots: Senior and Junior)
It is important to acknowledge the limitations of this study despite these compelling findings. First, our investigation was confined to fetuses exhibiting normal intracranial anatomy, which may limit the applicability of our findings to a more diverse population. Second, since this study was conducted in a single-center setting, further multicenter validation is necessary to ascertain the applicability of CUPID's AI measurement across multiple ethnicities and geographical demographics. Finally, this study did not directly compare CUPID's and radiologists' measurements on images that were not subjected to quality control, so some real-world clinical situations may not be represented in the comparison.
In conclusion, the CUPID software has demonstrated exceptional accuracy and efficiency in automatically measuring fetal biometry, gestational age, and fetal weight. This automatic intelligent measurement software provides a rapid and precise method for evaluating fetal growth and development.

Fig. 2 Examples of measurement results obtained by Senior, Junior and CUPID for the nine key fetal biometric parameters. BPD, biparietal diameter; HC, head circumference; OFD, occipitofrontal diameter; TCD, transverse cerebellar diameter; PCFW, posterior cranial fossa pool width; LVW, lateral ventricles width; FL, femoral length; HL, humeral length; AC, abdominal circumference (The red arrow indicates the location of the measurement error)

Fig. 4 The Pearson correlation coefficient plot shows the agreement between Senior and CUPID, as well as between Senior and Junior, regarding the measurement of nine key fetal biometric parameters and EFW (blue dashed line represents Senior, red line represents CUPID, and blue solid line represents Junior)

Table 1
Clinical characteristics of 642 patients undergoing routine prenatal screening in mid-trimester (18-24 weeks of gestation)

Table 2
Intra-observer reproducibility of nine key fetal biometric parameters for Senior, Junior, and CUPID

Table 3
Subjective assessment of the clinical acceptability of caliper placement by CUPID and Junior for measuring nine biometric parameters (n = 642); n denotes the number of participants

Table 4
Measurement results of nine key fetal biometric parameters and EFW, presented as mean ± SD. The nine key fetal biometric parameters are in mm, and EFW is in g

Table 5
Quantitative evaluation of nine key fetal biometric parameters and EFW (ICC: inter-class correlation coefficients, PCC: Pearson correlation coefficient, MAE: mean absolute error, PE: percentage error)

Table 6
Comparison of error days in true gestational age for Senior, Junior and CUPID

Fig. 6 Trend curve of estimated gestational age error in days measured by CUPID, Senior, and Junior

Table 7
Summary of the time taken to measure each parameter by Senior, Junior, and CUPID, in seconds