Currently, population-based surveys capture limited data on maternal and newborn care, and few validity studies have evaluated available or potential indicators. The EN-BIRTH study across five hospitals in three countries included > 14,000 women with observed vaginal births and exit surveys, seven times more births than any previous maternal and newborn indicator validation study. Our dataset enabled validity analyses for 33 maternal and newborn indicators, comparing time-stamped gold-standard observation with exit survey-reported coverage and outcomes, including nine indicators currently in DHS/MICS and new indicators with potential for inclusion.
Overall, we found that 4 of 9 indicators already in DHS/MICS performed well in surveys. Among indicators not already in DHS/MICS, “contact” indicators for small and sick newborns (admission to a neonatal unit or KMC ward) may be useful in population-based surveys, whereas indicators on content of clinical care had high levels of “don’t know” responses and limited validity. Where previous validation research has shown mixed results, for example for uterotonics to prevent postpartum haemorrhage [26,27,28,29,30], we found that survey report under-estimated true coverage by 10%, whereas survey report overestimated early initiation of breastfeeding by nearly five times.
This is the first validity testing for hospital-based clinical care of small and sick newborns (e.g. resuscitation, KMC, and neonatal infection management). The EN-BIRTH study allowed us to assess validity for the smaller numbers of vulnerable newborns who needed special care, such as neonatal resuscitation (5–10%) [43, 44], KMC for newborns weighing ≤2000 g (10–20%) [45, 46] and treatment of presumed severe newborn infection (7%) [47]. These indicators have not been validated before, partly because of sample size challenges, but also because policy attention is more recent [48]. Coverage of KMC was accurately reported by survey in our study, although exit survey questions on KMC were asked only of women whose newborns were admitted to a KMC ward. Further research is required to validate this indicator for all women, including those whose newborns were not admitted to a KMC ward. Population-based surveys, however, even when conducted with large sample sizes, may be under-powered to measure KMC targeted to stable newborns ≤2000 g. Sample size calculations suggest that, at current levels of KMC coverage for neonates ≤2000 g (believed to be under 10%), a national household survey in Nepal would need a 10-fold larger sample than the most recent DHS survey (Table 3). The usefulness of surveys for interventions in subset target groups is a function of the prevalence of the clinical need for the intervention (i.e. the denominator) and of coverage; hence, once KMC coverage reaches over 50%, the currently used national DHS sample size may suffice, as the sketch below illustrates.
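To make this dependence explicit, the following minimal Python sketch applies a standard sample size formula for estimating a proportion to a subgroup-targeted intervention. All inputs (design effect, relative precision, eligible fraction, coverage levels) are illustrative assumptions for demonstration only; they are not the figures behind Table 3.

```python
import math

def required_births(coverage, eligible_fraction, rel_precision=0.2,
                    design_effect=1.5, z=1.96):
    """Approximate number of recent births a survey must capture to estimate
    intervention coverage with a given relative precision, when the
    intervention targets only a subgroup (e.g. newborns <=2000 g).

    coverage          : expected coverage among eligible newborns (0-1)
    eligible_fraction : share of all births that are eligible (the denominator)
    rel_precision     : desired confidence-interval half-width as a fraction of coverage
    design_effect     : assumed inflation factor for cluster sampling
    """
    d = rel_precision * coverage                          # absolute precision
    n_eligible = design_effect * z**2 * coverage * (1 - coverage) / d**2
    return math.ceil(n_eligible / eligible_fraction)      # scale up to all births

# Illustrative values only: ~10% KMC coverage among the ~15% of newborns <=2000 g
print(required_births(coverage=0.10, eligible_fraction=0.15))  # -> 8644 births
# If coverage rises to 50%, far fewer births are needed for the same precision
print(required_births(coverage=0.50, eligible_fraction=0.15))  # -> 961 births
```

Under these assumed inputs, the required number of births falls roughly nine-fold as coverage rises from 10 to 50%, which is why a standard DHS sample may become adequate only once coverage is substantially higher.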
Indicators related to treatment of presumed severe neonatal infection, particularly those related to antibiotic treatment, may be difficult to capture through surveys. Among newborns admitted for treatment of presumed severe infection, we found poor validity of questions about the baby’s diagnosis and treatment, even with short recall periods. Previous studies of survey-reported antibiotic use for childhood illness have shown that these questions perform poorly, and even more poorly with longer recall periods [49]. These studies also found that maternal reports of symptoms of acute respiratory infection do not provide a correct denominator for monitoring antibiotic treatment rates [50].
Admission to a neonatal unit for infection may be a useful contact-point indicator, as women were able to report it with high sensitivity. However, as for KMC, this exit survey question was asked only of women with admitted newborns, and further research is required to validate the indicator in a wider population. Additionally, neonatal infection questions will face sample size issues similar to those for KMC, as the incidence risk of possible severe bacterial infection is estimated at 7.6% [47]. Hospital registers and records may be a better alternative for reporting coverage of interventions for small target groups such as small and sick newborns. Specific registers can be designed to document treatment of infection in neonatal inpatient wards, rather than relying only on individual case record forms [51].
For indicators already present in DHS/MICS, we found that sex of the baby and low birthweight were reported accurately, although birthweight is known to have issues with heaping (preferential reporting of weights ending in 00) [35, 46]. Immediate drying had very high sensitivity but very low specificity, possibly relating to the timing element. Drying was counted as “immediate” when it was observed within 5 min of birth, whereas women were asked, “Was your baby dried or wiped immediately after birth (within a few minutes)?”. In qualitative interviews with women about their understanding of the word “immediate” in questions on immediate newborn care, McCarthy et al. found a wide range of responses, including 1 or 2 min, up to 7 min, and less than 20 min [30]. Other studies have also shown immediate drying to have high sensitivity and low or moderate specificity, alone or as part of a composite indicator with other immediate newborn care [27, 28, 30]. Similar to other validation studies, we found that early initiation of breastfeeding was largely over-estimated by survey report. This over-estimate may be due to poor recall of the timing component when breastfeeding was initiated but not within 1 h [26, 28, 29]. Furthermore, definitions of breastfeeding may differ between clinical observers and breastfeeding women. A woman may have put her baby to the breast and considered this initiation of breastfeeding, whereas an observer may not have recorded initiation unless they observed attachment and suckling, as breastfeeding is a complex and dynamic process [34, 37]. Survey questions on breastfeeding may be more accurate if the focus on timing is removed or shifted to something easier to recall, such as place.
While interventions involving women themselves (e.g. skin-to-skin contact or initiation of breastfeeding) had low levels of “don’t know” responses, questions regarding clinical interventions had high levels of “don’t know” responses. These indicators had lower accuracy in survey reports, even with a very short recall period (exit survey) compared with the 2 to 5-year recall periods expected in population-based surveys. Low accuracy may relate to women not seeing an intervention happen when newborns are separated from their mothers, or to poor communication about care from health care workers. While a study conducted in primary health care facilities in northern Nigeria found high validity for measurement of chlorhexidine application to the newborn’s cord [27], our study showed low validity in these facilities, possibly because chlorhexidine was not applied in front of the mother or because of a lack of communication between health care workers and women. A detailed validation analysis for chlorhexidine application is published elsewhere [38].
We considered “don’t know” replies to most yes/no survey questions as “no”, consistent with DHS reporting [40]. We found, however, that for clinical interventions, observed coverage was high among women who responded “don’t know”. While in our study observed coverage of these clinical interventions was high among all newborns in these facilities, true coverage among women responding “don’t know” to these questions for home births or births in smaller facilities may not be as high. Survey-reported coverage of maternal and newborn care may be more accurate if “don’t know” responses are excluded from both numerators and denominators, as illustrated below.
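As a simple illustration with hypothetical response counts (not study data), the two handling rules can yield markedly different survey-reported coverage from the same responses:

```python
# Hypothetical exit-survey responses to one clinical-intervention question
yes, no, dont_know = 600, 150, 250   # 1,000 women in total (illustrative numbers)

# DHS-style handling: "don't know" counted as "no"
coverage_dk_as_no = yes / (yes + no + dont_know)

# Alternative handling: "don't know" excluded from numerator and denominator
coverage_dk_excluded = yes / (yes + no)

print(f"'don't know' coded as 'no': {coverage_dk_as_no:.1%}")    # 60.0%
print(f"'don't know' excluded:      {coverage_dk_excluded:.1%}") # 80.0%
```

If observed coverage among the “don’t know” group is genuinely high, excluding those responses moves the survey estimate closer to the truth; if it is low, exclusion would over-estimate coverage, which is why this choice matters most for clinical interventions women may not witness.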
Strengths and limitations
Strengths of this study include the large sample of more than 23,000 facility births (> 14,000 exit surveys with women with vaginal births) across five high-burden facilities in three countries in sub-Saharan Africa and South Asia, and the use of direct observation by clinically trained researchers as the gold standard. Errors in observation data collection were minimized by using a custom-built Android application with time-stamping, designed to reduce delays in recording events [52]. Data quality was promoted by refresher training and by subsets of dual observation by supervisors for comparison [34]. While we did not base our assessment of validity on AUC cut-offs, as our indicators were all binary (yes/no), we provide these calculations in Additional files 4 and 10.
This study's limitations include conducting the survey at the time of discharge from hospital, in contrast to the several years after birth typical of population-based surveys. As such, recall bias was minimized in our study, representing a best-case scenario rather than the level of validity captured by population-based surveys. However, because surveys were conducted at the time of discharge, the busy clinical setting may have been distracting and women may have been in a hurry to return home, which differs from the home setting in which population-based surveys take place. Some bias may have been introduced as > 5% of women were discharged before they could be approached for interview. We also note that the results may not be representative of lower-level facilities, since EN-BIRTH was conducted in five high-volume facilities. Additionally, observed coverage of care may have been higher due to the presence of the observer, further limiting generalisability and possibly altering women’s perception and recollection of care received [12]. In this paper we excluded the 6698 women who had caesarean sections. Since caesarean section affects both the practice of care and survey report, results for many of the 33 indicators would need to be split by caesarean and non-caesarean birth, adding further complexity. These important analyses will be undertaken later.
Coverage of the indicators for treatment of presumed severe neonatal infection was reported from data extracted from individual case notes, as observation of admitted neonates throughout the hospital stay was not feasible. It is possible that a specific intervention was given but not documented in the case notes. Despite the large sample, some indicators with very high or very low coverage did not have enough observations in each cell of the two-way table to report individual-level validity statistics. For those indicators, we did not report sensitivity, specificity, AUC or IF, and instead reported percent agreement [12]. Percent agreement should be interpreted cautiously: an indicator with high coverage can show high percent agreement despite high sensitivity and low specificity, and an indicator with very low coverage can show high percent agreement despite low sensitivity and high specificity, as the sketch below illustrates.
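A minimal sketch with hypothetical two-by-two tables (illustrative counts, not study data) shows why percent agreement alone can be misleading at extreme coverage levels:

```python
def validity_metrics(tp, fp, fn, tn):
    """Individual-level validity metrics from a two-by-two table of
    observed (gold standard) versus survey-reported intervention receipt."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "percent_agreement": (tp + tn) / total,
    }

# Hypothetical high-coverage indicator (95% observed coverage):
# most women answer "yes" regardless, so specificity is poor
print(validity_metrics(tp=940, fp=45, fn=10, tn=5))
# sensitivity ~0.99, specificity ~0.10, percent agreement ~0.945

# Hypothetical very low-coverage indicator (5% observed coverage):
# most women answer "no" regardless, so sensitivity is poor
print(validity_metrics(tp=5, fp=10, fn=45, tn=940))
# sensitivity ~0.10, specificity ~0.99, percent agreement ~0.945
```

In both hypothetical cases percent agreement is about 94%, even though one validity component is very weak, which is why agreement was reported only where the full set of individual-level statistics could not be calculated.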
Rates of caesarean section are rising globally [53]. In our study, the caesarean section rate was 29% overall, and as high as 73% in one hospital, Azimpur BD. Women with vaginal births have different experiences from women undergoing caesareans and may experience more separation from their newborns. Caesarean birth negatively affected the accuracy of survey-reported data [34,35,36,37,38]; thus this analysis focused on vaginal births. Further research on care and measurement among women with caesareans in this study is ongoing. Women with stillbirths were included in our survey, and coverage and measurement gaps for stillbirths are shown for specific indicators throughout this supplement series [35, 54]. The majority of women with stillbirths who were approached for the survey consented to participate and responded to questions on labour and birth [54], in line with other research involving women with stillbirths showing high survey completeness [55]. Women with stillbirths should be included in population-based surveys, particularly to inform action to end preventable stillbirths.
Further research is needed to understand whether improved wording of some survey questions, particularly those related to clinical interventions or with a timing component (e.g. early initiation of breastfeeding), can improve accuracy. Research on communication surrounding clinical interventions for newborn care, including care of small and sick newborns, is needed to understand the factors contributing to the accuracy of survey-reported coverage. More qualitative research on women’s understanding of, and recall for, questions related to timing, such as early breastfeeding and immediate drying, would allow question wording or indicator definitions to be improved. More process evaluation is required to better understand and improve aspects of surveys and survey burden.