Criteria for considering studies for this review
Studies were selected for inclusion in the review according to the population, index test, target condition, reference standard, outcome measure, and study design.
Population: Singleton pregnancies in unselected or low-risk populations, in studies conducted in health care systems comparable to those of Scandinavia (Northern, Western, and Central Europe, the USA, Canada, Australia, and New Zealand).
Index test: SF measurement compared with the SF distribution of the population.
Target condition: SGA or FGR.
Reference standard: Postnatal diagnosis of FGR or SGA, defined as birth weight (BW) < 10th, 5th, or 3rd percentile, or ≥ one or two standard deviations (SDs) below the mean.
Outcome measure: Data required to populate 2 × 2 contingency tables.
Study design: Diagnostic cohort studies.
Search methods for identification of studies
Electronic databases (PubMed, Medline, Embase, CINAHL, Cochrane Library, and SweMEd) were searched to identify eligible diagnostic studies from the earliest year possible through September 2014. The search strategy was developed for PubMed and modified for use in other databases (see Additional file 1). The reference lists of all included publications and relevant systematic reviews were checked and forward citation searches were performed.
The search strategy involved combinations of SF-related terms appearing in subject headings and as keywords. Our Medline search query was (fund* adj height*) OR (symph* adj fund*) OR (uter* adj height*) OR (symph* adj height*) OR (gravidogram*) OR (uterus fundus height*) OR (uter* fund* height*). We conducted our search and reported our findings according to the Meta-Analysis of Observational Studies in Epidemiology and Preferred Reporting Items for Systematic Reviews and Meta-Analyses statements [16-18].
Data collection and analysis
A list of articles meeting the inclusion criteria based on abstracts was compiled. The full texts of these studies and those of uncertain relevance were retrieved. Two reviewers (ASDP and JW) independently evaluated the studies’ fulfillment of the inclusion criteria, with any discrepancy discussed with a third reviewer until a final set of relevant studies was agreed upon.
Data extraction and management
The following data were extracted from all selected studies: general information (first author, publication year, country of investigation), population (health care setting, number of participants, level of risk), study design (design, data collection), characteristics of the SF height test (SF height curve, cut-off points), reference standard (SGA definition), and results (data required for the construction of 2 × 2 contingency tables). Data were entered into a database using Review Manager 5.3 software.
Assessment of methodological quality
The quality of each included study was assessed by two review authors (ASDP, JW) using the QUality Assessment of Diagnostic Accuracy Studies (QUADAS-2) checklist [19,20]. The QUADAS-2 checklist poses signaling questions in four domains: patient selection, index test, reference standard, and flow and timing. Each domain is assessed in terms of risk of bias, and the first three domains are also assessed in terms of applicability. The review authors classified each item as “yes” (adequately addressed), “no” (inadequately addressed), or “unclear” (insufficient detail presented to allow a judgment to be made). The QUADAS-2 tool is shown in Additional file 2.
Statistical analysis and data synthesis
Data on sensitivity, specificity, and true-positive, false-positive, true-negative, and false-negative results were taken directly from the source papers or, if necessary, calculated from the data provided. Positive likelihood ratios (PLRs), negative likelihood ratios (NLRs), diagnostic odds ratios (DORs), and 95% confidence intervals (CIs) were calculated.
An LR describes how many times more likely it is that a person with the target condition will receive a particular test result than will a person without it. Categorization of LRs was adopted from Deeks et al., where PLRs > 10 or NLRs < 0.1 are considered to provide convincing diagnostic evidence. The DOR is commonly used as an overall indicator of diagnostic performance and is calculated as the odds of a positive test result among those with the target condition, divided by the odds of a positive test result among those without the condition. As a general rule, a DOR > 100 indicates high accuracy, values of 25–100 indicate moderate accuracy, and values < 25 indicate that the test is not useful.
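The definitions above can be illustrated with a minimal sketch that derives the accuracy measures from a single 2 × 2 table. The function name, the example counts, and the choice of log-scale (Wald-type) 95% CIs are assumptions made for illustration; the review does not state which CI method was used, and the sketch assumes that all four cell counts are non-zero.

```python
import math

def diagnostic_accuracy(tp, fp, fn, tn, z=1.96):
    """Accuracy measures from one 2 x 2 table (hypothetical helper).

    Assumes all four cells are non-zero; CIs for the ratio measures use
    the standard error on the log scale (an assumed, common approach).
    """
    sens = tp / (tp + fn)          # proportion of cases testing positive
    spec = tn / (tn + fp)          # proportion of non-cases testing negative
    plr = sens / (1 - spec)        # positive likelihood ratio
    nlr = (1 - sens) / spec        # negative likelihood ratio
    dor = (tp * tn) / (fp * fn)    # diagnostic odds ratio (equals PLR / NLR)

    # Standard errors of the log-transformed ratios
    se_plr = math.sqrt(1 / tp - 1 / (tp + fn) + 1 / fp - 1 / (fp + tn))
    se_nlr = math.sqrt(1 / fn - 1 / (tp + fn) + 1 / tn - 1 / (fp + tn))
    se_dor = math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)

    def ci(estimate, se):
        # Back-transform the interval from the log scale
        return (estimate * math.exp(-z * se), estimate * math.exp(z * se))

    return {
        "sensitivity": sens,
        "specificity": spec,
        "PLR": (plr, ci(plr, se_plr)),
        "NLR": (nlr, ci(nlr, se_nlr)),
        "DOR": (dor, ci(dor, se_dor)),
    }

# Illustrative counts only, not data from any included study
result = diagnostic_accuracy(tp=30, fp=70, fn=70, tn=830)
```

With these illustrative counts, the PLR falls below 10 and the DOR below 25, so under the thresholds quoted above the hypothetical test would not be classed as providing convincing evidence.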
The data were displayed graphically on forest and summary receiver operating characteristic (SROC) plots. The SROC curve was fitted using the hierarchical bivariate random-effects method. For studies that used more than one SF threshold, the analysis was based on the cut-off point of “one value < 10th percentile”.
Investigation of heterogeneity
Both clinical and statistical heterogeneity were evaluated. Assessment of clinical heterogeneity involved comparison of SF reference curves, cut-off criteria used to identify abnormal results, and SGA definitions. Assessment of statistical heterogeneity involved visual inspection of forest plots and calculation of the inconsistency index (I²), which describes the percentage of total variation across studies that is due to heterogeneity rather than chance.
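As a rough illustration of the inconsistency index, the sketch below computes I² from Cochran's Q using inverse-variance weights. The function name and the input values are hypothetical, and applying it to per-study estimates on a log scale (for example, log DORs) is an assumption; the review does not specify the effect measure to which I² was applied.

```python
def i_squared(estimates, variances):
    """I² (%) from Cochran's Q with inverse-variance weights (hypothetical helper)."""
    weights = [1 / v for v in variances]
    # Fixed-effect pooled estimate
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    if q <= 0:
        return 0.0
    # Share of total variation beyond that expected by chance, floored at 0
    return max(0.0, (q - df) / q) * 100

# Illustrative values only, not data from any included study
identical = i_squared([1.0, 1.0, 1.0], [0.1, 0.2, 0.3])
divergent = i_squared([0.0, 2.0], [0.1, 0.1])
```

Identical study estimates give an I² of 0%, whereas widely separated estimates with small variances push the value toward 100%.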