This study evaluated the screening accuracy of the PHQ-2 to inform the future use of the ICHOM core outcome set for pregnancy and childbirth in clinical practice and research. Two methods of case-identification were used: a scoring method with two cut-points (≥ 2 / ≥ 3) and an alternative dichotomous method. The EPDS was used as the reference standard at recommended cut-points [16, 34, 35] and demonstrated acceptable internal consistency reliability across all four time-points (α = .85–.89). In contrast, the mean inter-item correlations of the PHQ-2 (MIC = .51–.60) were high [42], suggesting some possible overlap between anhedonia and depressive symptoms during the perinatal period.
At the ICHOM recommended PHQ-2 cut-point of ≥ 3, screening accuracy to detect probable major depression was fair during pregnancy (AUC = .77), and poor to good during the postpartum period (AUC = .67–.88). The recommended cut-point demonstrated low sensitivity during pregnancy (57–60%) and postpartum (36–79%) and missed an unacceptably high number of women with probable major depression. Specificity, however, remained high (94–98%), thus minimizing response burden, an identified priority for the ICHOM team [14]. Lowering the cut-point to ≥ 2 achieved the highest screening accuracy for probable major depression of the three methods, demonstrated by good to excellent AUC values at all time-points (AUC = .88–.93) and high sensitivity (86–100%). Further, specificity was moderate throughout (80–89%). In contrast, while the alternative dichotomous method achieved only fair to good diagnostic accuracy (AUC = .79–.86), it did demonstrate the highest overall sensitivity (100% during pregnancy and early postpartum and 93% late postpartum). Specificity was, however, lower than for the other two methods (58–71%). The alternative method thus identified almost all at-risk women but at the cost of increased response burden.
For at least probable minor depression, screening accuracy was highest using the PHQ-2 cut-point of ≥ 2 (AUC = .76–.94) and lowest using the PHQ-2 cut-point of ≥ 3 (AUC = .58–.74). While sensitivity was lowest at a cut-point of ≥ 3 (19–50%), this cut-point demonstrated the highest specificity (95–98%). In contrast, while the alternative dichotomous method had the highest sensitivity (81–100%), its specificity was low (60–74%).
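The three case-identification rules compared above can be written out as a minimal sketch. It assumes the standard PHQ-2 item scoring of 0–3 per item (total 0–6) and treats the dichotomous method as a positive screen when either or both questions are answered "yes". The function names are hypothetical, and how yes/no responses were derived in this study is not specified here, so the dichotomous function simply takes yes/no inputs.

```python
def phq2_scoring_positive(item1: int, item2: int, cut_point: int = 3) -> bool:
    """Scoring method: sum the two items (each 0-3) and compare with a cut-point."""
    for value in (item1, item2):
        if not 0 <= value <= 3:
            raise ValueError("PHQ-2 items are scored 0-3")
    return item1 + item2 >= cut_point


def dichotomous_positive(yes_item1: bool, yes_item2: bool) -> bool:
    """Dichotomous method: 'yes' to either or both questions is a positive screen."""
    return yes_item1 or yes_item2


# Hypothetical respondent scoring 1 on each item (total = 2).
print(phq2_scoring_positive(1, 1, cut_point=3))  # below the ICHOM cut-point
print(phq2_scoring_positive(1, 1, cut_point=2))  # meets the lowered cut-point
print(dichotomous_positive(True, False))         # one 'yes' is enough
```

A respondent with a total of 2 is missed at ≥ 3 but identified at ≥ 2, which illustrates why lowering the cut-point raises sensitivity and lowers specificity.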
The ability to compare our findings was hindered by a lack of comparable perinatal studies. A systematic review that evaluated the accuracy of screening instruments for perinatal women [18] identified sparse evidence pertaining to the PHQ-2. Further, the reporting of two similar tools within the literature compounds the challenge of comparing findings, as the Whooley Questions are often confused with, and referred to as, the PHQ-2 and vice versa [27, 28]. The two tools are also referred to, or combined, as case-finding or case-identification methods [29, 45]. We therefore compared our findings to the literature pertaining to both the PHQ-2 and the Whooley Questions, where appropriate.
Screening accuracy of case-identifying methods
The literature identifies two almost identical case-identifying methods: the PHQ-2 and the Whooley Questions. In terms of the PHQ-2, authors of a systematic review reported a pooled sensitivity of 0.76 and specificity of 0.87 at a cut-point of ≥ 3 in the general population [27]. Consistent with the current study, sensitivity improved at a cut-point of ≥ 2 (sensitivity = 0.91) and specificity decreased (specificity = 0.70). Findings of that review are, however, limited by the largely mixed-gender samples, recruitment from primary/secondary settings and substantial heterogeneity between studies (I² = 81.8%). Only one included study, by Smith and colleagues [46], reported findings for perinatal women. Similarly, a meta-analysis of the diagnostic accuracy of the Whooley Questions to identify major depression included ten studies with community samples [28], including two in peripartum women [45, 47]. Consistent with the current study, Bosanquet [28] reported a pooled sensitivity of 95% (95% CI: 88–97%) and pooled specificity of 65% (95% CI: 55–74%) using the dichotomous response format. Maternity-specific evidence regarding diagnostic accuracy is sparse. Focusing on postpartum depression, Mann and Gilbody [29] identified seven studies reporting either case-identification method (PHQ-2 and Whooley Questions), but only one, which used clinical diagnostic criteria, met the inclusion criteria. The included study by Gjerdingen et al. [47] reported diagnostic accuracy of both the PHQ-2 and Whooley Questions. As a screening tool for major postpartum depression, Gjerdingen reported the Whooley method to have 100% sensitivity, 62% specificity, 11% PPV and 100% NPV, comparing favorably to the findings of the current study for at least probable major depression at six weeks postpartum (sensitivity: 100%, specificity: 71%, PPV: 17%, NPV: 100%). Although Gjerdingen also evaluated the PHQ-2, cut-points were not evaluated, preventing further comparison.
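The low PPV values reported alongside very high sensitivity follow from low prevalence. The sketch below shows the relationship using Bayes' rule. It is illustrative only: the function name is hypothetical, and the prevalence of 6% is an assumption taken from the approximate postpartum figure given under limitations, not a six-week estimate reported in this section.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) from sensitivity, specificity and prevalence (all 0-1)."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    if true_pos + false_pos == 0 or true_neg + false_neg == 0:
        raise ValueError("predictive value undefined for these inputs")
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Sensitivity and specificity as reported for the dichotomous method at six
# weeks postpartum; prevalence of 0.06 is an assumed, approximate value.
ppv, npv = predictive_values(sensitivity=1.00, specificity=0.71, prevalence=0.06)
print(f"PPV = {ppv:.1%}, NPV = {npv:.1%}")
```

Under these assumptions the PPV comes out at roughly 18%, in the same range as the 17% reported, which suggests that the low PPV reflects the small proportion of women with probable major depression rather than a weakness specific to the tool.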
In terms of the Whooley Questions, recent work by Littlewood and colleagues [48] found that the alternative method (positive response to either or both questions) demonstrated acceptable sensitivity and specificity but low predictive value during pregnancy (20 weeks: sensitivity: 85%, specificity: 83.7%, PPV: 37.4%) and postpartum (3–4 months: sensitivity: 85.7%, specificity: 80.6%, PPV: 31.4%), which is broadly consistent with current findings. Current study findings for at least probable minor depression at similar time-points show similar results following birth (26 weeks: sensitivity: 81%, specificity: 74%, PPV: 34%) but greater sensitivity during pregnancy (baseline: sensitivity: 100%, specificity: 69%, PPV: 12%). The disparity in findings likely reflects the difference in reference standards used; the current study used the EPDS, while Littlewood used the Clinical Interview Schedule-Revised [49]. Further, the impact of differences in question wording in terms of time frame (last 2 weeks versus past month) is not yet known.
An important observation in the current study, not previously evaluated by others, is the significant increase in anhedonia during pregnancy and significant decrease following birth, a pattern that was not seen with the PHQ-2 depression question. It is possible that women may experience low mood as pregnancy progresses without a significant change in levels of depressive symptoms. The anhedonia item may then artificially inflate the number of women meeting the screening threshold. Further research to identify the impact of each item during the perinatal period is warranted.
EPDS as a reference standard
Consistent with the current study, the PHQ-2 has been evaluated as a modified dichotomous yes/no screening tool using the EPDS as the reference standard [37, 38]. Using an EPDS cut-off score of ≥ 13 to denote probable major depression during pregnancy (15 weeks, 30 weeks) and postpartum (6–16 weeks), Bennett and colleagues [37] reported the ‘modified version of the PHQ-2’ to have high sensitivity (80–93%) and specificity (75–86%), with sensitivity being highest during pregnancy and lower postpartum. At the same EPDS cut-point (≥ 13), the current study showed slightly higher sensitivity (93–100%) and lower specificity (60–71%). These differences may be attributed to sample characteristics and research methodology. Bennett’s cross-sectional study recruited young, low-income, less educated women with higher rates of depressive symptoms compared to the current study. Of significance, the modified version of the PHQ-2 that Bennett reported evaluating was in fact the Whooley Questions (During the past month …), further demonstrating the inconsistency and potential confusion in instrument reporting.
Consistent with the current study, Chae et al. [38] evaluated the PHQ-2 using the dichotomous method against the EPDS as the reference standard, using a cut-point of ≥ 13, in a sample of women (n = 200) attending well-child clinics in the USA at 6 weeks and 6 months postpartum. Chae reported 100% sensitivity and 79% specificity, which compares well to the 93–100% sensitivity and 70–71% specificity in the current study. More recently, Howard et al. [50] compared the Whooley Questions and the EPDS (cut-point ≥ 13) to clinical diagnostic criteria in a cross-sectional study with 545 women attending their first booking appointment. These authors reported low sensitivity for the Whooley Questions (41%) and EPDS (59%), with a concomitant high specificity (94–95%). While few comparable studies exist for the Whooley Questions, the sensitivity of the EPDS is lower than usually reported and might be explained by the large sample size, the diverse sample of women and delays in administering the diagnostic interview.
Clinical implications
The ICHOM recommended cut-point of ≥ 3 on the PHQ-2 to screen for probable major depression is consistent with recommendations by the scale developer [15], despite being based primarily on a validation study with patients in primary and secondary care. Our findings demonstrate that at this cut-point an unacceptably high number of ‘at-risk’ women were missed and would not have received the follow-up EPDS. While a cut-point of ≥ 2 provided the highest accuracy in terms of area under the curve, and optimal sensitivity and specificity during pregnancy, screening accuracy was lower following birth. Further, in clinical practice, where the identification of high-risk women is crucial, we would argue that the alternative dichotomous method missed the fewest at-risk women and is the most appropriate method of case-identification during the perinatal period. This method also maintained the highest sensitivity even with a lowered EPDS cut-point to denote at least probable minor depression. This finding indicates that the alternative dichotomous method would better identify women with lower levels of depressive symptoms than the scoring method, and so better inform clinical decision-making. While a higher false positive rate is noted with this method, we argue that where universal screening using the EPDS is already practiced, this method could positively impact response burden for many women.
In Australia, government-funded projects are working towards a standardized approach to outcome reporting in maternity-related practice and research to improve data synthesis and outcomes for women and their babies. Our findings offer important evidence and recommendations to support standardization.
Strengths and limitations
Despite our best efforts, our findings do have some limitations. Our study included a cohort of women from one Australian birthing facility and identified a low prevalence of women with probable major depression during pregnancy (around 2%) and postpartum (around 6%), which contributes to less precise estimates. While our prevalence rates are slightly lower than those reported by Gaynes et al. [51] during pregnancy (3.1–4.9%), our findings are comparable to postpartum rates (1.0–5.9%). Recruiting larger, more diverse samples would improve precision of the prevalence estimate and generalizability of findings. We used the EPDS as the reference standard rather than a clinical diagnosis of depression, which would generally be considered the gold standard. However, the aim here was to evaluate the PHQ-2 against the EPDS as would occur in real-life clinical practice using the ICHOM core outcome set, not to diagnose depression. Our findings were, however, strengthened by the comprehensive nature of our analysis, which included using two case-identification methods over four time-points. Further, we applied widely accepted EPDS cut-points to denote both probable major depression and at least probable minor depression.
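The effect of low prevalence on precision can be illustrated with a confidence interval for sensitivity. The sketch below uses the Wilson score interval, which is one common choice and not necessarily the method used in this study. The case counts are hypothetical and chosen only to show how the interval narrows as the number of reference-standard positives grows.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical counts: every case detected (observed sensitivity of 100%)
# with 10 versus 100 women positive on the reference standard.
for cases in (10, 100):
    low, high = wilson_interval(cases, cases)
    print(f"{cases} cases: 95% CI {low:.2f} to {high:.2f}")
```

With only a handful of reference-standard positives, an observed sensitivity of 100% is compatible with a considerably lower true value, which is why larger samples would tighten the estimates.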
Conclusions
ICHOM recommend the PHQ-2 to screen women using a scoring method at a defined cut-point. Using the EPDS as the reference standard, the recommended method was shown to be inadequate for identifying probable perinatal depression. To optimize the number of women identified as ‘at-risk’ in clinical practice, we recommend a dichotomous two-item case-identification method, which is consistent with recommendations of international guidelines [7] and has been shown to be acceptable to women [48]. Further, to address the current confusion surrounding the use and reporting of the case-identification method [27, 28], we recommend the use of the Whooley Questions rather than the PHQ-2. The use and reporting of consistent question wording and response format will improve future outcome reporting and synthesis.