Comparison of screening accuracy of the Patient Health Questionnaire-2 using two case-identification methods during pregnancy and postpartum
BMC Pregnancy and Childbirth volume 20, Article number: 211 (2020)
Variation exists regarding perinatal depression screening. A two-step screening method has been recommended. According to a maternity-focused core outcome set developed by the International Consortium for Health Outcomes Measurement, women who score 3 or more on the Patient Health Questionnaire-2 (PHQ-2) then complete the Edinburgh Postnatal Depression Scale (EPDS). Limited evidence exists regarding the screening accuracy of the PHQ-2 in childbearing women. An alternative case-identification method may be more sensitive for perinatal women. We aimed to (1) evaluate the screening accuracy of the PHQ-2 during the perinatal period using two case-identification methods, and (2) measure the variability of accuracy over four time-points during pregnancy and postpartum.
A prospective, longitudinal cohort study was conducted with 309 consecutive women who completed the PHQ-2 and EPDS during pregnancy (booking, 36-weeks) and postpartum (6-, 26-weeks). EPDS was the reference standard using cut-off scores for ‘at least probable minor depression’ during pregnancy (≥ 13) and postpartum (≥ 10) and for ‘probable major depression’ during pregnancy (≥ 15) and postpartum (≥ 13). PHQ-2 was analysed using two methods: (1) scored (cut-points ≥ 2 and ≥ 3), and (2) dichotomous yes/no (positive response to either question), each against EPDS cut-points for at least probable minor and probable major depression. Receiver operating characteristic analyses determined accuracy.
Probable major depression: Over four timepoints PHQ-2 ≥ 3 revealed lowest sensitivity (36–79%) but highest specificity (94–98%). An alternative case-identification method revealed high sensitivity (93–100%), but lowest specificity (58–71%). Minor depression: PHQ-2 ≥ 3 revealed the lowest sensitivity (19–50%) but highest specificity (95–98%). An alternative case-identification method revealed the highest sensitivity (81–100%) and moderate specificity (60–74%).
The recommended method of case-identification (PHQ-2 ≥ 3) missed an unacceptable number of women at risk of depression. As a clinical decision-making tool, an alternative, dichotomous method maximized case-identification and is recommended. Further, the literature showed inconsistent reporting of the PHQ-2 and the alternative case-identification method, which hinders the ability to synthesise data. The future use and reporting of consistent question wording and response format will improve outcome reporting and synthesis. Further research in larger and diverse maternity populations is recommended.
In Australia, around one in ten women experience depression during pregnancy and one in six during the year following birth. Untreated maternal depression has been consistently associated with poorer outcomes for infants including impaired attachment and cognitive deficits, with effects still adversely impacting on children at age 16, especially boys. If not addressed, perinatal depression can create intergenerational difficulties. In extreme cases, women may attempt or complete suicide or infanticide [5, 6]. The high burden of perinatal depression demands effective strategies to prevent and improve symptoms. Significant variation in depression outcomes, measures, and case definitions in comparative effectiveness trials limits data synthesis, and contributes to significant research wastage in perinatal research.
To address such issues the Core Outcome Measures in Effectiveness Trials (COMET) and the Core Outcomes in Women’s and Newborn health (CROWN) Initiatives advocate a standardized research approach using core outcome sets. A core outcome set is an agreed set of outcomes that should be measured and reported, as a minimum, in all clinical trials of a specific health condition or area of health care. In 2016 the International Consortium for Health Outcomes Measurement (ICHOM) published a standard set of outcome measures to evaluate value in maternity care. Standard sets are the same as core outcome sets but have a clear focus on clinical practice. An ICHOM working party comprising two consumers and nineteen international experts convened to identify outcomes and measurement instruments for inclusion in their set. Using a modified Delphi technique and consensus process, mental health was identified as an outcome important to women, and the Patient Health Questionnaire (PHQ-2) and Edinburgh Postnatal Depression Scale (EPDS) were identified as the most appropriate measures of symptoms of perinatal depression to be included in the set.
ICHOM [12, 14] recommends the two-item PHQ-2 as a case-identification method for all women, followed by the 10-item EPDS only for women who screen positive on the PHQ-2 (defined cut-point of 3 or more). The sensitivity and specificity of the PHQ-2 to identify childbearing women at risk of depressive symptoms, as defined by accepted cut-points on the EPDS, is under-researched. As part of a core outcome set designed for use in clinical practice, high sensitivity on the PHQ-2 is vital to ensure all at-risk women also receive the EPDS. While probable major depression is the focus of ICHOM’s recommendation, minor depressive symptoms are also linked to poor quality of life and are important to consider during clinical decision-making.
Clinical recommendations for depression screening
Screening women for depression during the perinatal period may reduce depressive symptoms and prevalence, but international consensus regarding the best approach is lacking. In Australia, where the current study was conducted, a universal screening approach is recommended at least twice during pregnancy (early and late pregnancy), and at least twice in the first postpartum year. While a universal screening approach is also recommended in the United States and Canada, the United Kingdom (UK) recommends selective screening adjunct to clinical practice. Like the ICHOM approach, UK clinicians ask two case-identification questions at first contact during pregnancy and again during the early postpartum period, with further assessment only for women who respond positively to either question.
Relevant depression screening instruments
The EPDS is the most widely-used, evaluated and validated measure of depressive symptoms [22, 23, 24]. Most clinical guidelines, including those in Australia, recommend the use of the EPDS, either as a primary [19, 21] or second-step screen to inform clinical decision making. In terms of case-identification methods, two questions originating from the PRIME-MD diagnostic interview are asked:
During the past month, have you been bothered by little interest or pleasure in doing things?
During the past month, have you been bothered by feeling down, depressed or hopeless?
The questions can be asked using two formats that differ in terms of timing and response format. When framed to recall symptoms over the past month, using a ‘yes’ or ‘no’ response, the questions are known as the Whooley Questions. In contrast, when framed to recall over the past 2 weeks, using a four-option response, the questions are known as the Patient Health Questionnaire-2. While ICHOM recommends the use of the PHQ-2 for case-identification, guidance in the UK recommends the Whooley Questions approach. Evidence regarding the diagnostic accuracy of the two approaches comes from two systematic reviews conducted in general populations. Findings from Manea et al. showed the PHQ-2 to have moderate sensitivity (76%) at the recommended cut-point of three or more, which improved to 91% at a lower cut-point. Bosanquet and colleagues showed the Whooley Questions approach to have the highest sensitivity (95%). In terms of maternity-focused evidence, a review conducted by Mann and Gilbody included both case-finding methods (PHQ-2 and Whooley Questions) to detect postpartum depression. With only one included paper, these authors concluded that there was limited evidence in support of the case-finding questions to detect postpartum depression and that more research was needed. To inform the ongoing implementation of the ICHOM core outcome set in clinical practice, the current study aimed to: (1) evaluate the screening accuracy of the PHQ-2 during the perinatal period using two case-identification methods (reference standard: EPDS), and (2) measure the variability of accuracy over four time-points during pregnancy and postpartum.
The current study is part of a larger body of work. The MoMeNT study (Models Meeting Needs over Time) is a prospective, longitudinal, cohort study which aimed to (1) evaluate the effectiveness of midwife continuity of carer on perinatal mental health and mother-infant bonding and (2) assess the feasibility of the ICHOM core outcome set in the Australian context; it is fully described elsewhere. Feasibility of the ICHOM core outcome set includes the psychometric evaluation of included measures. The current study is designed to address our feasibility aim and is reported in accordance with STARD (Standards for Reporting of Diagnostic Accuracy) criteria [31, 32]; see STARD Checklist [Additional File 1].
Setting, participants and sample size
Participants were recruited from one publicly-funded tertiary referral hospital in south-east Queensland. Participants were required to be English-literate, aged 18 years or older, 27-weeks gestation or less and have access to email and mobile phone. Women under the current care of a psychiatrist were excluded. Data collection took place between August 2017 and January 2019. Sample size was calculated to evaluate the broad effect of model of maternity care on maternal health and wellbeing. To identify a mean difference (two-tail) with a 50% effect size, 5% estimated error and 95% power, 210 participants were required. To allow for 20% attrition 252 participants were needed.
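The sample size arithmetic above can be sketched as follows. This is an illustration only: it assumes a two-group comparison of means and uses a normal approximation, which lands slightly below the reported 210 (an exact t-based calculation typically adds about one participant per group). The function names are hypothetical, and the attrition step follows the paper's arithmetic of inflating the required sample by 20%.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.95):
    """Normal-approximation sample size per group for a two-sided
    comparison of two means (assumption: two equal-sized groups)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


def inflate_for_attrition(n, attrition_pct=20):
    """Inflate a required sample by an expected attrition percentage."""
    return math.ceil(n * (100 + attrition_pct) / 100)


approx_total = 2 * n_per_group(0.5)        # normal approximation, both groups
recruit_target = inflate_for_attrition(210)  # reported requirement plus 20%
print(approx_total, recruit_target)
```

Inflating the reported 210 participants by 20% reproduces the stated recruitment target of 252.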
Defining perinatal mental health
Depression was operationalized using the definitions outlined by the American Psychiatric Association. Depression describes the presence of sad, empty, irritable mood, accompanied by somatic and cognitive changes that significantly affect the individual’s capacity to function. Anhedonia describes markedly diminished interest or pleasure in almost all activities. Major depression is defined as the presence of five or more symptoms (depressed mood, anhedonia, weight change, sleep disturbance, psychomotor problems, lack of energy, excessive guilt, poor concentration, suicidal ideation) during the same 2-week period, representing a change from previous functioning, with at least one of the symptoms being either depressed mood or loss of interest or pleasure. Minor depression is defined as the presence of at least two depressive symptoms without meeting criteria for major depression.
Surveys included the PHQ-2 and EPDS. Woman-reported socio-demographic data known to influence maternal mental health were collected at baseline including age (years), parity (number of births after 20 weeks gestation), gestation (weeks), educational attainment (low: secondary school year 12 or less; high: completed apprenticeship, diploma or tertiary degree), relationship status (single or in a relationship), and weekly combined income (low: less than $1500; high: $1500 or more).
Patient Health Questionnaire (PHQ-2)
The PHQ-2 screens for possible depression and anhedonia. The stem question asks, “Over the last 2 weeks, how often have you been bothered by any of the following problems?” The two items are, “little interest or pleasure in doing things” and “feeling down, depressed, or hopeless”. For each item, the response options are “not at all” (0), “several days” (1), “more than half the days” (2), and “nearly every day” (3). PHQ-2 scores range from 0 to 6. Higher scores represent greater depressive symptoms.
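A minimal sketch of the PHQ-2 scoring and the three case-identification rules examined in this study follows. The function names are hypothetical. One assumption is labelled in the code: the text does not spell out how the four-option responses map to a "positive response" for the dichotomous method, so any response other than "not at all" is treated as positive here.

```python
def phq2_total(q1, q2):
    """Sum the two PHQ-2 items, each scored 0 (not at all) to 3 (nearly every day)."""
    for item in (q1, q2):
        if item not in (0, 1, 2, 3):
            raise ValueError("each PHQ-2 item must be scored 0, 1, 2 or 3")
    return q1 + q2


def phq2_flags(q1, q2):
    """Apply the scored cut-points and the dichotomous rule to one response."""
    total = phq2_total(q1, q2)
    return {
        "total": total,
        "score_ge_3": total >= 3,   # ICHOM recommended cut-point
        "score_ge_2": total >= 2,   # lowered cut-point
        # ASSUMPTION: a "positive response" means any answer above "not at all".
        "either_item_positive": q1 > 0 or q2 > 0,
    }


print(phq2_flags(1, 0))
```

Under that assumption, the dichotomous rule is equivalent to a total score of 1 or more, which is why it flags more women than either scored cut-point.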
The Edinburgh Postnatal Depression Scale (EPDS)
The 10-item EPDS is a self-report measure to screen women for symptoms of depression during pregnancy and postpartum. Questions are framed, ‘In the past 7 days …’, with a frequency-based response on a four-point Likert scale scored from 0 to 3. Recommended cut-off scores for ‘at least probable minor depression’ during pregnancy (≥ 13) and postpartum (≥ 10) and for ‘probable major depression’ during pregnancy (≥ 15) and postpartum (≥ 13) were used. EPDS question items and screening accuracy at several cut-points as reported by NICE are presented [Additional file 2].
Consecutive, eligible women attending antenatal care with a midwife were approached about the study. Women who provided written informed consent were sent a survey link by email and text message. Women who failed to respond were sent two friendly reminders and one telephone follow-up call, 2–3 days apart. Follow-up surveys were sent at 36-weeks, and 6- and 26-weeks postpartum. Women who failed to respond to two consecutive surveys were deemed lost to follow-up. All surveys provided information and contact numbers for national support groups. Participants were also offered the opportunity to contact the project manager to discuss any negative feelings. Survey completion occurred outside of clinical care, where universal screening for depression is attended. Women who screened positive were not followed up but helpline information was provided in each survey. Ethical approval was granted by the relevant Hospital and Health Service (HREC/17/QGC/127) and University Human Research Ethics Committees (GU Ref No: 2017/625).
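The period-specific EPDS cut-offs can be written down as a small lookup, sketched below. The dictionary and function names are hypothetical; the cut-off values are those stated above.

```python
# EPDS cut-offs used as the reference standard, by period.
EPDS_CUTOFFS = {
    "pregnancy": {"minor": 13, "major": 15},
    "postpartum": {"minor": 10, "major": 13},
}


def epds_flags(total, period):
    """Classify an EPDS total (0-30) as at least probable minor and/or
    probable major depression for the given period."""
    if not 0 <= total <= 30:
        raise ValueError("EPDS total must be between 0 and 30")
    cut = EPDS_CUTOFFS[period]
    return {"minor": total >= cut["minor"], "major": total >= cut["major"]}


print(epds_flags(13, "pregnancy"))
print(epds_flags(13, "postpartum"))
```

The same total of 13 meets only the minor threshold during pregnancy but both thresholds postpartum, which is worth keeping in mind when comparing results across time-points.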
Approach to analysis
Using SPSS version 25, the PHQ-2 was analysed using two methods: (a) total score at two cut-points (≥ 2 and ≥ 3) and (b) dichotomous categorical variable (positive response to either or both questions). Consistent with the work of others [37, 38], the EPDS was the reference standard. EPDS total score was transformed to a binary variable to indicate at least probable minor depression using the cut-off score of ≥ 13 = positive during pregnancy, and ≥ 10 = positive postpartum. Cut-off scores of ≥ 15 and ≥ 13 were used to denote probable major depression during pregnancy and postpartum respectively. Group differences in mental health during pregnancy and postpartum, and for non-completing women at 26-weeks, were assessed using chi-square. Effect size was interpreted using Cohen’s criteria. Missing data were managed using listwise deletion for computing total scale scores and Cronbach’s α, and pairwise deletion for all other analyses. Prevalence of depression risk according to the EPDS and positive PHQ-2 responses are presented as frequencies and percentages with 95% confidence intervals using Clopper-Pearson exact tests for binary probability. Changes in the proportion of participants who screened positive on the EPDS and PHQ-2 across the four time-points (variability of accuracy) were assessed using Cochran’s Q Test. Significance values were adjusted by the Bonferroni correction for multiple tests. Significance was set at p < .05. For brevity the term ‘minor depression’ is used to denote ‘at least probable minor depression’ and ‘major depression’ is used to denote ‘probable major depression’.
The internal consistency of the EPDS and PHQ-2 was assessed at all four time-points. Cronbach’s alpha coefficients (α) exceeding 0.7 were considered acceptable. For scales with few items the mean inter-item correlation (MIC) is more accurate and is reported for the PHQ-2. An ideal range was considered 0.2–0.4.
Screening performance of the PHQ-2, defined as ‘area under the curve’ (AUC), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (PLR) and negative likelihood ratio (NLR), was assessed against the EPDS total score cut-points using ROC (Receiver Operating Characteristic) analysis at all four time-points. The standard error for the area was set as non-parametric with a 95% confidence interval. AUC was interpreted according to the criteria by Tape as: AUC = 0.60–0.70 = poor, 0.70–0.80 = fair, 0.80–0.90 = good, 0.90–1.0 = excellent. For the purposes of informing the ICHOM core outcome set and achieving a maximum likelihood that all ‘at risk’ women would be administered the EPDS, the optimal cut-off value on the PHQ-2 was set at 100% sensitivity.
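For readers unfamiliar with the Clopper-Pearson exact interval, the sketch below shows one way to compute it without statistical libraries, by solving the binomial tail equations with bisection. It is not the SPSS procedure used in the study, the helper names are hypothetical, and the counts in the usage line are illustrative rather than taken from the study tables.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve(f, target, iters=100):
    """Bisection on [0, 1] for an increasing function f with f(p) = target."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for x positives out of n."""
    # Lower limit: P(X >= x | p) = alpha / 2 (defined as 0 when x == 0).
    lower = 0.0 if x == 0 else _solve(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2)
    # Upper limit: P(X <= x | p) = alpha / 2 (defined as 1 when x == n).
    upper = 1.0 if x == n else _solve(
        lambda p: 1 - binom_cdf(x, n, p), 1 - alpha / 2)
    return lower, upper


# Illustrative counts only: 7 positive screens among 309 respondents.
print(clopper_pearson(7, 309))
```

Because the interval is exact rather than approximate, it remains valid for the small proportions seen here, where a normal-approximation interval could extend below zero.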
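The two reliability statistics named above can be sketched as follows, assuming complete item-level responses (listwise deletion already applied). The function names and the four-respondent data are hypothetical. For a two-item scale such as the PHQ-2, the mean inter-item correlation reduces to the single Pearson correlation between the two items.

```python
from itertools import combinations
from statistics import mean, variance


def cronbach_alpha(items):
    """Cronbach's alpha; `items` is a list of per-item score lists,
    each holding one score per respondent in the same order."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mean_inter_item_correlation(items):
    """Average Pearson correlation over all pairs of items."""
    return mean(pearson_r(a, b) for a, b in combinations(items, 2))


# Hypothetical PHQ-2 item scores for four respondents.
q1 = [0, 1, 2, 3]
q2 = [0, 1, 3, 2]
print(cronbach_alpha([q1, q2]), mean_inter_item_correlation([q1, q2]))
```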
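As an illustration of how these measures relate to a 2x2 cross-tabulation of PHQ-2 result against EPDS result, a minimal sketch follows. The function names and counts are hypothetical. The AUC line relies on a general property: for a test dichotomised at a single cut-point, the ROC curve has one interior point, so the trapezoidal area equals the mean of sensitivity and specificity. The AUC labels follow the bands listed above, with lower bounds assumed inclusive.

```python
def screening_metrics(tp, fp, fn, tn):
    """Accuracy measures from a 2x2 table (index test vs reference standard).
    Assumes every margin of the table is non-empty."""
    sens = tp / (tp + fn)   # reference-positive women who screen positive
    spec = tn / (tn + fp)   # reference-negative women who screen negative
    fpr = fp / (fp + tn)    # false-positive rate, 1 - specificity
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "plr": sens / fpr if fpr > 0 else float("inf"),
        "nlr": (1 - sens) / spec,
        # Single cut-point: area under the two-segment ROC curve.
        "auc": (sens + spec) / 2,
    }


def interpret_auc(auc):
    """Label an AUC using the bands cited in the text (lower bound inclusive)."""
    if auc >= 0.90:
        return "excellent"
    if auc >= 0.80:
        return "good"
    if auc >= 0.70:
        return "fair"
    if auc >= 0.60:
        return "poor"
    return "below 0.60 (not labelled in the text)"


m = screening_metrics(tp=8, fp=10, fn=2, tn=90)  # hypothetical counts
print(m, interpret_auc(m["auc"]))
```

This relationship appears consistent with the reported results: at 6-weeks postpartum the cut-point of ≥ 3 had 79% sensitivity and a 3.4% false-positive rate for probable major depression, and (0.79 + 0.966) / 2 is approximately .88, the AUC reported for that time-point.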
A STARD diagram presents data pertaining to recruitment, attrition and cross-tabulation of index test/reference standard (Fig. 1). The first survey was commenced by 309 pregnant women between 10 and 27-weeks gestation (M = 19.7, SD = 3.7). Table 1 presents group differences for women exceeding the EPDS cut-point for at least probable minor depression during pregnancy and postpartum and those below the cut-point. In early pregnancy, women who exceeded the cut-point (minor depression) were more likely to report a past history of mental health disorder (18.6% vs 1.5%, medium effect), report current cigarette use (37.5% vs 3.1%, medium effect) and lower income (7.8% vs 1.9%, small effect), compared to their non-depressed counterparts. At 26-weeks postpartum, only a history of mental health disorder remained significant (33.3% vs 10.8%, medium effect). There were no significant differences between women who remained in the study and those who did not (Table 1).
Internal consistency reliability
The internal consistency reliability of the EPDS was: α = .85 during pregnancy at both time-points (baseline and 36-weeks), and α = .89 and .86 following birth (6- and 26-weeks). The PHQ-2 demonstrated high mean inter-item correlations (MIC) both during pregnancy (MIC = .55 and .59) and postpartum (MIC = .60, and .51).
Incidence of EPDS and PHQ-2 positive screen tests
Table 2 presents the frequency and percentage (with 95% CI) of women with probable major depression and at least probable minor depression (EPDS) and positive screens (PHQ-2) at four time points over pregnancy (baseline, 36-weeks) and postpartum (6-, 26-weeks). The incidence of probable major depression was 2.3 and 1.8% during pregnancy (EPDS ≥ 15) and 5.6 and 6.2% postpartum (EPDS ≥ 13). The incidence of at least probable minor depression was 4.0 and 4.4% during pregnancy (EPDS ≥ 13) and 13.1 and 14.1% postpartum (EPDS ≥ 10). Positive screens on the PHQ-2 were highest using the alternative dichotomous (yes/no) method (incidence ranging 32.9–42.9%) and lowest using a cut-point of ≥ 3 (incidence ranged 4.4–7.5%) (Table 2).
Figure 2 presents data in a visual format. Fig. 2a shows probable minor depression (EPDS) was relatively stable during pregnancy and increased following birth (6-weeks). Cochran’s Q Test revealed the difference was significant (n = 207, Q = 34.82, df 3, p < .001). Pairwise comparisons confirmed the difference was significant from 36-weeks of pregnancy to 6-weeks postpartum (p < .001). Similar results were seen for major depression (n = 207, Q = 11.05, df 3, p = .01). Pairwise comparisons showed the difference was again seen between 36-weeks of pregnancy and 6-weeks postpartum (p = .01). No significant differences were observed for change in proportions of participants who screened positive for PHQ-2, regardless of method.
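Cochran's Q compares the proportion of positive screens across repeated binary measurements on the same women. The sketch below uses hypothetical data and function names and is not the SPSS routine used in the study. The p-value helper uses a closed form of the chi-square survival function that holds only for 3 degrees of freedom, which matches the four time-points here.

```python
from math import erfc, exp, pi, sqrt


def cochran_q(rows):
    """Cochran's Q for a list of per-woman rows of 0/1 results,
    one column per time-point. Returns (Q, degrees of freedom)."""
    k = len(rows[0])
    col_totals = [sum(col) for col in zip(*rows)]
    row_totals = [sum(row) for row in rows]
    total = sum(row_totals)
    denom = k * total - sum(r * r for r in row_totals)
    if denom == 0:
        raise ValueError("Q is undefined when no woman changes status")
    q = (k - 1) * (k * sum(c * c for c in col_totals) - total**2) / denom
    return q, k - 1


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with exactly 3 df."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)


# Hypothetical screens for five women at four time-points.
rows = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
q, df = cochran_q(rows)
print(q, df, chi2_sf_df3(q))
```

Applying `chi2_sf_df3` to the statistic reported for major depression (Q = 11.05, df 3) gives a p-value of roughly .01, in line with the value in the text.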
Mood and anhedonia during pregnancy and postpartum (PHQ-2)
Figure 2b shows the incidence of anhedonia (PHQ-2: positive Q1, negative Q2) increased during pregnancy before reducing sharply in the early weeks following birth. Cochran’s Q Test revealed the difference was significant (n = 207, Q = 14.52, df 3, p = .002). Pairwise comparisons confirmed the difference was significant between booking and 36-weeks of pregnancy (p = .01), and between 36-weeks of pregnancy and 6-weeks postpartum (p = .01). No significant difference was seen for depressed mood (positive Q2, negative Q1), or for depressed mood and anhedonia (positive Q1 and Q2).
Screening accuracy: ROC analyses
Table 3 presents findings of the ROC analysis for the PHQ-2 to detect probable major depression during pregnancy (EPDS ≥ 15) and postpartum (EPDS ≥ 13) and at least probable minor depression during pregnancy (EPDS ≥ 13) and postpartum (EPDS ≥ 10). At the ICHOM recommended cut-point of ≥ 3 for major depression the AUC was fair during pregnancy (AUC = .77), good in early postpartum (AUC = .88) and poor in late postpartum (AUC = .67). Reducing the cut-point to ≥ 2 improved the diagnostic accuracy (AUC = .88–.93) as did the dichotomous method (AUC = .79–.86). For minor depression, the PHQ-2 cut-off ≥ 3 was poor to fair (AUC = .58–.74). Reducing the cut-point to ≥ 2 improved diagnostic accuracy (AUC = .76–.94) as did the alternative dichotomous method (yes/no) (AUC = .78–.84).
Antepartum probable major depression: PHQ-2 screening accuracy
Table 4 presents the screening accuracy of the PHQ-2 at 4 time-points for probable major depression. During pregnancy (booking, 36-weeks) the ICHOM recommended PHQ-2 cut-point of ≥ 3, correctly classified 57 and 60% of women with probable major depression but consequently missed 43 and 40% of at-risk women. Some 3.4 and 6.3% of women with no probable major depression would have been asked to complete the EPDS. A lowered PHQ-2 cut-point correctly classified 100% of women with probable major depression at both booking and 36-weeks. However, 14.2 and 19.6% of women under the threshold for probable major depression would have been asked to complete the EPDS. Using the alternative categorical approach (positive response to either Q1 or Q2), though 100% of women with probable major depression were correctly identified, the EPDS would have been administered to 32.4 and 41.9% of women under the threshold. Among those who screened positive on the PHQ-2, the probability of having probable major antepartum depression (EPDS ≥ 15) at booking was greatest using a cut-point ≥ 3 and lowest using the alternative method (PPV = 29% vs 7%, respectively). Similar results were seen at 36-weeks (PPV = 15% vs 4%) (Table 4).
Postpartum probable major depression: PHQ-2 screening accuracy
Following birth (6-weeks, 26-weeks), the PHQ-2 cut-point of ≥ 3 correctly classified 79 and 36% of women with probable major depression but missed 21 and 64% of high-risk women. Only 3.4 and 2.3% of women under the threshold for probable major depression would have been asked to complete the EPDS. At a lowered cut-point of ≥ 2, 100% of women with probable major depression were correctly classified at 6-weeks. While 86% of women with probable major depression were correctly classified at 26-weeks, 14% of high-risk women were missed. At the lowered cut-point 15.1 and 10.8% of women with no probable major depression would have been asked to complete the EPDS. While the alternative method achieved the highest overall sensitivity following birth, correctly classifying 100% of women at 6-weeks and 93% of women at 26-weeks, some 29 and 30% of women with no probable major depression would have been asked to complete the EPDS. Among those who screened positive on the PHQ-2, the probability of having major postpartum depression (EPDS ≥ 13) at 6-weeks was greatest using a cut-point ≥ 3 and lowest using the alternative categorical method (PPV = 58% vs 17%, respectively). Similar results were seen at 26-weeks (PPV = 50% vs 17%) (Table 4).
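The low predictive values above follow largely from the low prevalence of probable major depression. A short sketch using Bayes' rule makes the dependence explicit; the function name is hypothetical, and the inputs are the rounded 6-week figures reported in the text (prevalence 5.6%; sensitivity and false-positive rates as stated above), so results agree with the tables only to within rounding.

```python
def ppv_from_prevalence(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# 6-weeks postpartum, probable major depression (EPDS >= 13).
ppv_alternative = ppv_from_prevalence(1.00, 0.71, 0.056)   # dichotomous method
ppv_cut_3 = ppv_from_prevalence(0.79, 0.966, 0.056)        # PHQ-2 >= 3
print(round(ppv_alternative, 2), round(ppv_cut_3, 2))
```

With these inputs the dichotomous method yields a PPV of about 17% and the cut-point of ≥ 3 about 58%, matching the reported values: even a perfectly sensitive screen produces mostly false positives when fewer than 6% of women exceed the reference threshold.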
Antepartum probable minor depression: PHQ-2 screening accuracy
Table 5 presents the screening accuracy of the PHQ-2 at 4 time-points for at least probable minor depression. During pregnancy (booking, 36-weeks) a PHQ-2 cut-point of ≥ 3 correctly classified 50% of women with probable minor depression at both time-points but 50% of at-risk women were missed. Some 2.7 and 5.3% of low-risk women would have been asked to complete the EPDS unnecessarily. A lowered PHQ-2 cut-point correctly classified 100% of women with probable minor depression at both booking and 36-weeks. However, 12.7 and 17.5% of low-risk women would have been asked to complete the EPDS. Using the alternative categorical approach (positive response to either Q1 or Q2), though 100% of women with probable minor depression were correctly identified, the EPDS would have been administered to 31.2 and 40.3% of women below the EPDS threshold for probable minor depression. Among those who screened positive on the PHQ-2, the probability of also screening positive on the EPDS for at least probable minor antepartum depression (EPDS ≥ 13) at booking was greatest using a cut-point ≥ 3 and lowest using the alternative method (PPV = 43% vs 12%, respectively). Similar results were seen at 36-weeks (PPV = 30% vs 10%) (Table 5).
Postpartum probable minor depression: PHQ-2 screening accuracy
Following birth (6-weeks, 26-weeks), the PHQ-2 cut-point of ≥ 3 correctly classified 39 and 19% of women with at least probable minor depression but missed 61 and 81% of at-risk women. Only 2.1 and 1.5% of women under the threshold for minor depression would have been asked to complete the EPDS. At a lowered PHQ-2 cut-point of ≥ 2, 73 and 59% of women with probable minor depression were correctly classified at 6-weeks and 26-weeks respectively. Consequently, some 27 and 41% of at-risk women were missed. At the lowered cut-point, 8.9 and 6.1% of low-risk women would have been asked to complete the EPDS unnecessarily. While the alternative method achieved the highest overall sensitivity following birth, correctly classifying 82% of women at 6-weeks and 81% of women at 26-weeks, 19% of low-risk women would have been asked to complete the EPDS at both postpartum time-points. Among those who screened positive on the PHQ-2, the probability of also screening positive for at least probable minor postpartum depression (EPDS ≥ 10) at 6-weeks was greatest using a cut-point ≥ 3 and lowest using the alternative categorical method (PPV = 68% vs 33%, respectively). Similar results were seen at 26-weeks (PPV = 60% vs 34%) (Table 5).
This study evaluated the screening accuracy of the PHQ-2 to inform the future use of the ICHOM core outcome set for pregnancy and childbirth in clinical practice and research. Two methods of case-identification were used: a scoring method with two cut-points (≥ 2 / ≥ 3) and an alternative dichotomous method. The EPDS was used as the reference standard at recommended cut-points [16, 34, 35] and demonstrated acceptable internal consistency reliability across all four time-points (α = .85–.89). In contrast, the mean inter-item correlations of the PHQ-2 (MIC = .51–.60) were high, suggesting some possible overlap between anhedonia and depressive symptoms during the perinatal period.
At the ICHOM recommended PHQ-2 cut-point of ≥ 3, screening accuracy to detect probable major depression was fair during pregnancy (AUC = .77), and poor to good in the postpartum period (AUC = .67–.88). While the recommended cut-point demonstrated low sensitivity during pregnancy (57–60%) and postpartum (36–79%) and missed an unacceptably high number of women with probable major depression, specificity remained high (94–98%), thus minimizing response burden, an identified priority for the ICHOM team. Lowering the cut-point to ≥ 2 achieved the highest screening accuracy for probable major depression of the three methods, demonstrated by good to excellent AUC values at all time-points (AUC = .88–.93) and high sensitivity (86–100%). Further, specificity was moderate throughout (80–89%). In contrast, while the alternative dichotomous method achieved only fair to good diagnostic accuracy (AUC = .79–.86), it did demonstrate the highest overall sensitivity (100% during pregnancy and early postpartum and 93% late postpartum). Specificity was however lower than the other two methods (58–71%). The alternative method thus identified almost all at-risk women but at the cost of increased response burden.
For at least probable minor depression, screening accuracy was highest using the PHQ-2 cut-point of ≥ 2 (AUC = .76–.94) and lowest using the PHQ-2 cut-point of ≥ 3 (AUC = .58–.74). While sensitivity was lowest at a cut-point of ≥ 3 (19 to 50%) it did demonstrate the highest specificity (95–98%). In contrast, while the alternative dichotomous method had the highest sensitivity (81–100%), specificity was low (60–74%).
The ability to compare our findings was hindered by a lack of comparable perinatal studies. A systematic review which evaluated the accuracy of screening instruments for perinatal women identified sparse evidence pertaining to the PHQ-2. Further, the reporting of two similar tools within the literature compounds the challenge of comparing findings, as the Whooley Questions are often confused with, and referred to as, the PHQ-2 and vice versa [27, 28]. The two tools are also referred to, or combined, as case-finding or case-identification methods [29, 45]. As such we compared our findings to the literature pertaining to both the PHQ-2 and the Whooley Questions, where appropriate to do so.
Screening accuracy of case-identifying methods
The literature identifies two almost identical case-identifying methods: the PHQ-2 and the Whooley Questions. In terms of the PHQ-2, authors of a systematic review reported a pooled sensitivity of 0.76 and specificity of 0.87 at a cut-point of ≥ 3 in the general population. Consistent with the current study, sensitivity improved at a cut-point of ≥ 2 (sensitivity = 0.91) and specificity decreased (0.70). Findings of that review are however limited by the largely mixed-gender samples, recruitment from primary/secondary settings and substantial heterogeneity between studies (I² = 81.8%). Only one included study, by Smith and colleagues, reported findings for perinatal women. Similarly, a meta-analysis of the diagnostic accuracy of the Whooley Questions to identify major depression included ten studies with community samples, including two in peripartum women [45, 47]. Consistent with the current study, Bosanquet reported a pooled sensitivity of 95% (95% CI: 88–97%), and pooled specificity of 65% (95% CI: 55–74%) using the dichotomous response format. Maternity-specific evidence regarding diagnostic accuracy is sparse. With a focus on postpartum depression, though Mann and Gilbody identified seven studies reporting either case-identification method (PHQ-2 and Whooley Questions), only one study met inclusion criteria using clinical diagnostic criteria. The included study by Gjerdingen et al. reported diagnostic accuracy of both the PHQ-2 and Whooley Questions. As a screening tool for major postpartum depression, Gjerdingen reported the Whooley method to have 100% sensitivity, 62% specificity, 11% PPV and 100% NPV, comparing favorably to the findings of the current study for probable major depression at six weeks postpartum (sensitivity: 100%, specificity: 71%, PPV: 17%, NPV: 100%). Though Gjerdingen evaluated the PHQ-2, cut-points were not evaluated, preventing further comparison.
In terms of the Whooley Questions, recent work by Littlewood and colleagues found the alternative method (positive response to either or both questions) to demonstrate acceptable sensitivity and specificity but low predictive value during pregnancy (20 weeks: sensitivity: 85%, specificity: 83.7%, PPV: 37.4%) and postpartum (3–4 months: sensitivity: 85.7%, specificity: 80.6%, PPV: 31.4%), which is consistent with current findings. Current study findings for at least probable minor depression at similar time-points reveal similar results following birth, but greater sensitivity during pregnancy (baseline: sensitivity: 100%, specificity: 69%, PPV: 12%) and postpartum (26 weeks: sensitivity: 81%, specificity: 74%, PPV: 34%). The disparity in findings likely reflects the difference in reference standards used; the current study used the EPDS, while Littlewood used the Clinical Interview Schedule-Revised. Further, the impact of question wording differences in terms of time format (last 2 weeks versus past month) is not yet known.
An important observation seen in the current study but not previously evaluated by others is the significant increase in anhedonia seen during pregnancy and significant decrease following birth which was not seen with the PHQ-2 depression question. It is possible that women may experience low mood as pregnancy progresses without a significant change in levels of depressive symptoms. The impact of anhedonia may then artificially inflate the number of women meeting the screening threshold. Further research to identify the impact of each item during the perinatal period is warranted.
EPDS as a reference standard
Consistent with the current study, the PHQ-2 has been evaluated as a modified dichotomous yes/no screening tool using the EPDS as the reference standard [37, 38]. Using an EPDS cut-off score of ≥ 13 to denote probable major depression during pregnancy (15 weeks, 30 weeks) and postpartum (6–16 weeks), Bennett and colleagues reported the ‘modified version of the PHQ-2’ to have high sensitivity (80–93%) and specificity (75–86%), with sensitivity highest during pregnancy and lower postpartum. At the same EPDS cut-point (≥ 13), the current study showed slightly higher sensitivity (93–100%) and lower specificity (60–71%). These differences may be attributed to sample characteristics and research methodology. Bennett’s cross-sectional study recruited young, low-income, less educated women with higher rates of depressive symptoms than the women in the current study. Of significance, the ‘modified version of the PHQ-2’ that Bennett reported evaluating was in fact the Whooley Questions (‘During the past month …’), further demonstrating the inconsistency and potential confusion in instrument reporting.
Consistent with the current study, Chae et al. evaluated the PHQ-2 using the dichotomous method against the EPDS as the reference standard, using a cut-point of ≥ 13, in a sample of women (n = 200) attending well-child clinics in the USA at 6 weeks and 6 months postpartum. Chae reported 100% sensitivity and 79% specificity, which compares well with the 93–100% sensitivity and 70–71% specificity in the current study. More recently, Howard et al. compared the Whooley Questions and the EPDS (cut-point ≥ 13) with clinical diagnostic criteria in a cross-sectional study of 545 women attending their first booking appointment. These authors reported low sensitivity for the Whooley Questions (41%) and the EPDS (59%), with concomitantly high specificity (94–95%). While few comparable studies exist for the Whooley Questions, the sensitivity of the EPDS is lower than usually reported and might be explained by the large sample size, the diverse sample of women and delays in administering the diagnostic interview.
The ICHOM-recommended cut-point of ≥ 3 on the PHQ-2 to screen for probable major depression is consistent with recommendations by the scale developer, despite being based primarily on a validation study with patients in primary and secondary care. Our findings demonstrate that at this cut-point an unacceptably high number of ‘at-risk’ women were missed and would not have received the follow-up EPDS. While a cut-point of ≥ 2 provided the highest accuracy in terms of area under the curve, and optimal sensitivity and specificity during pregnancy, screening accuracy was lower following birth. In clinical practice, where the identification of high-risk women is crucial, we would argue that the alternative dichotomous method missed the fewest at-risk women and is the most appropriate method of case-identification during the perinatal period. Further, this method maintained the highest sensitivity even with a lowered EPDS cut-point to denote at least probable minor depression. This finding indicates that, compared with the scoring method, the alternative dichotomous method would better identify women with lower levels of depressive symptoms and so better inform clinical decision-making. While a higher false positive rate is noted with this method, we argue that where universal screening using the EPDS is already practiced, this method could reduce response burden for many women.
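The trade-off between the two case-identification methods can be illustrated with a short Python sketch. It is not the authors' analysis: the ten records are invented purely for illustration, the helper names are hypothetical, and treating any item score of 1 or more as a "yes" is an assumption about how a dichotomous response could be derived from the standard 0–3 item ratings. The EPDS cut-off of ≥ 13 is the postpartum threshold for probable major depression used in this study.

```python
# HYPOTHETICAL records, not study data: (PHQ-2 item 1, PHQ-2 item 2, EPDS total).
WOMEN = [
    (0, 0, 3), (1, 0, 6), (0, 1, 14), (1, 1, 13), (2, 2, 17),
    (0, 0, 5), (1, 0, 8), (3, 2, 20), (0, 0, 2), (1, 1, 9),
]
EPDS_CUT = 13  # postpartum cut-off for probable major depression


def scored_positive(item1, item2, cut_point):
    """Scoring method: items rated 0-3 are summed and compared with a cut-point."""
    return item1 + item2 >= cut_point


def dichotomous_positive(item1, item2):
    """Dichotomous method: positive if either question is endorsed.

    ASSUMPTION: any item score of 1 or more counts as a 'yes'.
    """
    return item1 >= 1 or item2 >= 1


def sens_spec(screen, reference):
    """Return (sensitivity, specificity) of screen results against the reference."""
    pairs = list(zip(screen, reference))
    tp = sum(1 for s, r in pairs if s and r)
    fn = sum(1 for s, r in pairs if not s and r)
    tn = sum(1 for s, r in pairs if not s and not r)
    fp = sum(1 for s, r in pairs if s and not r)
    return tp / (tp + fn), tn / (tn + fp)


reference = [epds >= EPDS_CUT for _, _, epds in WOMEN]
methods = {
    "PHQ-2 >= 3": [scored_positive(a, b, 3) for a, b, _ in WOMEN],
    "PHQ-2 >= 2": [scored_positive(a, b, 2) for a, b, _ in WOMEN],
    "dichotomous": [dichotomous_positive(a, b) for a, b, _ in WOMEN],
}
for name, screen in methods.items():
    sensitivity, specificity = sens_spec(screen, reference)
    print(f"{name}: sensitivity {sensitivity:.0%}, specificity {specificity:.0%}")
```

In this toy data set the scored method at ≥ 3 misses half of the reference-positive women but produces no false positives, whereas the dichotomous method misses none at the cost of more false positives. This mirrors the direction, though not the magnitude, of the pattern reported above.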
In Australia, government-funded projects are working towards a standardized approach to outcome reporting in maternity-related practice and research to improve data synthesis and outcomes for women and their babies. Our findings offer important evidence and recommendations to support standardization.
Strengths and limitations
Despite our best efforts, our findings have some limitations. Our study included a cohort of women from one Australian birthing facility and identified a low prevalence of women with probable major depression during pregnancy (around 2%) and postpartum (around 6%), which reduces the precision of the accuracy estimates. While our prevalence rates are slightly lower than those reported by Gaynes et al. during pregnancy (3.1–4.9%), our findings are comparable to reported postpartum rates (1.0–5.9%). Recruiting larger, more diverse samples would improve the precision of the prevalence estimate and the generalizability of findings. We used the EPDS as the reference standard rather than a clinical diagnosis of depression, which would generally be considered the gold standard. However, the aim here was to evaluate the PHQ-2 against the EPDS as would occur in real-life clinical practice using the ICHOM core outcome set, and not to diagnose depression. Our findings were strengthened by the comprehensive nature of our analysis, which included two case-identification methods over four time-points. Further, we applied widely accepted EPDS cut-points to denote both probable major depression and at least probable minor depression.
ICHOM recommends the PHQ-2 to screen women using a scoring method at a defined cut-point. The recommended method has been shown to be inadequate as a probable perinatal depression case-identification method using the EPDS as the reference standard. To optimise the number of women identified as ‘at-risk’ in clinical practice, we recommend a dichotomous two-item case-identification method, which is consistent with the recommendations of international guidelines and has been shown to be acceptable to women. Further, to address the current confusion surrounding the use and reporting of the case-identification method [27, 28], we recommend the use of the Whooley Questions rather than the PHQ-2. The use and reporting of consistent question wording and response format will improve future outcome reporting and synthesis.
Availability of data and materials
The de-identified dataset used and analysed for this study is available from the corresponding author upon reasonable request so that appropriate data transfer agreements can be established.
Area under the curve
Edinburgh postnatal depression scale
International consortium for health outcomes measurement
Mean inter-item correlation
Negative likelihood ratio
Negative predictive value
Patient health questionnaire
Positive likelihood ratio
Positive predictive value
Receiver operating characteristic
Buist A, Bilszta J. The beyondblue National Postnatal Screening Program, Prevention and Early Intervention 2001–2005, Final report: Vol 1: National Screening Program. Melbourne: beyondblue; 2006.
Woolhouse H, Gartland D, Hegarty K, Donath S, Brown SJ. Depressive symptoms and intimate partner violence in the 12 months after childbirth: a prospective pregnancy cohort study. BJOG Int J Obstet Gynaecol. 2012;119(3):315–23.
Murray L, Arteche A, Fearon P, Halligan S, Croudace T, Cooper P. The effects of maternal postnatal depression and child sex on academic performance at age 16 years: a developmental approach. J Child Psychol Psychiatry. 2010;51(10):1150–9.
Warfa N, Harper M, Nicolais G, Bhui K. Adult attachment style as a risk factor for maternal postnatal depression: a systematic review. BMC psychology. 2014;2(1):56.
Jefferies D, Horsfall D, Schmied V. Blurring reality with fiction: Exploring the stories of women, madness, and infanticide. Women Birth. 2017;30(1):E24–31.
Queensland Maternal & Perinatal Quality Council. Maternal & perinatal mortality & morbidity in Queensland. Herston: Queensland Health; 2015.
National Collaborating Centre for Mental Health. Antenatal and postnatal mental health. The NICE guideline on clinical management and service guidance. Updated edition. National clinical guideline 192. 2018.
Townsend R, Duffy JMN, Khalil A. Increasing value and reducing research waste in obstetrics: towards woman-centred research. Ultrasound Obstet Gynecol. 2020;55(2):151–6.
Core Outcome Measures in Effectiveness Trials (COMET). Core Outcome Measures in Effectiveness Trials [Internet]. http://www.comet-initiative.org/. Accessed 10 Dec 2019.
Core Outcomes in Women's and Newborn Health. CROWN Core Outcomes in Women's and Newborn Health [Internet]. Available from: http://www.crown-initiative.org/. Accessed 10 Dec 2019.
Williamson PR, Altman DG, Bagley H, Barnes KL, Blazeby JM, Brookes ST, et al. The COMET handbook: version 1.0. Trials. 2017;18(Suppl 3):280.
International Consortium for Health Outcomes Measurement (ICHOM). Pregnancy & childbirth. Data collection reference guide. Version 1.1. 2016 https://ichom.org/files/medical-conditions/pregnancy-and-childbirth/pregnancychildbirth-reference-guide.pdf. Accessed 9th April 2019.
Prinsen CAC, Spuls PI, Kottner J, Thomas KS, Apfelbacher C, Chalmers JR, et al. Navigating the landscape of core outcome set development in dermatology. J Am Acad Dermatol. 2019;81(1):297–305.
Nijagal MA, Wissig S, Stowell C, Olson E, Amer-Wahlin I, Bonsel G, et al. Standardized outcome measures for pregnancy and childbirth, an ICHOM proposal. BMC Health Serv Res. 2018;18(1):953.
Kroenke K, Spitzer RL, Williams JBW. The Patient Health Questionnaire-2: validity of a two-item depression screener. Med Care. 2003;41(11):1284–92.
Cox JL, Holden JM, Sagovsky R. Detection of postnatal depression. Development of the 10-item Edinburgh postnatal depression scale. Br J Psychiatry. 1987;150(6):782–6.
Herrman H, Patrick DL, Diehr P, Martin ML, Fleck M, Simon GE, et al. Longitudinal investigation of depression outcomes in primary care in six countries: the LIDO study. Functional status, health service use and treatment of people with depressive symptoms. Psychol Med. 2002;32(5):889–902.
O’Connor E, Rossom RC, Henninger M, Groom HC, Burda BU. Primary care screening for and treatment of depression in pregnant and postpartum women: evidence report and systematic review for the US preventive services task force. JAMA. 2016;315(4):388–406.
Austin M-P, Highet N. Mental health care in the perinatal period: Australian clinical practice guideline. Melbourne: Centre of Perinatal Excellence; 2017.
American College of Obstetricians and Gynecologists. ACOG Committee Opinion No. 757: screening for perinatal depression. Obstet Gynecol. 2018;132(5):e208–e212.
BC Reproductive Mental Health Program and Perinatal Services BC. Best practice guidelines for mental health disorders in the perinatal period. 2014. http://www.perinatalservicesbc.ca/Documents/Guidelines-Standards/Maternal/MentalHealthDisordersGuideline.pdf. Accessed 18 August 2019.
Gibson J, McKenzie-McHarg K, Shakespeare J, Price J, Gray R. A systematic review of studies validating the Edinburgh postnatal depression scale in antepartum and postpartum women. Acta Psychiatr Scand. 2009;119(5):350–64.
Hewitt C, Gilbody S, Brealey S, Paulden M, Palmer S, Mann R, et al. Methods to identify postnatal depression in primary care: an integrated evidence synthesis and value of information analysis. Health Technol Assess. 2009;13(36):1–145, 147–230.
Hewitt CE, Gilbody SM, Mann R, Brealey S. Instruments to identify post-natal depression: which methods have been the most extensively validated, in what setting and in which language? Int J Psychiatry Clin Pract. 2010;14(1):72–6.
Spitzer RL, Williams JBW, Kroenke K, Linzer M, deGruy FV, Hahn SR, et al. Utility of a new procedure for diagnosing mental disorders in primary care: the PRIME-MD 1000 study. JAMA. 1994;272(22):1749–56.
Whooley MA, Avins AL, Miranda J, Browner WS. Case-finding instruments for depression. Two questions are as good as many. J Gen Intern Med. 1997;12(7):439–45.
Manea L, Gilbody S, Hewitt C, North A, Plummer F, Richardson R, et al. Identifying depression with the PHQ-2: a diagnostic meta-analysis. J Affect Disord. 2016;203:382–95.
Bosanquet K, Bailey D, Gilbody S, Harden M, Manea L, Nutbrown S, et al. Diagnostic accuracy of the Whooley questions for the identification of depression: a diagnostic meta-analysis. BMJ Open. 2015;5(12):e008913.
Mann R, Gilbody S. Validity of two case finding questions to detect postnatal depression: a review of diagnostic test accuracy. J Affect Disord. 2011;133(3):388–97.
Slavin V, Gamble J, Creedy DK, Fenwick J. Measuring physical and mental health during pregnancy and postpartum in an Australian childbearing population: validation of the PROMIS Global Short Form. BMC Pregnancy Childbirth. 2019;19:370.
Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig L, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015;351(12):h5527.
Cohen JF, Korevaar DA, Gatsonis CA, Glasziou PP, Hooft L, Moher D, et al. STARD for Abstracts: essential items for reporting diagnostic accuracy studies in journal or conference abstracts. BMJ. 2017;358:j3751.
American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders: DSM-5. 5th ed. Washington, D.C: American Psychiatric Association; 2013.
Murray D, Cox JL. Screening for depression during pregnancy with the Edinburgh depression scale (EPDS). J Reprod Infant Psychol. 1990;8(2):99–107.
Matthey S, Henshaw C, Elliott S, Barnett B. Variability in use of cut-off scores and formats on the Edinburgh postnatal depression scale–implications for clinical and research practice. Arch Women's Ment Health. 2006;9(6):309–15.
Corp IBM. IBM SPSS statistics for windows, version 25.0. Armonk, NY: IBM Corp; 2017.
Bennett IM, Coco A, Coyne JC, Mitchell AJ, Nicholson J, Johnson E, et al. Efficiency of a two-item pre-screen to reduce the burden of depression screening in pregnancy and postpartum: an IMPLICIT network study. J Am Board Fam Med. 2008;21(4):317–25.
Chae SY, Chae MH, Tyndall A, Ramirez MR, Winter RO. Can we effectively use the two-item PHQ-2 to screen for postpartum depression? Fam Med. 2012;44(10):698–703.
Cohen JW. Statistical power analysis for the behavioral sciences. 2nd ed. New York: Erlbaum; 1988.
Sergeant E. Epitools epidemiological calculators. Ausvet Pty Ltd; 2019. http://epitools.ausvet.com.au/ Accessed 11 August 2019.
Nunnally JC. Psychometric theory. 2nd ed. New York: McGraw-Hill; 1978.
Briggs SR, Cheek JM. The role of factor analysis in the development and evaluation of personality scales. J Pers. 1986;54(1):106–48.
Metz CE. Basic principles of ROC analysis. Semin Nucl Med. 1978;8(4):283–98.
Tape TG. Interpreting diagnostic tests [Internet]. The area under an ROC Curve. http://gim.unmc.edu/dxtests/ROC3.htm. Accessed 23 Sept 2019.
Mann R, Adamson J, Gilbody SM. Diagnostic accuracy of case-finding questions to identify perinatal depression. CMAJ. 2012;184(8):E424–30.
Smith M, Gotman N, Lin H, Yonkers K. Do the PHQ-8 and the PHQ-2 accurately screen for depressive disorders in a sample of pregnant women? Gen Hosp Psychiatry. 2010;32(5):544–8.
Gjerdingen D, Crow S, McGovern P, Miner MP, Center B. Postpartum depression screening at well-child visits: validity of a 2-question screen and the PHQ-9. Ann Fam Med. 2009;7(1):63–70.
Littlewood E, Ali S, Ansell P, Dyson L, Gascoyne S, Hewitt C, et al. Identification of depression in women during pregnancy and the early postnatal period using the Whooley questions and the Edinburgh postnatal depression scale: protocol for the born and bred in Yorkshire: PeriNatal depression diagnostic accuracy (BaBY PaNDA) study. BMJ Open. 2016;6(6):e011223.
Lewis G, Pelosi AJ, Araya R, Dunn G. Measuring psychiatric disorder in the community: a standardized assessment for use by lay interviewers. Psychol Med. 1992;22(2):465–86.
Howard LM, Ryan EG, Trevillion K, Anderson F, Bick D, Bye A, et al. Accuracy of the Whooley questions and the Edinburgh postnatal depression scale in identifying depression and other mental disorders in early pregnancy. Br J Psychiatry. 2018;212(1):50–6.
Gaynes B, Gavin N, Meltzer-Brody S, Lohr K, Swinson T, Gartlehner G, et al. Perinatal depression: prevalence, screening accuracy, and screening outcomes. Evidence report/technology assessment no. 119. AHRQ publication no. 05-E006–2. Rockville, MD, Agency for Healthcare Research and Quality; 2005.
We would like to acknowledge the support and guidance of Dr. Julie Pallant during the statistical analysis phase of this study. We are grateful to all the staff and midwives who assisted and supported the recruitment of participants at the study site. Most importantly, we would like to acknowledge the women who participated in this study.
The MoMeNT study was supported by a grant awarded by the Gold Coast Hospital and Health Service Research Grants Committee (Ref: 015–01.02.17). The funding body played no role in the study design, data collection, data analysis or interpretation of findings.
Ethics approval and consent to participate
Ethical approval was obtained to conduct this study from Gold Coast Hospital and Health Service Human Research Ethics Committee (HREC/17/QGC/127) and Griffith University (GU Ref No: 2017/625). Written informed consent to participate was obtained from all participants.
Consent for publication
Competing interests
The authors declare that they have no competing interests.
Cite this article
Slavin, V., Creedy, D.K. & Gamble, J. Comparison of screening accuracy of the Patient Health Questionnaire-2 using two case-identification methods during pregnancy and postpartum. BMC Pregnancy Childbirth 20, 211 (2020). https://doi.org/10.1186/s12884-020-02891-2