Comparison of screening accuracy of the Patient Health Questionnaire-2 using two case-identification methods during pregnancy and postpartum

Background Variation exists regarding perinatal depression screening. A two-step screening method has been recommended. According to a maternity-focused core outcome set developed by the International Consortium for Health Outcomes Measurement, women who score 3 or more on the PHQ-2 then complete the Edinburgh Postnatal Depression Scale (EPDS). Limited evidence exists regarding the screening accuracy of the PHQ-2 in childbearing women. An alternative case-identification method may be more sensitive for perinatal women. We aimed to (1) evaluate the screening accuracy of the PHQ-2 during the perinatal period using two case-identification methods, and (2) measure the variability of accuracy over four time-points during pregnancy and postpartum.
Methods A prospective, longitudinal cohort study was conducted with 309 consecutive women who completed the PHQ-2 and EPDS during pregnancy (booking, 36-weeks) and postpartum (6-, 26-weeks). EPDS was the reference standard using cut-off scores for ‘at least probable minor depression’ during pregnancy (≥ 13) and postpartum (≥ 10) and for ‘probable major depression’ during pregnancy (≥ 15) and postpartum (≥ 13). PHQ-2 was analysed using two methods: (1) scored (cut-points ≥ 2 and ≥ 3), (2) dichotomous yes/no (positive response to either question) against EPDS cut-points for at least probable minor and probable major depression. Receiver operating characteristic analyses determined accuracy.
Results Probable major depression: Over four time-points PHQ-2 ≥ 3 revealed lowest sensitivity (36–79%) but highest specificity (94–98%). An alternative case-identification method revealed high sensitivity (93–100%), but lowest specificity (58–71%). Minor depression: PHQ-2 ≥ 3 revealed the lowest sensitivity (19–50%) but highest specificity (95–98%). An alternative case-identification method revealed the highest sensitivity (81–100%) and moderate specificity (60–74%).
Conclusions The recommended method of case-identification (PHQ-2 ≥ 3) missed an unacceptable number of women at risk of depression. As a clinical decision-making tool, an alternative, dichotomous method maximized case-identification and is recommended. Further, the literature showed inconsistent reporting of the PHQ-2 and the alternative case-identification method, which hinders the ability to synthesise data. The future use and reporting of consistent question wording and response format will improve outcome reporting and synthesis. Further research in larger and diverse maternity populations is recommended.


Background
In Australia, around one in ten women experience depression during pregnancy [1] and one in six during the year following birth [2]. Untreated maternal depression has been consistently associated with poorer outcomes for infants including impaired attachment and cognitive deficits, with effects still adversely impacting on children at age 16, especially boys [3]. If not addressed, perinatal depression can create intergenerational difficulties [4]. In extreme cases, women may attempt or complete suicide or infanticide [5,6]. The high burden of perinatal depression demands effective strategies to prevent and improve symptoms. Significant variation in depression outcomes, measures, and case definitions in comparative effectiveness trials limits data synthesis [7], and contributes to significant research wastage in perinatal research [8].
To address such issues the Core Outcome Measures in Effectiveness Trials (COMET) [9] and the Core Outcomes in Women's and Newborn health (CROWN) [10] Initiatives advocate a standardized research approach using core outcome sets. A core outcome set is an agreed set of outcomes that should be measured and reported, as a minimum, in all clinical trials of specific health or health care [11]. In 2016 the International Consortium for Health Outcomes Measurement (ICHOM) published a standard set of outcome measures to evaluate value in maternity care [12]. Standard sets are the same as core outcome sets but have a clear focus on clinical practice [13]. An ICHOM working party comprising two consumers and nineteen international experts convened to identify outcomes and measurement instruments for inclusion in their set [14]. Using a modified Delphi technique and consensus process, mental health was identified as an outcome important to women, and the Patient Health Questionnaire (PHQ-2) [15] and Edinburgh Postnatal Depression Scale (EPDS) [16] were identified as the most appropriate measures of symptoms of perinatal depression to be included in the set.
ICHOM [12,14] recommends the two-item PHQ-2 as a case-identification method for all women, followed by the 10-item EPDS only for women who screen positive on the PHQ-2 (defined cut-point of 3 or more). The sensitivity and specificity of the PHQ-2 to identify childbearing women at risk of depressive symptoms, as defined by accepted cut-points on the EPDS, are under-researched. As part of a core outcome set designed for use in clinical practice, high sensitivity on the PHQ-2 is vital to ensure all at-risk women also receive the EPDS. While probable major depression is the focus of ICHOM's recommendation, minor depressive symptoms are also linked to poor quality of life and are important to consider during clinical decision-making [17].

Clinical recommendations for depression screening
Screening women for depression during the perinatal period may reduce depressive symptoms and prevalence [18] but international consensus regarding the best approach is lacking. In Australia, where the current study was conducted, a universal screening approach is recommended at least twice during pregnancy (early and late pregnancy), and at least twice in the first postpartum year [19]. While a universal screening approach is also recommended in the United States [20] and Canada [21], the United Kingdom (UK) recommends selective screening as an adjunct to clinical practice [7]. Like the ICHOM approach, UK clinicians ask two case-identification questions at first contact during pregnancy and again during the early postpartum period, with further assessment only for women who respond positively to either question.

Relevant depression screening instruments
The EPDS is the most widely used, evaluated and validated measure of depressive symptoms [22,23,24]. Most clinical guidelines, including Australia's, recommend the use of the EPDS, either as a primary [19,21] or second-step screen [7] to inform clinical decision making. In terms of case-identification methods, two questions originating from the PRIME-MD diagnostic interview [25] are asked:
1. During the past month, have you been bothered by little interest or pleasure in doing things?
2. During the past month, have you been bothered by feeling down, depressed or hopeless?
The questions can be asked using two formats that differ in terms of timing and response format. When framed to recall symptoms over the past month, using a 'yes' or 'no' response, the questions are known as the Whooley Questions [26]. In contrast, when framed to recall over the past 2 weeks, using a four-option response, the questions are known as the Patient Health Questionnaire-2 [15]. While ICHOM recommends the use of the PHQ-2 for case-identification, guidance in the UK recommends the Whooley Questions approach. Evidence regarding the diagnostic accuracy of the two approaches comes from two systematic reviews conducted in general populations. Findings from Manea et al. [27] showed the PHQ-2 to have moderate sensitivity (76%) at the recommended cut-point of three or more, which improved to 91% at a lower cut-point. Bosanquet and colleagues showed the Whooley Questions approach to have the highest sensitivity (95%) [28]. In terms of maternity-focused evidence, a review conducted by Mann and Gilbody [29] included both case-finding methods (PHQ-2 and Whooley Questions) to detect postpartum depression. With only one included paper, these authors concluded that there was limited evidence in support of the case-finding questions to detect postpartum depression and that more research was needed. To inform the ongoing implementation of the ICHOM core outcome set in clinical practice, the current study aimed to: (1) evaluate the screening accuracy of the PHQ-2 during the perinatal period using two case-identification methods (reference standard: EPDS), and (2) measure the variability of accuracy over four time-points during pregnancy and postpartum.

Study design
The current study is part of a larger body of work. The MoMeNT study (Models Meeting Needs over Time) is a prospective, longitudinal, cohort study which aimed to (1) evaluate the effectiveness of midwife continuity of carer on perinatal mental health and mother-infant bonding and (2) assess the feasibility of the ICHOM core outcome set in the Australian context; it is fully described elsewhere [30]. Feasibility of the ICHOM core outcome set includes the psychometric evaluation of included measures. The current study is designed to address our feasibility aim and is reported in accordance with STARD (Standards for Reporting of Diagnostic Accuracy) criteria [31,32]; see STARD Checklist [Additional File 1].

Setting, participants and sample size
Participants were recruited from one publicly-funded tertiary referral hospital in south-east Queensland.
Participants were required to be English-literate, aged 18 years or older, 27-weeks gestation or less and have access to email and mobile phone. Women under the current care of a psychiatrist were excluded. Data collection took place between August 2017 and January 2019. Sample size was calculated to evaluate the broad effect of model of maternity care on maternal health and wellbeing. To identify a mean difference (two-tail) with a 50% effect size, 5% estimated error and 95% power, 210 participants were required. To allow for 20% attrition 252 participants were needed.
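The sample size reasoning above can be approximated with a short calculation. The sketch below is illustrative only: it assumes that the "50% effect size" refers to a standardized mean difference (Cohen's d = 0.5) between two equally sized groups and that the "5% estimated error" is a two-tailed α of .05. It uses the normal approximation rather than whatever software the authors used, so it gives 104 per group (208 in total), close to but not identical with the reported 210; a t-based calculation usually adds about one participant per group. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.95) -> int:
    """Normal-approximation sample size per group for a two-tailed,
    two-group comparison of means (assumed design, not confirmed by the source)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for two-tailed alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the target power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


def inflate_for_attrition(n: int, attrition_pct: int) -> int:
    """Add a percentage allowance for attrition, as described in the text (n * 1.20)."""
    return math.ceil(n * (100 + attrition_pct) / 100)


per_group = n_per_group(0.5)
print(per_group, 2 * per_group)        # approximate per-group and total sample size
print(inflate_for_attrition(210, 20))  # reported 210 plus a 20% allowance
```

Applying the 20% allowance to the reported 210 participants reproduces the target of 252.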

Defining perinatal mental health
Depression was operationalized using the definitions outlined by the American Psychiatric Association [33]. Depression describes the presence of sad, empty, or irritable mood, accompanied by somatic and cognitive changes that significantly affect the individual's capacity to function. Anhedonia describes markedly diminished interest or pleasure in almost all activities. Major depression is defined as the presence of five or more symptoms (depressed mood, anhedonia, weight change, sleep disturbance, psychomotor problems, lack of energy, excessive guilt, poor concentration, suicidal ideation) during the same 2-week period that represent a change from previous functioning, with at least one of the symptoms being either depressed mood or loss of interest or pleasure. Minor depression is defined as the presence of at least two depressive symptoms without meeting the criteria for major depression.

Measures
Surveys included the PHQ-2 and EPDS. Woman-reported socio-demographic data known to influence maternal mental health were collected at baseline including age (years), parity (number of births after 20 weeks gestation), gestation (weeks), educational attainment (low: secondary school year 12 or less; high: completed apprenticeship, diploma or tertiary degree), relationship status (single or in a relationship), and weekly combined income (low: less than $1500; high: $1500 or more).

Patient Health Questionnaire (PHQ-2)
The PHQ-2 [15] screens for possible depression and anhedonia. The stem question asks, "Over the last 2 weeks, how often have you been bothered by any of the following problems?" The two items are, "little interest or pleasure in doing things" and "feeling down, depressed, or hopeless". For each item, the response options are "not at all" (0), "several days" (1), "more than half the days" (2), and "nearly every day" (3). PHQ-2 scores range from 0 to 6. Higher scores represent greater depressive symptoms.
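To make the two case-identification methods concrete, the following sketch scores a single PHQ-2 response both ways. It is an illustration rather than the authors' SPSS procedure. In particular, treating any response other than "not at all" as a "yes" for the dichotomous method is an assumption, and the names `Phq2Result` and `score_phq2` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phq2Result:
    total: int                  # summed score, 0 to 6
    positive_cut2: bool         # scored method, lowered cut-point (>= 2)
    positive_cut3: bool         # scored method, recommended cut-point (>= 3)
    positive_dichotomous: bool  # alternative yes/no method


def score_phq2(item1: int, item2: int) -> Phq2Result:
    """Score one PHQ-2 response using the scored and dichotomous methods."""
    for value in (item1, item2):
        if value not in (0, 1, 2, 3):
            raise ValueError("PHQ-2 item responses must be 0, 1, 2 or 3")
    total = item1 + item2
    return Phq2Result(
        total=total,
        positive_cut2=total >= 2,
        positive_cut3=total >= 3,
        # Assumption: any response above "not at all" (0) counts as a "yes".
        positive_dichotomous=item1 >= 1 or item2 >= 1,
    )


# A woman reporting low mood on "several days" and no anhedonia screens
# positive only under the dichotomous method.
print(score_phq2(0, 1))
```

Under this assumption the dichotomous method is equivalent to a total score of 1 or more, which is why it flags more women than either scored cut-point.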

The Edinburgh Postnatal Depression Scale (EPDS)
The 10-item EPDS is a self-report measure to screen women for symptoms of depression during pregnancy [34] and postpartum [16]. Questions are framed, 'In the past 7 days …' with a frequency-based response on a four-point Likert scale, scored from 0 to 3. Recommended cut-off scores for 'at least probable minor depression' during pregnancy (≥ 13) and postpartum (≥ 10) and for 'probable major depression' during pregnancy (≥ 15) and postpartum (≥ 13) were used [35]. EPDS question items and screening accuracy at several cut-points as reported by NICE [7] are presented [Additional file 2].
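The period-specific cut-offs described above can be expressed as a small lookup. This is a minimal sketch of how the reference standard could be dichotomized; the dictionary and function names are hypothetical, while the thresholds are those stated in the text.

```python
# EPDS cut-off scores used as the reference standard, keyed by
# (period, severity). Values are taken from the text above.
EPDS_CUTOFFS = {
    ("pregnancy", "minor"): 13,   # at least probable minor depression
    ("pregnancy", "major"): 15,   # probable major depression
    ("postpartum", "minor"): 10,
    ("postpartum", "major"): 13,
}


def epds_screen_positive(total: int, period: str, severity: str) -> bool:
    """Return True when an EPDS total score meets the cut-off for the given
    period ("pregnancy" or "postpartum") and severity ("minor" or "major")."""
    if not 0 <= total <= 30:  # 10 items, each scored 0 to 3
        raise ValueError("EPDS total score must be between 0 and 30")
    return total >= EPDS_CUTOFFS[(period, severity)]


# The same score of 12 is below the minor-depression threshold in pregnancy
# but above it postpartum.
print(epds_screen_positive(12, "pregnancy", "minor"))
print(epds_screen_positive(12, "postpartum", "minor"))
```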

Procedures
Consecutive, eligible women attending antenatal care with a midwife were approached about the study. Women who provided written informed consent were sent a survey link by email and text message. Women who failed to respond were sent two friendly reminders and one telephone follow-up call, 2-3 days apart. Follow-up surveys were sent at 36-weeks, and 6- and 26-weeks postpartum. Women who failed to respond to two consecutive surveys were deemed lost to follow-up. All surveys provided information and contact numbers for national support groups. Participants were also offered the opportunity to contact the project manager to discuss any negative feelings. Survey completion occurred outside of clinical care where universal screening for depression is attended. Women who screened positive were not followed up but helpline information was provided in each survey. Ethical approval was granted by the relevant Hospital and Health Service (HREC/17/QGC/127) and University Human Research Ethics Committees (GU Ref No: 2017/625).

Approach to analysis
Using SPSS version 25 [36] the PHQ-2 was analysed using two methods: (a) total score at two cut-points (≥ 2 and ≥ 3) and (b) dichotomous categorical variable (positive response to either or both questions). Consistent with the work of others [37,38], the EPDS was the reference standard. EPDS total score was transformed to a binary variable to indicate at least probable minor depression using the cut-off score of ≥ 13 = positive during pregnancy, and ≥ 10 = positive postpartum. Cut-off scores of ≥ 15 and ≥ 13 were used to denote probable major depression during pregnancy and postpartum respectively [35]. Group differences in mental health during pregnancy and postpartum, and for non-completing women at 26-weeks, were assessed using chi-square. Effect size was interpreted using Cohen's criteria [39]. Missing data were managed using listwise deletion for computing total scale scores and Cronbach's α and pairwise deletion for all other analyses. Prevalence of depression risk according to the EPDS and positive PHQ-2 responses are presented as frequencies and percentages with 95% confidence intervals using Clopper-Pearson exact tests for binary probability [40]. Changes in the proportion of participants who screened positive on the EPDS and PHQ-2 across the four time-points (variability of accuracy) were assessed using Cochran's Q Test. Significance values were adjusted by the Bonferroni correction for multiple tests. Significance was set at p < .05. For brevity the term 'minor depression' is used to denote 'at least probable minor depression' and 'major depression' is used to denote 'probable major depression'.
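As an illustration of the exact confidence intervals mentioned above, the sketch below computes a Clopper-Pearson interval for a proportion using only the Python standard library. It finds the bounds by bisection on the binomial distribution rather than with the beta quantile function that statistical packages normally use, so it is an illustrative stand-in, not a reproduction of the SPSS output. The counts in the example are hypothetical.

```python
import math


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability mass, computed on the log scale for stability."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return min(1.0, sum(binom_pmf(i, n, p) for i in range(k + 1)))


def _root(f, lo: float = 0.0, hi: float = 1.0, iterations: int = 100) -> float:
    """Bisection for a function that changes sign exactly once on [lo, hi]."""
    f_lo = f(lo)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided confidence interval for x positives out of n."""
    if not 0 <= x <= n or n == 0:
        raise ValueError("require n > 0 and 0 <= x <= n")
    # Lower bound: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else _root(
        lambda p: 1.0 - binom_cdf(x - 1, n, p) - alpha / 2)
    # Upper bound: the p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else _root(
        lambda p: binom_cdf(x, n, p) - alpha / 2)
    return lower, upper


# Hypothetical example: 12 positive screens among 200 respondents.
print(clopper_pearson(12, 200))
```

Exact intervals of this kind are conservative and remain valid when the number of positive screens is small, which matters at the low prevalence levels typical of perinatal samples.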

Internal consistency
The internal consistency of the EPDS and PHQ-2 was assessed at all four time-points. Cronbach's alpha coefficients (α) exceeding 0.7 were considered acceptable [41].
For scales with few items the mean inter-item correlation (MIC) is more accurate and reported for the PHQ-2. An ideal range was considered 0.2-0.4 [42].
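For readers less familiar with these two reliability indices, the sketch below computes Cronbach's α and the mean inter-item correlation from item-level responses. It is a plain standard-library illustration with made-up data, not the authors' SPSS output; the function names are hypothetical.

```python
from itertools import combinations
from statistics import mean, variance


def cronbach_alpha(rows: list[list[float]]) -> float:
    """Cronbach's alpha; each row holds one respondent's item scores."""
    k = len(rows[0])
    items = list(zip(*rows))                       # transpose to per-item columns
    item_variances = sum(variance(col) for col in items)
    total_variance = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - item_variances / total_variance)


def pearson_r(x, y) -> float:
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mean_inter_item_correlation(rows: list[list[float]]) -> float:
    """Average of all pairwise item correlations (one pair for the PHQ-2)."""
    items = list(zip(*rows))
    return mean(pearson_r(a, b) for a, b in combinations(items, 2))


# Made-up responses from four respondents to a two-item scale.
responses = [[0, 1], [1, 0], [2, 3], [3, 2]]
print(mean_inter_item_correlation(responses))
print(cronbach_alpha(responses))
```

With only two items the mean inter-item correlation is simply the correlation between them, and when the two items have equal variance α equals 2r / (1 + r). This is why a two-item scale can show a high inter-item correlation alongside a moderate α.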

Criterion validity
Screening performance of the PHQ-2, defined as 'area under the curve' (AUC), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (PLR) and negative likelihood ratio (NLR), was assessed against the EPDS total score cut-points using ROC (Receiver Operating Characteristic) analysis [43] at all four time-points. The standard error for the area was set as non-parametric with a 95% confidence interval. Area under the curve (AUC) was interpreted according to the criteria by Tape [44] as: AUC = 0.60-0.70 = poor, 0.70-0.80 = fair, 0.80-0.90 = good, 0.90-1.0 = excellent. For the purposes of informing the ICHOM core outcome set and achieving a maximum likelihood that all 'at risk' women would be administered the EPDS, the optimal cut-off value on the PHQ-2 was set at 100% sensitivity.
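The accuracy indices listed above all derive from the 2 × 2 cross-tabulation of the index test (PHQ-2) against the reference standard (EPDS). The sketch below shows those definitions with hypothetical counts; it is not the authors' SPSS procedure. One point worth noting is that when a test is reduced to a single binary cut-point, the non-parametric (trapezoidal) AUC equals the mean of sensitivity and specificity. This appears consistent with the values reported in this paper: for example, 36% sensitivity with roughly 98% specificity gives an AUC of about .67.

```python
import math


def screening_metrics(tp: int, fn: int, fp: int, tn: int) -> dict[str, float]:
    """Accuracy indices for a binary index test against a binary reference.

    tp/fn: reference-positive women who screened positive/negative.
    fp/tn: reference-negative women who screened positive/negative.
    """
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one reference-positive and one reference-negative case")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    false_positive_rate = fp / (fp + tn)
    false_negative_rate = fn / (tp + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp) if tp + fp else math.nan,
        "npv": tn / (tn + fn) if tn + fn else math.nan,
        "plr": sensitivity / false_positive_rate if false_positive_rate else math.inf,
        "nlr": false_negative_rate / specificity if specificity else math.inf,
        # Trapezoidal AUC for a single operating point.
        "auc": (sensitivity + specificity) / 2,
    }


# Hypothetical counts: 10 reference-positive and 90 reference-negative women.
for name, value in screening_metrics(tp=8, fn=2, fp=18, tn=72).items():
    print(f"{name}: {value:.3f}")
```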

Sample characteristics
A STARD diagram presents data pertaining to recruitment, attrition and cross-tabulation of index test/reference standard (Fig. 1). The first survey was commenced by 309 pregnant women between 10 and 27-weeks gestation (M = 19.7, SD = 3.7). Table 1 presents group differences for women exceeding the EPDS cut-point for at least probable minor depression during pregnancy and postpartum and those below the cut-point. In early pregnancy, women who exceeded the cut-point (minor depression) were more likely to report a past history of mental health disorder (18.6% vs 1.5%, medium effect), report current cigarette use (37.5% vs 3.1%, medium effect) and lower income (7.8% vs 1.9%, small effect), compared to their non-depressed counterparts. At 26-weeks postpartum, only a history of mental health disorder remained significant (33.3% vs 10.8%, medium effect).
There were no significant differences between women who remained in the study and those who did not (Table 1).

Internal consistency reliability
The internal consistency reliability of the EPDS was α = .85 during pregnancy at both time-points (baseline and 36-weeks), and α = .89 and .86 following birth (6- and 26-weeks). The PHQ-2 demonstrated high mean inter-item correlations (MIC) both during pregnancy (MIC = .55 and .59) and postpartum (MIC = .60 and .51).
Table 2 presents the frequency and percentage (with 95% CI) of women with probable major depression and at least probable minor depression (EPDS) and positive screens (PHQ-2) at four time-points over pregnancy (baseline, 36-weeks) and postpartum (6-, 26-weeks). The incidence of probable major depression was 2.3 and 1.8% during pregnancy (EPDS ≥ 15) and 5.6 and 6.2% postpartum (EPDS ≥ 13). The incidence of at least probable minor depression was 4.0 and 4.4% during pregnancy (EPDS ≥ 13) and 13.1 and 14.1% postpartum (EPDS ≥ 10). Positive screens on the PHQ-2 were highest using the alternative dichotomous (yes/no) method (incidence ranged 32.9-42.9%) and lowest using a cut-point of ≥ 3 (incidence ranged 4.4-7.5%) (Table 2).
Figure 2 presents data in a visual format. Fig. 2a shows probable minor depression (EPDS) was relatively stable during pregnancy and increased following birth (6-weeks). Cochran's Q Test revealed the difference was significant (n = 207, Q = 34.82, df = 3, p < .001). Pairwise comparisons confirmed the difference was significant from 36-weeks of pregnancy to 6-weeks postpartum (p < .001). Similar results were seen for major depression (n = 207, Q = 11.05, df = 3, p = .01). Pairwise comparisons showed the difference was again seen between 36-weeks of pregnancy and 6-weeks postpartum (p = .01). No significant differences were observed for change in proportions of participants who screened positive on the PHQ-2, regardless of method.
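To show what the Cochran's Q results above represent, the sketch below computes the Q statistic from repeated binary outcomes and converts a Q value with 3 degrees of freedom (four time-points) into a p-value. The data are made up. The p-value helper uses the closed-form chi-square tail that holds only for df = 3, an illustrative shortcut rather than a general routine. Entering the reported Q = 11.05 gives p ≈ .011, in line with the reported p = .01.

```python
import math


def cochrans_q(rows: list[list[int]]) -> tuple[float, int]:
    """Cochran's Q for k related binary measurements.

    Each row is one participant; each column is one time-point (0 or 1).
    Returns the Q statistic and its degrees of freedom (k - 1).
    """
    k = len(rows[0])
    if any(len(r) != k for r in rows):
        raise ValueError("every participant needs a value at every time-point")
    column_totals = [sum(r[j] for r in rows) for j in range(k)]
    row_totals = [sum(r) for r in rows]
    grand_total = sum(row_totals)
    denominator = k * grand_total - sum(t * t for t in row_totals)
    if denominator == 0:
        raise ValueError("Q is undefined when no participant changes status")
    numerator = (k - 1) * (k * sum(c * c for c in column_totals) - grand_total ** 2)
    return numerator / denominator, k - 1


def chi2_sf_df3(x: float) -> float:
    """Upper-tail probability of a chi-square variable with exactly 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Made-up screening results for six participants at three time-points.
example = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0], [1, 0, 0], [1, 1, 0]]
print(cochrans_q(example))

# p-value implied by the Q statistic reported for major depression.
print(round(chi2_sf_df3(11.05), 3))
```

Complete data at every time-point are required for each participant, which is presumably why the reported tests are based on n = 207 rather than the 309 women who started the first survey.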

Postpartum probable major depression: PHQ-2 screening accuracy
Following birth (6-weeks, 26-weeks), the PHQ-2 cut-point of ≥ 3 correctly classified 79 and 36% of women with probable major depression but missed 21 and 64% of high-risk women. Only 3.4 and 2.3% of women under the threshold for probable major depression would have been asked to complete the EPDS. At a lowered cut-point of ≥ 2, 100% of women with probable major depression were correctly classified at baseline. While 86% of women with probable depression were correctly classified at 26-weeks, 14% of high-risk women were missed. At the lowered cut-point 15.1 and 10.8% of women with no probable major depression would have been asked to complete the EPDS. While the alternative method achieved the highest overall sensitivity following birth, correctly classifying 100% of women at 6-weeks and 93% of women at 26-weeks, some 29 and 30% of women with no probable depression would have been asked to complete the EPDS. Among those who screened positive on the PHQ-2, the probability of having major postpartum depression (EPDS ≥ 13) at 6-weeks was greatest using a cut-point ≥ 3 and lowest using the alternative categorical method (PPV = 58% vs 17%, respectively). Similar results were seen at 26-weeks (PPV = 50% vs 17%) (Table 4).
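The low predictive values above follow directly from the low prevalence of probable major depression. As a worked illustration, the sketch below applies Bayes' theorem to the rounded 6-week figures reported in this section (prevalence 5.6%, sensitivity and false-positive percentages as stated). Because the inputs are rounded, agreement with the reported PPVs is approximate. The function name is hypothetical.

```python
def ppv_from_prevalence(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Positive predictive value via Bayes' theorem."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


prevalence_6wk = 0.056  # probable major depression at 6-weeks (EPDS >= 13)

# Alternative dichotomous method: 100% sensitivity, about 29% false positives.
ppv_dichotomous = ppv_from_prevalence(prevalence_6wk, 1.00, 1 - 0.29)

# Scored method at cut-point >= 3: 79% sensitivity, about 3.4% false positives.
ppv_cut3 = ppv_from_prevalence(prevalence_6wk, 0.79, 1 - 0.034)

print(f"dichotomous: {ppv_dichotomous:.0%}, cut-point >= 3: {ppv_cut3:.0%}")
```

Both results round to the PPVs reported for 6-weeks (17% and 58%). This shows that the gap in predictive value is driven by the difference in false-positive rates at a fixed, low prevalence.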

Antepartum probable minor depression: PHQ-2 screening accuracy
Table 5 presents the screening accuracy of the PHQ-2 at 4 time-points for at least probable minor depression. During pregnancy (booking, 36-weeks) a PHQ-2 cut-point of ≥ 3 correctly classified 50% of women with probable minor depression at both time-points but 50% of at-risk women were missed. Some 2.7 and 5.3% of low-risk women would have been asked to complete the EPDS unnecessarily. A lowered PHQ-2 cut-point correctly classified 100% of women with probable minor depression at both booking and 36-weeks. However, 12.7 and 17.5% of low-risk women would have been asked to complete the EPDS. Using the alternative categorical approach (positive response to either Q1 or Q2), though 100% of women with probable minor depression were correctly identified, the EPDS would have been administered to 31.2 and 40.3% of women below the EPDS threshold for probable minor depression. Among those who screened positive on the PHQ-2, the probability of also screening positive on the EPDS for at least probable minor antepartum depression (EPDS ≥ 13) at booking was greatest using a cut-point ≥ 3 and lowest using the alternative method (PPV = 43% vs 12%, respectively). Similar results were seen at 36-weeks (PPV = 30% vs 10%) (Table 5).

Postpartum probable minor depression: PHQ-2 screening accuracy
Following birth (6-weeks, 26-weeks), the PHQ-2 cut-point of ≥ 3 correctly classified 39 and 19% of women with at least probable minor depression but missed 61 and 81% of at-risk women. Only 2.1 and 1.5% of women under the threshold for minor depression would have been asked to complete the EPDS. At a PHQ-2 lowered cut-point of ≥ 2, 73 and 59% of women with probable minor depression were correctly classified at baseline and 26-weeks respectively. Consequently, some 27 and 41% of at-risk women were missed. At the lowered cut-point, 8.9 and 6.1% of low-risk women would have been asked to complete the EPDS unnecessarily. While the alternative method achieved the highest overall sensitivity following birth, correctly classifying 82% of women at 6-weeks and 81% of women at 26-weeks, 19% of low-risk women would have been asked to complete the EPDS at both postpartum time-points. Among those who screened positive on the PHQ-2, the probability of also screening positive for at least probable minor postpartum depression (EPDS ≥ 10) at 6-weeks was greatest using a cut-point ≥ 3 and lowest using the alternative categorical method (PPV = 68% vs 33%, respectively). Similar results were seen at 26-weeks (PPV = 60% vs 34%) (Table 5).

Discussion
This study evaluated the screening accuracy of the PHQ-2 to inform the future use of the ICHOM core outcome set for pregnancy and childbirth in clinical practice and research. Two methods of case-identification were used: a scoring method with two cut-points (≥ 2 / ≥ 3) and an alternative dichotomous method. The EPDS was used as the reference standard at recommended cut-points [16,34,35] and demonstrated acceptable internal consistency reliability across all four time-points (α = .85-.89). In contrast the mean inter-item correlations of the PHQ-2 (MIC = .51-.60) were high [42], suggesting some possible overlap between anhedonia and depressive symptoms during the perinatal period. At the ICHOM recommended PHQ-2 cut-point of ≥ 3, screening accuracy to detect probable major depression was fair during pregnancy (AUC = .77), and poor to good in the postpartum period (AUC = .67-.88). While the recommended cut-point demonstrated low sensitivity during pregnancy (57-60%) and postpartum (36-79%) and missed an unacceptably high number of women with probable major depression, specificity remained high (94-98%), thus minimizing response burden, an identified priority for the ICHOM team [14]. Lowering the cut-point to ≥ 2 achieved the highest screening accuracy for probable major depression of the three methods, demonstrated by good to excellent AUC values at all time-points (AUC = .88-.93) and high sensitivity (86-100%). Further, specificity was moderate throughout (80-89%). In contrast, while an alternative dichotomous method achieved only fair to good diagnostic accuracy (AUC = .79-.86), it did demonstrate the highest overall sensitivity (100% during pregnancy and early postpartum and 93% late postpartum). Specificity was however lower than the other two methods (58-71%). The alternative method thus identified almost all at-risk women but at the cost of increased response burden.
For at least probable minor depression, screening accuracy was highest using the PHQ-2 cut-point of ≥ 2 (AUC = .76-.94) and lowest using the PHQ-2 cut-point of ≥ 3 (AUC = .58-.74). While sensitivity was lowest at a cut-point of ≥ 3 (19 to 50%) it did demonstrate the highest specificity (95-98%). In contrast, while the alternative dichotomous method had the highest sensitivity (81-100%), specificity was low (60-74%).
The ability to compare our findings was hindered by a lack of comparable perinatal studies. A systematic review which evaluated the accuracy of screening instruments for perinatal women [18] identified sparse evidence pertaining to the PHQ-2. Further, the reporting of two similar tools within the literature compounds the challenge of comparing findings as the Whooley Questions are often confused with, and referred to, as the PHQ-2 and vice versa [27,28]. The two tools are also referred to, or combined, as case-finding or case-identification methods [29,45]. As such we compared our findings to the literature pertaining to both the PHQ-2 and the Whooley Questions, where appropriate to do so.
Table 4 Screening accuracy of the PHQ-2* during pregnancy and postpartum using the EPDS as reference standard for probable major depression

Screening accuracy of case-identifying methods
The literature identifies two almost identical case-identifying methods: the PHQ-2 and the Whooley Questions. In terms of the PHQ-2, authors of a systematic review reported a pooled sensitivity of 0.76 and specificity of 0.87 at a cut-point of ≥ 3 in the general population [27]. Consistent with the current study, sensitivity improved at a cut-point of ≥ 2 (sensitivity = 0.91) and specificity decreased (70%). Findings of that review are however limited by largely mixed-gender samples, recruitment from primary/secondary settings and substantial heterogeneity between studies (I² = 81.8%). Only one included study by Smith and colleagues [46] reported findings for perinatal women. Similarly, a meta-analysis of the diagnostic accuracy of the Whooley Questions to identify major depression included ten studies with community samples [28] including two in peripartum women [45,47]. Consistent with the current study, Bosanquet [28] reported a pooled sensitivity of 95% (95% CI: 0.88-0.97), and pooled specificity of 65% (95% CI: 0.55-0.74) using the dichotomous response format. Maternity-specific evidence regarding diagnostic accuracy is sparse. With a focus on postpartum depression, though Mann and Gilbody [29] identified seven studies reporting either case-identification method (PHQ-2 and Whooley Questions), only one study met inclusion criteria using clinical diagnostic criteria. The included study by Gjerdingen et al. [47] reported diagnostic accuracy of both the PHQ-2 and Whooley Questions. As a screening tool for major postpartum depression, Gjerdingen reported the Whooley method to have 100% sensitivity, 62% specificity, 11% PPV and 100% NPV, comparing favorably to the findings of the current study for at least probable major depression at six weeks postpartum (sensitivity: 100%, specificity: 71%, PPV: 17%, NPV: 100%). Though Gjerdingen evaluated the PHQ-2, cut-points were not evaluated, preventing further comparison.
In terms of the Whooley Questions, recent work by Littlewood and colleagues [48] revealed the alternative method (positive response to either or both questions) to demonstrate acceptable sensitivity and specificity but low predictive value during pregnancy (20 weeks: sensitivity: 85%, specificity: 83.7%, PPV: 37.4) and postpartum (3-4 months: sensitivity: 85.7%, specificity: 80.6%, PPV: 31.4) periods, which is consistent with current findings. Current study findings for at least probable minor depression at similar time-points reveal similar results following birth, but greater sensitivity during pregnancy (baseline: sensitivity: 100%, specificity: 69%, PPV: 12) and postpartum (26 weeks: sensitivity: 81%, specificity: 74%, PPV: 34). The disparity in findings likely reflects the difference in reference standards used; the current study used the EPDS, while Littlewood used the Clinical Interview Schedule-Revised [49]. Further, the impact of question wording differences in terms of time format (last 2 weeks versus past month) is not yet known.
An important observation seen in the current study but not previously evaluated by others is the significant increase in anhedonia seen during pregnancy and significant decrease following birth which was not seen with the PHQ-2 depression question. It is possible that women may experience low mood as pregnancy progresses without a significant change in levels of depressive symptoms. The impact of anhedonia may then artificially inflate the number of women meeting the screening threshold. Further research to identify the impact of each item during the perinatal period is warranted.

EPDS as a reference standard
Consistent with the current study, the PHQ-2 has been evaluated as a modified dichotomous yes/no screening tool using the EPDS as the reference standard [37,38]. Using an EPDS cut-off score of ≥ 13 to denote probable major depression during pregnancy (15-weeks, 30-weeks) and postpartum (6-16 weeks), Bennett and colleagues [37] reported the 'modified version of the PHQ-2' to have high sensitivity (80-93%) and specificity (75-86%), with sensitivity being highest during pregnancy and lower postpartum. At the same EPDS cut-point (≥ 13) the current study showed a slightly higher sensitivity (93-100%) and lower specificity (60-71%). These differences may be attributed to sample characteristics and research methodology. Bennett's cross-sectional study recruited young, low-income, less educated women with higher rates of depressive symptoms compared to the current study. Of significance, the modified version of the PHQ-2 that Bennett reported evaluating was in fact the Whooley Questions (During the past month …), further demonstrating the inconsistency and potential confusion in instrument reporting.
Consistent with the current study, Chae et al. [38] evaluated the PHQ-2 using the dichotomous method against the EPDS as the reference standard using a cut-point ≥ 13 in a sample of women (n = 200) attending well-child clinics in the USA at 6 weeks and 6 months postpartum. Chae reported 100% sensitivity and 79% specificity, which compares well to the 93-100% sensitivity and 70-71% specificity in the current study. More recently, Howard et al. [50] compared the Whooley Questions and the EPDS (cut-point ≥ 13) to clinical diagnostic criteria in a cross-sectional study with 545 women attending their first booking appointment. These authors reported low sensitivity for the Whooley Questions (41%) and EPDS (59%), with a concomitant high specificity (94-95%). While few comparable studies exist for the Whooley Questions, sensitivity of the EPDS is lower than usually reported and might be explained by the large sample size, diverse sample of women and delays in administering the diagnostic interview.

Clinical implications
The ICHOM-recommended cut-point of ≥ 3 on the PHQ-2 to screen for probable major depression is consistent with recommendations by the scale developer [15], despite being based primarily on a validation study with patients in primary and secondary care. Our findings demonstrate that at this cut-point an unacceptably high number of 'at-risk' women were missed and would not have received the follow-up EPDS. While a cut-point of ≥ 2 provided the highest accuracy in terms of area under the curve, and optimal sensitivity and specificity during pregnancy, screening accuracy was lower following birth. In clinical practice, where the identification of high-risk women is crucial, we would argue that the alternative dichotomous method missed the fewest at-risk women and is the most appropriate method of case-identification during the perinatal period. Further, this method maintained the highest sensitivity even with a lowered EPDS cut-point to denote at least probable minor depression. This finding indicates that, compared with the scoring method, the alternative dichotomous method would better identify women with lower levels of depressive symptoms and so better inform clinical decision-making. While a higher false positive rate is noted with this method, we argue that where universal screening using the EPDS is already practiced, this method could reduce response burden for many women.
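The contrast between the two case-identification methods can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code used in the study: the function names, the eight toy records and the rule that an item score of 1 or more counts as a 'yes' are all assumptions made for the example, and the EPDS cut-point of ≥ 13 is the postpartum threshold for probable major depression described above.

```python
from typing import Sequence, Tuple


def positive_scored(item1: int, item2: int, cut_point: int = 3) -> bool:
    """Scored method: sum the two PHQ-2 items (each 0-3) and apply a cut-point."""
    return item1 + item2 >= cut_point


def positive_dichotomous(item1: int, item2: int) -> bool:
    """Dichotomous method: a positive response to either question.

    Assumption for this sketch only: an item score of 1 or more is a 'yes'.
    """
    return item1 >= 1 or item2 >= 1


def sensitivity_specificity(
    screen: Sequence[bool], reference: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) of screen flags against reference flags."""
    tp = sum(s and r for s, r in zip(screen, reference))
    fn = sum((not s) and r for s, r in zip(screen, reference))
    tn = sum((not s) and (not r) for s, r in zip(screen, reference))
    fp = sum(s and (not r) for s, r in zip(screen, reference))
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy records (PHQ-2 item 1, PHQ-2 item 2, EPDS total); not study data.
records = [
    (0, 0, 3), (1, 0, 6), (1, 1, 14), (2, 1, 16),
    (0, 1, 9), (3, 2, 20), (0, 0, 5), (1, 1, 8),
]

# Reference standard: EPDS >= 13 (postpartum probable major depression).
reference = [epds >= 13 for _, _, epds in records]

scored = [positive_scored(a, b, cut_point=3) for a, b, _ in records]
dichotomous = [positive_dichotomous(a, b) for a, b, _ in records]

scored_result = sensitivity_specificity(scored, reference)
dichotomous_result = sensitivity_specificity(dichotomous, reference)

print("scored >= 3 :", scored_result)
print("dichotomous :", dichotomous_result)
```

In this toy sample the scored method misses the woman with item scores of 1 and 1 and an EPDS of 14, whereas the dichotomous method flags her at the cost of more false positives, which mirrors the direction of the trade-off reported above.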
In Australia, government-funded projects are working towards a standardized approach to outcome reporting in maternity-related practice and research to improve data synthesis and outcomes for women and their babies. Our findings offer important evidence and recommendations to support standardization.

Strengths and limitations
Our findings have some limitations. Our study included a cohort of women from one Australian birthing facility and identified a low prevalence of women with probable major depression during pregnancy (around 2%) and postpartum (around 6%), which reduces the precision of the accuracy estimates. While our prevalence rates are slightly lower than those reported by Gaynes et al. [51] during pregnancy (3.1-4.9%), our findings are comparable to postpartum rates (1.0-5.9%). Recruiting larger, more diverse samples would improve the precision of the prevalence estimate and the generalizability of findings. We used the EPDS as the reference standard rather than a clinical diagnosis of depression, which would generally be considered the gold standard. However, the aim here was to evaluate the PHQ-2 against the EPDS as would occur in real-life clinical practice using the ICHOM core outcome set, and not to diagnose depression. Our findings were, however, strengthened by the comprehensive nature of our analysis, which included using two case-identification methods over four timepoints. Further, we applied widely accepted EPDS cut-points to denote both probable major depression and at least probable minor depression.
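The effect of low prevalence on precision can be illustrated with a short calculation. The sketch below is illustrative only and is not part of the study's analysis: it assumes all 309 women contributed data at a single time-point, uses a hypothetical count of screen-positive cases, and uses the Wilson score interval as one common choice of confidence interval for a proportion.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 for roughly 95% coverage)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


# Around 2% prevalence in 309 women corresponds to roughly 6 reference-positive cases
# (assumption: no missing data at this time-point).
n_cases = round(309 * 0.02)

# Hypothetical: the screen detects 5 of those cases.
low, high = wilson_interval(5, n_cases)
print(f"{n_cases} cases, sensitivity 5/{n_cases}: interval {low:.2f} to {high:.2f}")

# The same proportion with ten times as many cases gives a much narrower interval.
low_large, high_large = wilson_interval(50, 60)
print(f"60 cases, sensitivity 50/60: interval {low_large:.2f} to {high_large:.2f}")
```

With so few reference-positive cases, a sensitivity estimate is compatible with a wide range of true values, which is why larger samples would sharpen the estimates.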

Conclusions
ICHOM recommends the PHQ-2 to screen women using a scoring method at a defined cut-point. The recommended method has been shown to be inadequate as a probable perinatal depression case-identification method using the EPDS as the reference standard. To optimise the number of women identified as 'at-risk' in clinical practice, we recommend a dichotomous two-item case-identification method, which is consistent with recommendations of international guidelines [7] and has been shown to be acceptable to women [48]. Further, to address the current confusion surrounding the use and reporting of the case-identification method [27,28], we recommend the use of the Whooley Questions rather than the PHQ-2. The use and reporting of consistent question wording and response format will improve future outcome reporting and synthesis.
Additional file 2. Edinburgh Postnatal Depression Scale: Questions and screening accuracy.