Measuring physical and mental health during pregnancy and postpartum in an Australian childbearing population - validation of the PROMIS Global Short Form

Background Health related quality of life is a critical concept during the perinatal period but remains under-researched. The International Consortium for Health Outcomes Measurement has included the Patient Reported Outcomes Measurement Information System (PROMIS®) Global Short Form (GSF) in its core outcome set for pregnancy and childbirth to measure health related quality of life. The PROMIS GSF has not been fully evaluated as a valid and reliable instrument in this population. This study assessed the psychometric properties of the PROMIS GSF during pregnancy and the postpartum period. Methods PROMIS GSF was administered to a sample of 309 pregnant women at four time-points during pregnancy (≤ 27 and 36-weeks) and postpartum (6- and 26-weeks). The structural validity, internal consistency reliability, construct validity, and responsiveness of the PROMIS GSF were evaluated. The internal structure of the PROMIS GSF was explored using Rasch Measurement Theory. Response format, item fit, differential item functioning (item bias), dimensionality of the scale and its targeting were assessed. Results Two revised subscales (Mental Health: four items; and Physical Health: five items) showed good fit to the Rasch model. The revised mental health subscale demonstrated good internal consistency reliability during pregnancy and the postpartum period (α = .88 and .87, respectively). The internal consistency reliability of the physical health subscale was adequate (α = .76 and .75, respectively). The revised mental health subscale was sensitive to group differences according to a history of mental health disorder, income, smoking status, drug use, stress levels and planned versus unplanned pregnancy. Differences in scores on the revised physical subscale were detected for groups based on obesity, income, drug use, smoking status, stress, and history of mental health disorders. Scores on both subscales recorded significant changes across the four time-points, spanning pregnancy and the postpartum period. Conclusions The revised version of the PROMIS GSF was better able to measure mental and physical health during pregnancy and the postpartum period compared to the original version. Findings support the clinical and research application of the PROMIS GSF within the International Consortium for Health Outcomes Measurement Standard Set of Outcome Measures for Pregnancy and Childbirth. Ongoing psychometric analysis of the PROMIS GSF is recommended in other maternity populations.


Background
Health related quality of life (HRQoL) is well researched in areas of disease and chronic health conditions but remains under-researched in women's perinatal health. While HRQoL is not generally well defined within the literature, it refers to a multi-dimensional concept used to examine the impact of health status on quality of life [1]. Adapting a definition by O'Connor [2], HRQoL for maternity populations can be defined as 'a multidimensional concept referring to a woman's perception of the influence of her pregnancy, birth and postpartum condition, her care provision and any intervention and treatment on her physical, mental, emotional, and social functioning'. HRQoL is acknowledged as a critical concept in the childbearing period [3] that goes beyond the traditional broad metrics of morbidity, mortality and life expectancy. While several socio-demographic, physical and psychological factors are known to influence the quality of life of pregnant women [4], the maternity model of care women receive during this time may also have a significant impact on their future HRQoL.
Around the world women are experiencing fragmented maternity care [5], and unnecessary medical procedures and birth interventions [6]. Cesarean birth rates are at an all-time high, with a continued upward trend for cesarean birth in all but a few OECD (Organization for Economic Cooperation and Development) countries. In countries with the highest rates of cesarean birth (for example Turkey, Mexico and Chile), around half of all babies are born this way [7]. In Australia, where the current study was conducted, one in three women experienced surgical birth in 2016 [8]. A growing number of women report obstetric violence: bullying, abuse, disrespect and coercion at the hands of their care givers [9][10][11]. Birth fear and birth related trauma are increasingly reported in the literature [12,13].
While such events during pregnancy, birth and the postpartum period may adversely affect HRQoL, little is known about their impact. This may reflect the challenges of measuring such a construct during a dynamic period in women's lives. Furthermore, a lack of validated tools for use during this unique time, and a wide variation in outcomes measured, compound the issue. A systematic review of quality of life measures utilized in pregnant and postpartum mothers revealed not only a significant variation in both outcomes and outcome measures used, but a limited number of woman-reported outcome measures that had been formally developed and validated with maternity populations [14]. The most widely-used scales were the SF-36 [15] and SF-12 [16] from the Medical Outcomes Study (MOS), generic instruments that measure HRQoL in diverse patient populations. Most included studies focused on the psychometric properties of the measurement tool alone, and all failed to explore whether the measures were missing relevant items and constructs. The extent to which quality of life items had similar meaning to pregnant and postpartum women compared to non-pregnant women has not been investigated previously. Such limitations can contribute to poor clinical practice or policy change decisions based on inaccurate findings.
To address the heterogeneity of measures and outcomes often used in research, there is a growing consensus advocating the standardization of outcomes and outcome measurement using core outcome sets to improve the synthesis of research findings and minimize research wastage [14,17]. In 2016, the International Consortium for Health Outcomes Measurement (ICHOM) published a core outcome set for use during pregnancy and childbirth [18]. This minimum standard set of internationally appropriate outcome measures is recommended for collection during pregnancy and the postpartum period to assist health providers to evaluate and improve the value of care provision [19]. Acknowledging the value of measuring HRQoL, ICHOM included the Patient Reported Outcomes Measurement Information System (PROMIS®) Global Short Form (PROMIS GSF).
Using global health items from an item bank developed by the National Institutes of Health, Hays and colleagues [20] validated the 10-item PROMIS GSF using classic test theory and a sample drawn from the 2000 United States (US) census data. Classic test theory has some weaknesses compared to contemporary techniques, particularly in the approach to data management and analysis [21]. In classic test theory the approach is to describe a full data set, whereas techniques such as Rasch Measurement Theory obtain data that fits the model, thereby overcoming the limitations of classic test theory and enhancing the precision of outcome measures [22]. Modern test theory, including Rasch models, can improve on the classical approach when validating woman-reported outcome measures [23]. The sample used by Hays et al. [20] included both clinical and community samples of women (52%) and men (48%) with a mean age of 53 years (range 18-100). ICHOM working party members recognized the limited use of some of the included measures in maternity populations as a limitation [19]. A HRQoL instrument must be valid and have high reliability and responsiveness in the population being studied [24]. To facilitate the universal adoption of the ICHOM Standard Set into all trials evaluating maternity care, there is an urgent need to evaluate the tool with maternity populations. To date, only one paper has reported using the PROMIS GSF with women during pregnancy [25]. The findings of this validation study are limited, however, by the culturally diverse background of the 161 pregnant women in the United States (Hispanic (42%), non-Hispanic women of colour (37%), non-Hispanic white (14%), and multiracial/other (7%)), limited evaluation of structural validity, and no evaluation during the postpartum period [25].
To address these limitations and facilitate ongoing and global comparison of outcomes, the current study sought to conduct a Rasch-based psychometric evaluation of the PROMIS GSF to measure HRQoL in one Australian childbearing sample using a standardized approach.

Aims
The MoMeNT Study (Models Meeting Needs Over Time) is a longitudinal cohort study with two primary aims: (1) to assess the feasibility of the ICHOM Standard Set of Outcome Measures for Pregnancy and Childbirth in one Australian population, and (2) to report the perinatal outcomes for women accessing all models of maternity care in one facility to facilitate benchmarking and comparison of maternity models of care.

Methods
A prospective, observational cohort study was conducted.

Setting
The study was conducted in one large tertiary referral hospital in a metropolitan area of Queensland, Australia. The study site provides two broad models of maternity care: caseload midwifery care and non-caseload care. Caseload midwifery care refers to one primary midwife who is responsible for the care of a caseload of approximately 40 women per year, and provides holistic, relational continuity of care for each woman during pregnancy, labour, birth and for up to six-weeks postpartum with back-up from a midwife partner. Caseload midwives can provide pregnancy care in the home, satellite clinics and the hospital.
Labour and birth care is provided in the hospital setting. Non-caseload care is an umbrella term encompassing midwifery, general practitioner, obstetric care or a combination. Women receiving non-caseload care receive no continuity of care. Labour and birth care is generally provided by carers not usually known by the woman. Postpartum care by a midwife is limited (usually two visits).

Participants, sample size, recruitment and attrition
Pregnant women were eligible to participate if they were 27-weeks gestation or less (at recruitment), aged 18 years or more, and able to complete an online survey in English. Women with an existing serious mental illness under the care of a psychiatrist were excluded. Recruitment occurred between August 2017 and March 2018.
Sample size was based on evaluating the broad effect of model of maternity care on maternal health and wellbeing. G*Power (3.0.10) sample size calculator was used. To identify a mean difference (two-tailed) with a 50% effect size, 5% estimated error and 95% power (1 - β err prob), 210 participants were required. To allow for 20% attrition, 252 participants were needed (126 in each group). Of the 1275 women screened in the public caseload midwifery and non-caseload models, 528 (41.4%) were eligible to participate. Reasons for ineligibility were: greater gestation date (n = 723), unable to communicate in English (n = 20), age less than 18 years (n = 2), and mental health disorder under the care of a psychiatrist (n = 2). Eligibility was care model dependent: 25.8% of women in non-caseload and 85.7% of women in caseload were eligible to participate. Of those eligible, 354 women (67%) were invited to participate. In non-caseload care 20 women were missed. In the caseload midwifery model 108 women (40.1%) were not initially invited by their primary midwife. Feedback from midwives suggested barriers to recruitment included having no tablet device or consent forms available in the home setting. Of those invited, 336 (94.9%) completed consent forms. Reasons for declining were: too busy (n = 8), intended to move during pregnancy (n = 2), or no reason given (n = 8). The baseline survey was completed by 309 (92.0%) women.
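The sample size arithmetic above can be approximated with a short sketch. It assumes that the "50% effect size" denotes a standardized mean difference of d = 0.5 and that the attrition allowance was applied by multiplying the required total by 1.2; neither is stated explicitly in the text. It uses a normal approximation with a small-sample correction rather than G*Power's exact noncentral t computation, and the function names are illustrative only.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.95) -> int:
    """Approximate per-group n for a two-tailed comparison of two independent means.

    Normal approximation plus the correction term z_alpha^2 / 4; this is not
    G*Power's exact algorithm, only a close approximation of it.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    n = 2 * (z_alpha + z_beta) ** 2 / effect_size ** 2 + z_alpha ** 2 / 4
    return math.ceil(n)


def inflate_for_attrition(total: int, attrition_pct: int) -> int:
    """Inflate a total sample size by a percentage allowance (assumed rule)."""
    return math.ceil(total * (100 + attrition_pct) / 100)


per_group = n_per_group(0.5)             # assumed d = 0.5
total = 2 * per_group
recruit = inflate_for_attrition(total, 20)
print(per_group, total, recruit)
```

Under these assumptions the sketch gives 105 women per group, 210 in total, and 252 after the attrition allowance, matching the figures reported above.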
Participants birthed between November 2017 and September 2018. Of the 309 women who completed the baseline survey (Time 1), nine did not continue (as outlined in the participant flow diagram - Fig. 1). Three women withdrew consent, two experienced late miscarriage, and four moved away. Eight women were ineligible to complete the 36-week survey due to birthing prematurely/prior to 36-weeks, six birthed in another hospital, and two were missed. These women remained in the study and received subsequent surveys.

Measures
The ICHOM core outcome set was administered in full according to the ICHOM Pregnancy and Childbirth Data Collection Reference Guide [18]. Each online survey included the PROMIS GSF. The baseline survey obtained socio-demographic, medical, obstetric and psychosocial details including participant age, gestation, relationship status, education, income and country of birth (See Additional file 1 for details of all variables used in analysis).

Procedure
Eligible women attending their first antepartum booking or follow-up visit were approached by either the first author or by their primary midwife and informed of the study aims and requirements. Consenting women provided written consent to participate. Women accessing care by a midwife were provided with a tablet device and completed the baseline survey during the first midwife visit. In a pilot of ten women, time and resource barriers in the antenatal clinic prevented tablet completion for women accessing non-caseload care. These women were sent a survey link by email and text message. Women who failed to respond to the survey within 2 to 3 days were sent a friendly reminder by text message and email. After a second reminder, telephone follow-up was made offering assistance to complete the survey by telephone. All surveys administered at all subsequent time-points were sent to all women using the same text and email protocol. All women were contacted by phone prior to survey administration to gain ongoing consent and discuss survey requirements. Non-responding women were deemed lost to follow up if they failed to complete two consecutive surveys. Surveys were completed during pregnancy: 27-weeks or less (Time 1), 36-weeks (Time 2), birth week (Time 3), 6-weeks postpartum (Time 4) and 26-weeks postpartum (Time 5). This paper reports on HRQoL data obtained at T1, T2, T4 and T5. All participants who commenced the PROMIS GSF at each time point, completed all questions.

Conceptual framework for instrument evaluation
Standards developed by COSMIN (COnsensus-based Standards for the selection of health Measurement INstruments) were used to guide the psychometric evaluation of the PROMIS GSF. Firstly, standardized terminology and definitions of measurement properties are reported [27]. Next, standards aimed at evaluating the methodological quality of studies on measurement properties [28] are used to guide the psychometric evaluation. Lastly, standards for criteria for good measurement properties are applied to the results to assist interpretation [29].

Data analysis
Participant sociodemographic characteristics were summarized using descriptive statistics. Frequency and percentages are reported for categorical variables and means and standard deviations or non-parametric alternatives (median and interquartile range) are reported for continuous variables. Normality assumptions were assessed visually, and the Kolmogorov-Smirnov test value with Lilliefors Significance Correction was considered. To establish representativeness of the sample, National and State perinatal data statistics compiled by the Australian Institute of Health and Welfare [30][31][32][33] and Queensland Health [34] are presented.

Structural validity and internal consistency reliability
Rasch analysis, using the partial credit model of RUMM2030 [35], evaluated the structural validity of the PROMIS GSF in the sample of women during both pregnancy (baseline) and the postpartum period (26-weeks). The total 10-item version of the PROMIS GSF, and the alternate 4-item subscales (global mental health and global physical health) proposed by Hays et al. [20] were evaluated. The procedures adopted were consistent with those recommended by Pallant and Tennant [36] and involve a detailed assessment of the response format, item fit, differential item functioning (item bias), dimensionality of the scale and its targeting. Overall fit of the scale is indicated by a fit residual standard deviation value of 1.4 or less, and a non-significant chi-square statistic. The fit of individual items is assessed, with an individual fit residual value above +2.5 indicating misfit to the model. The response format is evaluated by inspecting the thresholds and ensuring that none are disordered. The internal consistency reliability of the scale is assessed using a person separation index (PSI) and Cronbach alpha coefficient (α), with values exceeding 0.70 considered acceptable [37]; however, higher values (e.g. above .85) are desired for tools used for clinical diagnosis. Differential item functioning (DIF), which occurs when groups within the sample respond in a different manner to an individual item, despite equal levels of the underlying characteristic [36], is assessed in RUMM2030 both statistically (using Analysis of Variance) and graphically. DIF was evaluated for age group (18-27 years, 28-32 years, 33+ years), education level (primary or secondary school versus post-secondary training) and parity (primiparous versus multiparous).
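For readers less familiar with the reliability coefficient referred to above, the following is a minimal sketch of how Cronbach's alpha is computed from raw item responses. The response matrix is invented for illustration and is not study data; the PSI, which is based on Rasch person estimates, is not reproduced here.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # one tuple of scores per item
    item_variances = [variance(col) for col in items]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses from six women to a 4-item subscale (scores 1-5).
example = [
    [4, 4, 5, 4],
    [3, 3, 3, 2],
    [5, 4, 5, 5],
    [2, 2, 3, 2],
    [4, 3, 4, 4],
    [3, 3, 2, 3],
]
alpha = cronbach_alpha(example)
print(round(alpha, 2), "acceptable" if alpha > 0.70 else "below threshold")
```

The comparison against 0.70 mirrors the acceptability criterion cited above [37].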
It is important that scales providing a total score are unidimensional, and this is assessed in RUMM2030 using a two-step process. Firstly, principal components analysis is conducted on the residual correlation matrix to identify subsets of items with high positive or negative loadings on the first unrotated factor. Rasch derived scores are calculated for each of these subsets separately and are compared for each respondent using a series of t-tests. A scale is considered unidimensional if fewer than 5% of tests are significant, or if the lower bound of the confidence interval is lower than 5%. The residual correlation matrix is also used to assess local dependency among items, indicated by items showing intercorrelations exceeding .2 above the average interitem correlation value.
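The second step of the unidimensionality assessment can be sketched as follows. This is an illustration under stated assumptions rather than the RUMM2030 implementation: person estimates and standard errors for the two item subsets are taken as given, each comparison uses a critical value of 1.96, and the confidence interval for the proportion of significant tests uses a normal approximation. All data are hypothetical.

```python
import math


def unidimensionality_check(est_a, se_a, est_b, se_b, crit=1.96):
    """Compare per-person Rasch estimates from two item subsets.

    Returns (proportion of significant t-tests, lower 95% bound, passes).
    The normal-approximation interval is an assumption of this sketch.
    """
    n = len(est_a)
    significant = 0
    for a, sa, b, sb in zip(est_a, se_a, est_b, se_b):
        t = (a - b) / math.sqrt(sa ** 2 + sb ** 2)
        if abs(t) > crit:
            significant += 1
    p = significant / n
    lower = max(0.0, p - 1.96 * math.sqrt(p * (1 - p) / n))
    passes = p < 0.05 or lower < 0.05
    return p, lower, passes


# Hypothetical estimates (logits) for 100 respondents; 6 differ markedly.
subset_a = [0.0] * 100
subset_b = [3.0] * 6 + [0.1] * 94
errors = [0.5] * 100
prop, lower_bound, ok = unidimensionality_check(subset_a, errors, subset_b, errors)
print(prop, round(lower_bound, 3), ok)
```

In this invented example 6% of tests are significant, but the lower bound of the interval falls below 5%, so the scale would still be treated as unidimensional under the criterion described above.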

Construct validity
Correlations between the two revised GSF subscales derived from Rasch analysis (Mental Health-Pregnancy Postpartum: MH-PP, and Physical Health-Pregnancy Postpartum: PH-PP) were calculated using Spearman correlation coefficients. The strength of the relationship was interpreted using Cohen's guidelines [38] (r = .10 to .29 for a small effect, r = .30 to .49 for a medium effect, and r = .50 to 1.0 for a large effect).
Independent samples t-tests were conducted to compare mean scores for mental health and physical health on several grouping variables known to influence health [39][40][41]. These were: age, parity, education, income, country of birth, obesity, history of mental health disorder, gestation, planned pregnancy, smoker in the past 12 months, prepregnancy illicit drug use, stress and previous cesarean birth. Regarding mental health, it was hypothesized that there would be no group differences in terms of age, parity, education, country of birth, obesity, gestation, or previous cesarean birth. It was further hypothesized that women who had low income, a history of mental health disorder, were pre-pregnancy drug users, smoked cigarettes in the past 12 months, had experienced stress in the past 12 months, or whose pregnancy was unplanned, would experience significantly worse mental health compared to their group counterparts.
For physical health, it was hypothesized that there would be no group differences in terms of age, parity, country of birth, history of mental health disorder, gestation, planned pregnancy, stress or previous cesarean birth. Based on the key determinants of health [40] it was further hypothesized that women who were obese, smoked cigarettes, reported pre-pregnancy illicit drug use, low income, or had low education attainment, would experience significantly worse physical health compared to their group counterparts. Homogeneity was assessed using Levene's test for equality of variance. Effect size is presented as Cohen's d and interpreted as: .2 = small effect, .5 = medium effect, and .8 = large effect [38].
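As an illustration of the effect size used for these group comparisons, the sketch below computes Cohen's d with a pooled standard deviation and applies the thresholds cited above. The two score lists are invented, and the "negligible" label for values below .2 is an assumption added for completeness.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups using the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def interpret_d(d):
    """Label |d| using the .2 / .5 / .8 guidelines; 'negligible' is an added label."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"


# Hypothetical subscale scores for two groups of women.
higher_income = [3.0, 3.5, 4.0, 4.5, 5.0]
lower_income = [2.5, 3.0, 3.5, 4.0, 4.5]
d = cohens_d(higher_income, lower_income)
print(round(d, 2), interpret_d(d))
```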

Responsiveness
A one-way repeated-measures analysis of variance (ANOVA) assessed change in mental and physical health scores (responsiveness) over four time-points. Post hoc tests, with pairwise comparisons between pairs of time-points, were performed for statistically significant results. Significance was set at 5% (p ≤ .05). Effect size is presented as partial eta squared and interpreted as: .01 = small, .06 = moderate, and .14 = large effect [38]. Physical and mental health trajectories may differ based on baseline symptom presence and severity [42]. While limited evidence exists on the trajectory of physical health during the peripartum period, poorer physical health is associated with poorer mental health [43]. Pregnancy is associated with poorer physical health in terms of back and pelvic pain, fatigue [42] and incontinence [44]; we therefore expected to see a corresponding deterioration in mental health during pregnancy with subsequent improvement by 26-weeks postpartum.

Sample characteristics
The socio-demographic characteristics of the sample (N = 309) are presented in Table 1. At recruitment most participants were in their second trimester of pregnancy (mean = 19.7 weeks, SD = 3.7, range = 10-27 weeks). Women's ages ranged from 19 to 43 years with a mean age of 29.8 years (SD = 5.0). Most women were either married or in a de facto relationship (94%). The sample were well educated, with almost 90% of women (n = 275) having completed high school (Year 12), a Diploma or Degree. Almost three quarters of women were in paid work (n = 227) and one in ten were studying (n = 30). Seventy three percent of women were born in Australia. Of those born outside of Australia, the most common places of birth were New Zealand (n = 30), United Kingdom (n = 16) and Asia (n = 14). Less than 10% of women came from the Americas (n = 8), Brazil (n = 6), South Africa (n = 5), Europe (n = 4) or Tahiti (n = 1). Similarly, one in ten women spoke a language other than English at home (n = 31). Of the 90% of women who responded, over half were in the medium highest and highest income groups. While the cohort was similar to National and State populations in terms of age, parity, body mass index (BMI) and country of birth, participants were more likely to be in a relationship.

Rasch analysis of the PROMIS mental and physical health

Mental health subscale
Rasch analysis of the original four items of the Mental Health subscale, item 2 (quality of life), item 4 (mental health), item 5 (social discretionary), and item 10 (emotional problems), administered in pregnancy at baseline indicated some degree of misfit among items (see Table 2: Analysis 1), with a fit residual standard deviation value of 2.35, exceeding the recommended value of 1.4. Item 10 (emotional problems) recorded an individual Fit Residual value of 2.75, suggesting misfit to the Rasch model. Removal of this item improved overall fit and resulted in a 3-item scale (see Table 2: Analysis 2) with no evidence of misfitting items, no differential item functioning by age, education or parity, and no evidence of local dependency. The scale met the requirements for unidimensionality, with the lower bound of the confidence interval around the percentage of cases with significantly different subtest scores (2.8%) falling below the 5% criterion (see Table 2: Analysis 2). The internal consistency reliability was adequate, with a PSI value of .77 and Cronbach's alpha value of .84.
An additional Rasch analysis assessed the suitability of including item 9 (social roles) in the revised 3-item Mental Health subscale. This alternative subscale showed good fit to the Rasch model (see Table 2: Analysis 3). This 4-item solution showed a substantial improvement in the internal consistency reliability, with an increase in the PSI value to .84, and Cronbach's α value to .88.
To assess suitability of the revised four-item version of the Mental Health subscale (items 2, 4, 5, 9) for use with postpartum women, Rasch analysis was conducted on the responses obtained from respondents at 26-weeks post birth. The subscale showed good fit to the Rasch model, with no misfitting items, no differential item functioning for age, education or parity, and no evidence of local dependency or multidimensionality (see Table 2: Analysis 6). The internal consistency reliability of the subscale was good (PSI = .84). The Person-Item Threshold Distribution map is shown in Fig. 2, suggesting appropriate targeting of the items to this sample of women, with no evidence of floor or ceiling effects.

Physical health subscale
Rasch analysis of the original four items of the Physical Health subscale, item 3 (physical health), item 6 (physical function), item 7 (pain) and item 8 (fatigue), administered to pregnant women at baseline showed adequate fit to the Rasch model (see Table 2: Analysis 4); however, the internal consistency was quite low (PSI = .64, Cronbach α = .69). Assessment of an alternative version of the Physical Health subscale which included item 1 (general health) showed good fit to the model, with a substantial improvement in the PSI value (from .64 to .74) and Cronbach's α value (from .69 to .76). The final 5-item version of the Physical Health subscale showed good fit to the Rasch model, no evidence of misfitting items, no differential item functioning for age, education or parity and no local dependency (see Table 2: Analysis 5). The subscale showed no evidence of multidimensionality, with the percentage of persons with significantly different subtest scores not exceeding the criterion of 5% (see Table 2: Analysis 5). The Person-Item Threshold Distribution map, presented in Fig. 3, also supports the appropriate targeting of the level of physical health in this cohort of pregnant women. No floor or ceiling effects were detected.
Rasch analysis of the revised 5-item version of the Physical Health subscale was also conducted on data obtained at 26-weeks post birth. There were no misfitting items, no differential item functioning for age, education or parity, and no local dependency. Analyses supported its unidimensionality (see Table 2: Analysis 7).
Based on these results, two revised subscales are proposed for use during pregnancy and the postpartum period (see Table 3). The proposed physical health subscale (PH-PP) includes five items: item 1 (general health), item 3 (physical health), item 6 (physical function), item 7 (pain), and item 8 (fatigue). The proposed mental health subscale (MH-PP) includes four items: item 2 (quality of life), item 4 (mental health), item 5 (social discretionary), and item 9 (social roles). Total scores for these scales were calculated by summing item scores (on a 1-5 range) and dividing by the number of items in the scale. Higher scores indicated better physical and mental health.
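The scoring rule described above can be expressed as a short sketch. It assumes each response has already been recoded to the 1-5 range with higher values indicating better health (for example, after the pain item has been reverse scored and collapsed to five categories); the dictionary layout and function name are illustrative, not part of the PROMIS scoring materials.

```python
# Item numbers of the revised subscales, as listed in the text.
MH_PP_ITEMS = (2, 4, 5, 9)
PH_PP_ITEMS = (1, 3, 6, 7, 8)


def subscale_score(responses, items):
    """Mean of the recoded item scores (1-5) for one respondent.

    `responses` maps item number -> recoded score; recoding is assumed
    to have been done beforehand.
    """
    values = [responses[i] for i in items]
    if any(not 1 <= v <= 5 for v in values):
        raise ValueError("item scores must be recoded to the 1-5 range")
    return sum(values) / len(values)


# Hypothetical recoded responses to the ten PROMIS GSF items.
woman = {1: 4, 2: 5, 3: 3, 4: 4, 5: 4, 6: 5, 7: 3, 8: 2, 9: 3, 10: 4}
mh_pp = subscale_score(woman, MH_PP_ITEMS)
ph_pp = subscale_score(woman, PH_PP_ITEMS)
print(mh_pp, ph_pp)
```

Because the sum is divided by the number of items, both subscale scores stay on the same 1-5 metric despite their different lengths.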
Spearman correlation coefficients showed a strong correlation between the MH-PP and PH-PP subscales during pregnancy at baseline (rho = .64) and at 26-weeks postpartum (rho = .67), indicating that they shared 41.0 and 44.9% of variance respectively.
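The shared-variance figures follow from squaring each correlation coefficient, as the brief sketch below shows. The strength labels follow the guidelines cited in the Methods [38]; the function names and the "negligible" label for values below .10 are illustrative assumptions.

```python
def shared_variance_pct(rho):
    """Percentage of variance shared by two measures, from their correlation."""
    return round(rho ** 2 * 100, 1)


def correlation_strength(rho):
    """Label |rho| as small (.10-.29), medium (.30-.49) or large (.50-1.0)."""
    size = abs(rho)
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "medium"
    if size >= 0.10:
        return "small"
    return "negligible"


for label, rho in (("baseline", 0.64), ("26-weeks postpartum", 0.67)):
    print(label, shared_variance_pct(rho), correlation_strength(rho))
```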

Trajectory of mental and physical health during pregnancy and postpartum
Descriptive statistics for the revised mental health (MH-PP) and physical health (PH-PP) subscales obtained over four time-points during pregnancy and postpartum are presented in Table 4. A series of one-way repeated-measures ANOVAs assessed change in subscale scores across these time-points. Figure 4 shows that mental health scores on the MH-PP were highest at baseline, reduced during pregnancy (36-weeks) and early postpartum (6-weeks) before returning to values seen in late pregnancy at 26-weeks postpartum, suggesting a deterioration of mental health from that seen at baseline at all time points. The effect was significant for time (Wilks' Lambda = .94, F (3, 218) = 5.03, p = .002). The partial eta squared value of .07 indicates a moderate effect size. Four paired samples t-tests were used to make post hoc comparisons between time points. There was a significant difference between MH-PP scores from baseline to 36-weeks (p = .03), baseline to early postpartum (p = .003) and baseline to 26-weeks postpartum (p = .03). There were no statistically significant differences between any other time points.

Construct validity of the revised mental health subscale
A series of independent sample t-tests compared MH-PP scores obtained during pregnancy (baseline) and postpartum (26-weeks) for several groups (see Table 5). Women who reported low income, unplanned pregnancy, pre-pregnancy drug use or stress in the previous 12 months had significantly lower mental health scores during pregnancy. While the effect was large or very large during pregnancy, no difference was seen in the postpartum period. For women who reported smoking in the previous 12 months or a history of mental health disorder, mental health scores were lower than their group counterparts, both during pregnancy and the postpartum period, with very large effect sizes. The largest effect was for women with a history of mental health disorder during pregnancy. No significant group differences were seen in terms of age, parity, education, country of birth, obesity, gestation at baseline, or previous cesarean birth.
Table note: Pain scored from 0 to 10: 0 = No pain, 10 = Worst pain imaginable and subsequently reverse scored and recoded to 5 responses.

Construct validity of the revised physical health subscale
A series of independent sample t-tests compared PH-PP scores obtained during pregnancy (baseline) and postpartum period (26-weeks) for several groups (see Table 6). Women who smoked, were on low income, were obese or who had a history of mental health disorder reported poorer physical health scores during both pregnancy and postpartum period. Women with low education attainment reported poorer physical health in pregnancy, but not during the postpartum period. Women who reported pre-pregnancy drug use had poorer antepartum physical health, but this was not seen in the postpartum period. Those women who experienced unplanned pregnancy or stress in the last 12 months reported poorer physical health, but only during pregnancy. Effect sizes for these comparisons were largest for the impact of low income in the postpartum period (Cohen's d = 0.51) and history of mental health disorder during pregnancy (Cohen's d = 0.49). No significant group differences in physical health were seen in terms of age, country of birth, gestation at baseline, and previous cesarean birth.

Discussion
This study addressed calls from the International Consortium for Health Outcomes Measurement's (ICHOM) working party [19] to validate and refine the instruments included in their Standard Set of Outcome Measures for Pregnancy and Childbirth. Rasch analysis of the PROMIS Mental Health and Physical Health subscales was used to assess the suitability of subscale items and response format, and to detect any potential item bias, local dependency, multi-dimensionality and problems with targeting. For the Mental Health subscale an optimal solution, showing good psychometric properties, was obtained by removing item 10 (emotional problems) and adding item 9 (social roles). This revised four-item subscale (MH-PP) recorded good internal consistency reliability, with no evidence of problems with the response format, item bias, local dependency, or multi-dimensionality. The original 4-item version of the Physical Health subscale was found to have poor internal consistency reliability, which was improved by the addition of item 1 (general health) to form a revised 5-item version of the scale (PH-PP) with adequate psychometric properties. The revised MH-PP was found to be sensitive to differences in groups according to history of mental health disorder, income, smoking and drug use, stress levels and planned versus unplanned pregnancy. Differences in scores on the revised PH-PP were detected for groups based on obesity, income, pre-pregnancy drug use, smoking status, stress, and history of mental health disorders. Scores on both subscales recorded significant changes across the four time-points, spanning pregnancy and postpartum.

Building on the contribution of others
This study extended research on the psychometric properties of the PROMIS by using modern test theory [22], in the form of Rasch analysis, to evaluate the properties of the scale in a sample of women during both pregnancy and the postpartum period. This differed from previous studies, such as Hays et al. [20], who used classical test theory approaches (exploratory and confirmatory factor analysis) and samples containing a diverse range of participants. Classical test theory analyses raw scores to test assumptions underlying a given measure. Items are thus summed without weighting or standardization to produce a score [45]. Rasch methodology, on the other hand, is a sophisticated and robust method based on a mathematical item response model that affords several advantages over classical test theory, the main difference being in the management and analysis of data [21]. In Rasch modelling, the probability of a specified response is modelled as a logistic function of the difference between the person and item parameters. Item locations are subsequently scaled (item calibration) and person locations are measured on the same scale. Each item and person estimate has an associated standard error of measurement. Rasch methodology thus enables the transformation of ordinal summed scores into linear measurements. Rasch methodology prioritizes the Rasch model and revisits hypotheses in the event of ill-fitting data [22]. In this study, Rasch methodology was used to move beyond traditional psychometric methods and rigorously evaluate the PROMIS GSF. Further, our study findings build on those of Lundsberg et al. [25], who investigated the use of PROMIS in a sample of women in their first trimester (M = 9 weeks, SD = 4.6) seeking pregnancy testing or services for termination of pregnancy in antepartum clinics in New Haven, USA. The sample was culturally diverse. Almost three-quarters of all pregnancies were unplanned (72.7%). Almost 40% of participants planned termination, adoption, or were unsure of plans and were therefore likely to be experiencing considerable stress.
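The logistic relationship between person and item parameters described above can be sketched for the simplest (dichotomous) case. This is an illustrative simplification only: items with more than two response categories require a polytomous extension of the model, which is not shown, and the function name `rasch_probability` and the example locations are hypothetical.

```python
import math


def rasch_probability(person_location, item_location):
    """Probability of endorsing a dichotomous item under the Rasch model.

    Both arguments are on the same logit scale; the probability depends
    only on their difference.
    """
    return 1.0 / (1.0 + math.exp(-(person_location - item_location)))


# Hypothetical locations in logits (illustration only).
item = 0.0
for person in (-2.0, -1.0, 0.0, 1.0, 2.0):
    p = rasch_probability(person, item)
    print(f"person = {person:+.1f} logits, item = {item:+.1f} logits, P = {p:.3f}")
```

When the person and item locations coincide, the probability is 0.5; it rises towards 1 as the person location exceeds the item location. Placing persons and items on a common scale in this way is what allows targeting to be assessed.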

Scale modifications for maternity populations
In the current study, modifications to the original PROMIS subscales, and to the alternative subscale structure recommended by Hays et al. [20], were required to ensure they were appropriate for use with women during pregnancy and the postpartum period. It was necessary to remove item 10 (emotional problems) from the Mental Health subscale and add item 9 (satisfaction with social roles).

Findings related to reliability and validity
Preliminary assessment of the construct validity of the MH-PP and PH-PP revealed significant differences in MH-PP scores for history of mental health disorder, stress, income, smoking, drug use and planned pregnancy. The PH-PP successfully distinguished groups based on health-related factors such as obesity, income, smoking and drug use, mental health disorder, stress levels and educational attainment. Differences were also detected for parity and planned pregnancy. These findings provide preliminary support for the construct validity of the revised PROMIS scales for use in pregnancy and the postpartum period. Further support for the revised subscales comes from the application of criteria for good measurement properties, as outlined by Prinsen and colleagues [29]. Structural validity using Rasch analysis demonstrated no violations of unidimensionality, local independence or monotonicity. Similarly, with regard to construct validity and responsiveness, at least 75% of results were in accordance with the hypotheses. All reported measurement properties of the two revised subscales, in terms of structural validity, internal consistency, hypothesis testing for construct validity and responsiveness, met the highest rating required of a good measurement property, indicating confidence in the psychometric ability of the revised scales.

Strengths and weaknesses
This study has two major strengths. Firstly, the comprehensive psychometric evaluation, including Rasch analysis, fully explored all aspects of the PROMIS Global Short Form, informing the development of two revised subscales to measure mental and physical health in a maternity population. Rasch analysis enhanced the findings of Hays et al. [20], and the item calibrations for measuring health related quality of life in terms of mental and physical health were not sample dependent. Secondly, to address the limitations of heterogeneity in terminology and definitions of measurement properties, consensus-based standards guided the study design and psychometric analysis of the PROMIS GSF. Standards developed by COSMIN were used to guide the psychometric evaluation of the PROMIS GSF, and a consensus-agreed taxonomy was used [27]. Employing these standards contributes to a transparent and standardized approach to the psychometric evaluation and will support the evidence-based inclusion of the PROMIS GSF within the ICHOM core outcome set.
This study was conducted with 309 women from one birthing facility in Australia. According to the broad aims, sample size was not calculated to measure a specified difference in health-related quality of life. While few studies using Rasch analysis conduct a priori sample size and power determination [46], inadequate sample size can lead to inaccurate results. While several researchers have studied the effect of sample size on power and effect size using simulation and Rasch techniques [47,48], little consensus exists. The sample size in this study (n = 309) exceeded the size recommended by Linacre 1994 [49] to ensure accurate estimation (99%) of person estimates within +/− logits.
Our comparisons with National and State maternity populations showed that the sample was similar to the general maternity population in several ways. However, some group differences were identified. Participants were more likely to be in a relationship and more likely to be in higher income groups compared to Australian National and State averages, outcomes known to positively impact a woman's health and wellbeing and subsequent health related quality of life [40]. It is possible that participants experienced better physical and mental health outcomes compared to other Australian samples. Replication of this study using larger numbers in diverse maternity populations is recommended.

Research implications
Our analysis revealed two revised scales to measure mental and physical health in women during pregnancy and the postpartum period. Findings using the original scales during pregnancy show lower reliability for the physical health subscale compared to that of the mental health subscale (α = .69 vs .83). These findings are mirrored by Lundsberg et al. [25]. Content validity specific to a maternity population was not explored in the current study. Future researchers may wish to consider a thorough evaluation of content validity in any future development of a population-specific woman-reported outcome instrument aimed at measuring HRQoL in maternity populations. The revised physical and mental health scores could be used to evaluate the effect of different models of maternity care provision or other lifestyle interventions on maternal physical and mental health. The evaluation of the impact of maternity care provision and outcomes related to care on women's HRQoL should be included in future evaluations.
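For readers less familiar with the internal consistency coefficients (α) quoted above, the following is a minimal sketch of Cronbach's alpha computed from item-level responses. The function name `cronbach_alpha` and the response data are hypothetical, and the sketch assumes complete data with one row per respondent; it is not the study's analysis code.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondent rows, one score per item.

    Assumptions: every row has the same number of items, there are no
    missing values, and total scores are not all identical.
    """
    k = len(responses[0])                      # number of items
    item_columns = list(zip(*responses))       # transpose to per-item scores
    sum_item_variances = sum(variance(col) for col in item_columns)
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)


# Hypothetical responses from five respondents to a four-item subscale.
responses = [
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [4, 4, 3, 4],
]

print(f"alpha = {cronbach_alpha(responses):.2f}")
```

Alpha increases as the items covary more strongly relative to their individual variances.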

Clinical implications
This study addresses the call from ICHOM for researchers to validate the instruments included in their Standard Set of Outcome Measures for Pregnancy and Childbirth [19]. Inclusion of this valid and reliable measure will make a positive contribution to the core outcome set and will facilitate comparison and benchmarking of one component of the ICHOM Standard Set. The current climate of contemporary maternity care demands that care meets the needs of women and their babies, yet little is known about the impact of maternity care provision, or pregnancy and birth experiences and outcomes on maternal HRQoL. Results of this study will inform future research into the impact of maternity care provision on HRQoL using valid and reliable tools.

Conclusion
This comprehensive psychometric analysis, which included Rasch analysis, demonstrated that a revised version of the PROMIS Global Short Form was better able to measure mental and physical health in a pregnant and postpartum population compared to the original generic instrument. While further evaluation of the revised scale is needed on other maternity populations, findings support the clinical and research application of the revised PROMIS GSF within the ICHOM Standard Set for Pregnancy and Childbirth.