The Kaiser Permanente Northern California research program on genes, environment, and health (RPGEH) pregnancy cohort: study design, methodology and baseline characteristics

Background Exposures during the prenatal period may have lasting effects on maternal and child health outcomes. To better understand the effects of the in utero environment on children’s short- and long-term health, large representative pregnancy cohorts with comprehensive information on a broad range of environmental influences (including biological and behavioral) and the ability to link to prenatal, child and maternal health outcomes are needed. The Research Program on Genes, Environment and Health (RPGEH) pregnancy cohort at Kaiser Permanente Northern California (KPNC) was established to create a resource for conducting research to better understand factors influencing women’s and children’s health. Recruitment is integrated into routine clinical prenatal care at KPNC, an integrated health care delivery system. We detail the study design, data collection, and methodologies for establishing this cohort. We also describe the baseline characteristics and the cohort’s representativeness of the underlying pregnant population in KPNC.
Methods While recruitment is ongoing, as of October 2014, the RPGEH pregnancy cohort included 16,977 pregnancies (53 % from racial and ethnic minorities). RPGEH pregnancy cohort participants consented to have blood samples obtained in the first trimester (mean gestational age 9.1 weeks ± 4.2 SD) and second trimester (mean gestational age 18.1 weeks ± 5.5 SD) to be stored for future use. Women were invited to complete a questionnaire on health history and lifestyle. Information on women’s clinical and health assessments before, during and after pregnancy and on women’s and children’s health outcomes is available in the health system’s electronic health records, which also allow long-term follow-up.
Discussion This large, racially and ethnically diverse cohort of pregnancies with prenatal biospecimens and clinical data is a valuable resource for future studies on in utero environmental exposures and maternal and child perinatal and long-term health outcomes. The baseline characteristics of the RPGEH Pregnancy Cohort demonstrate that it is highly representative of the underlying population living in the broader community in Northern California.
Electronic supplementary material The online version of this article (doi:10.1186/s12884-016-1150-2) contains supplementary material, which is available to authorized users.


Background
Exposures before and during pregnancy contribute to the immediate and future health outcomes of both women and their children. Emerging evidence supports the notion that the prenatal period is a critical developmental window during which in utero exposures may have lasting effects on a child's future health [1,2]. Biological programming [3] occurs during fetal life in response to in utero exposure to nutrient substrates, hormones, growth factors, cytokines, environmental conditions or toxins, and other exposures. Evidence also shows that women who develop pregnancy complications are at increased risk of developing chronic diseases later in life [4-7]. However, the mechanisms underlying many of these findings remain unclear, and further research is needed to advance our understanding of how the in utero environment impacts the short- and long-term health of both the woman and her child.
Large studies with multiple measurements of biomarkers during pregnancy are needed to better measure perinatal exposures and to understand the etiologically relevant period of the effects of exposures on perinatal outcomes. To fully understand how the in utero environment influences the short- and long-term health of women and their children, large representative study populations with comprehensive information on a broad range of factors, including biomarkers, medical conditions, medications, nutrition, physical activity and environmental exposures, are needed.
The Kaiser Permanente Northern California (KPNC) Research Program on Genes, Environment, and Health (RPGEH) has established a large pregnancy cohort that integrates biospecimens with rich and accurate clinical and health data available from the electronic health record (EHR), creating a unique resource available to advance research on women's and children's health. The establishment of this pregnancy cohort within an integrated health care delivery system with an EHR has the additional advantage of enabling accurate assessment of short- and long-term maternal and child health outcomes and the rapid translation of clinically meaningful findings into clinical practice. This report describes the design and methods used to establish this pregnancy cohort and its biorepository in KPNC. We present preliminary data on the baseline characteristics of the cohort to demonstrate its racial-ethnic diversity and the prevalence of several perinatal complications of interest, as well as its representativeness with regard to the underlying population of pregnancies at KPNC. We further discuss possible uses of this large cohort, including the ability to efficiently follow it prospectively through the EHR to answer pressing questions regarding women's and children's health.

Methods/Design
Aim
The aim of this project is to establish a large pregnancy cohort that integrates biospecimens with rich and accurate clinical and health data to create a resource to advance scientific research on women's and children's health. The pregnancy cohort can be linked to short- and long-term maternal and child health outcomes to facilitate the rapid translation of clinically meaningful findings into clinical practice.

Design
The KPNC Division of Research started the Research Program on Genes, Environment and Health (RPGEH) in 2007 to develop a genetic epidemiology population resource which integrates data from multiple sources from consenting KPNC adult members, including biospecimens, clinical data from the EHR, lifestyle and risk factor data from surveys, and environmental exposure data from both laboratory and geographic information systems. One component of the RPGEH is the RPGEH Pregnancy Cohort.

Establishment of RPGEH pregnancy cohort
The Division of Research worked closely with KPNC clinical partners to develop facility-based recruitment procedures and laboratory blood processing workflows that could be easily integrated as part of routine prenatal medical care. The entire recruitment process was designed to become an integrated and routine part of the clinical prenatal intake process. To avoid disruption of clinical workflows, all RPGEH program-related processes (e.g., questions from patients, and follow-up) are handled by research staff. The recruitment and biospecimen collection protocol processes are described below.

Study setting
KPNC provides integrated health care to over 3.6 million members through 7,000 physicians, more than 240 medical office buildings and 22 hospitals. The KPNC service area spans 14 counties of the greater Bay Area, as well as the California Central Valley from Sacramento to Fresno, and includes urban and rural areas. The membership is highly representative of the demographic characteristics of the entire population of this geographic area [8] and is racially and socio-economically diverse. KPNC is vertically integrated such that all care is provided in a closed system and documented in an EHR. The EHR data are clinical records, not claims data, and thus are robust with regard to data quality and completeness. The membership of reproductive-aged women includes women with KP commercial insurance (varying copays, varying deductible levels), Medi-Cal, and other California state subsidized programs. Within KPNC, there are 16 delivery hospitals and approximately 38,000 pregnancies each year.

Recruitment of participants
Recruitment for the RPGEH Pregnancy Cohort began in February 2010 at the KPNC Walnut Creek outpatient medical facility. Recruitment gradually expanded to cover almost the entire KPNC service area. Figure 1 shows the geographical locations of RPGEH pregnancy cohort members in an area of over 28,000 square miles, an area slightly larger than South Carolina. Clinical staff in the Obstetrics and Gynecology department, such as medical assistants and nurses, routinely give an RPGEH pregnancy cohort flyer with frequently asked questions and a consent form to women at the initial prenatal visit. They also briefly describe the RPGEH Pregnancy Cohort and ask women if they would like to participate. If the woman agrees to participate and signs the consent form, the clinic staff places the research blood draw order in the woman's EHR.

Biospecimen collection and storage process
Women who consent have blood drawn for research purposes into one 8.5 mL serum separator tube (SST) and one 6.0 mL ethylenediaminetetraacetic acid (EDTA) tube at the same time as the clinically ordered blood tests at their local KPNC laboratory at two times during pregnancy: in the first trimester, during a standard first trimester panel or genetic screening (~10-13 weeks, 6 days), and in the second trimester, along with either standard genetic screening (~15-20 weeks) or the gestational diabetes screening (~24-28 weeks). The blood tubes are couriered as part of the normal KPNC laboratory system to the Regional Laboratory, where they are transferred to the RPGEH Biorepository (see description below) for further processing.

The RPGEH Research Biorepository
The RPGEH Biorepository is a state-of-the-art research biorepository staffed with research laboratory personnel who are responsible for maintaining the laboratory space; checking in, processing and storing samples; and retrieving aliquots for studies. Equipment includes an ABF 500 automated blood fractionation robot unit, an RTS A4 temperature- and humidity-controlled robotic ambient storage unit for archiving DNA using Biomatrica DNA stable storage medium, and a walk-in −80 °C freezer. A custom-developed Laboratory Information Management System (LIMS) tracks specimens at each step and is linkable to RPGEH operations and clinical information databases.
Once at the Biorepository, serum from the SST is aliquoted into four 0.8 mL cryovials. The EDTA tube is centrifuged and plasma is aliquoted into two 0.8 mL cryovials, while 1.0 mL of buffy coat is aspirated and placed in a cryovial. All cryovials are stored at −80 °C.

Clinical data
Information on participants in the Pregnancy Cohort is obtained from several sources of rich clinical data, described below.

Information obtained from the EHR during the first prenatal visit
As part of routine prenatal care, all pregnant women complete a prenatal questionnaire during the first trimester or shortly after the pregnancy is clinically confirmed. This questionnaire includes questions on parity, gravidity, prior delivery and birth history, reproductive history, menstrual history, prior medical history, social circumstances (e.g., stress, domestic violence, etc.), and an Adult Outcomes Questionnaire (AOQ) which includes the PHQ-9 [9, 10] depression screener and the Generalized Anxiety Disorder scale (GAD-2) [11] as well as functioning items. The information from the Prenatal Questionnaire is recorded in the KPNC EHR, providing access to extensive health and reproductive history on the cohort. Several other sources of pre-pregnancy information are available in the EHR, including pre-pregnancy body mass index (BMI) if a woman had been a KPNC member prior to conception.

Early start substance use data
In addition to the Prenatal Questionnaire, a self-administered Early Start Program Prenatal Substance Use Screening Questionnaire is completed at entry into prenatal care. Early Start is an integrated prenatal program to intervene when a pregnant woman reports alcohol, tobacco and other drug use during pregnancy [12]. The questionnaire asks about substance use before pregnancy and since pregnancy began, including alcohol, smoking, and prescription drug use.

Clinical data available in KPNC EHR
KPNC maintains complete databases that capture all encounters including hospitalizations, outpatient visits, radiology/imaging, laboratory tests, and prescription medications and combines these data for presentation to clinicians as part of the EHR. Data captured in these databases include inpatient and outpatient diagnostic information, imaging reports, laboratory tests and results, pharmacy dispenses including dosages and days of supply, and surgery outcomes, among others. All vital signs including weight and height, blood pressure and physical activity are recorded in the EHR. As noted above, these data are clinical information maintained in an EHR and are not claims data, and they enable the detailed examination of diagnoses and treatments before, during and after pregnancy. In addition, when an infant is born, he/she is issued a unique medical record number (MRN) that is used for all care associated with the infant. This MRN is linkable to the mother's unique MRN, which allows identification of the mother-infant pair. This linkage allows us to examine infant growth and outcomes at birth and during childhood, along with other health outcomes, including the mother's outcomes.

RPGEH pregnancy cohort questionnaire
To obtain more detailed information not captured in the EHR, each participant is invited to complete the RPGEH Pregnancy Cohort questionnaire. The RPGEH questionnaire ascertains information about a variety of sociodemographic, lifestyle and environmental factors not routinely captured in the EHR, including diet, physical activity, multivitamin use and self-reported health history before and during the study pregnancy (Additional file 1).

Environmental exposure data
Over 98 % of the RPGEH pregnancy cohort has been successfully geocoded and can be linked to contextual or environmental data, including spatiotemporal data that exist in public access databases. These data come from commercial sources, non-profit agencies, and local, regional, state and national government agencies. Data from these various sources are being incorporated into a KPNC geographic information system (GIS) database using ArcGIS software (Redlands, CA). The database will include data on retail food outlets, green space, infrastructure (roads, educational facilities, health delivery centers, and public assistance facilities), traffic density, air pollution, pesticide use, toxic sites, toxic release inventories, and other factors. Other relevant information, currently located at other agencies but available for linkage, includes water quality, centers of social congregation (e.g., religious or spiritual institutions, senior centers, youth activity centers, etc.), and crime data. California has some of the most complete publicly available geospatial data across these environmental factors anywhere in the world.
Below we describe the sources used for determining the clinical outcomes of the RPGEH Pregnancy Cohort participants and non-participants for this preliminary report.

Clinical outcomes during pregnancy/in utero exposure to maternal metabolism

Women's body mass index and gestational weight gain
Through the EHR we are able to capture a woman's body mass index prior to pregnancy as well as her gestational weight gain trajectory and total gestational weight gain. This allows us to assess possible determinants of gestational weight gain, as well as to define the sequelae for child health of in utero exposure to maternal obesity and excessive gestational weight gain (i.e., overnutrition) or inadequate gestational weight gain (i.e., undernutrition), with weight gain classified in relation to the current Institute of Medicine guidelines [13].
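As a rough illustration of how total gestational weight gain could be classified against such guidelines, the sketch below assumes that reference [13] corresponds to the 2009 Institute of Medicine recommendations for singleton pregnancies (total gain in pounds by pre-pregnancy BMI category). The function names and the three output labels are illustrative only and are not taken from the RPGEH protocol.

```python
# Assumed 2009 IOM total weight gain ranges (lb) for singleton pregnancies.
# Each entry: (upper BMI bound, exclusive; (minimum gain, maximum gain)).
IOM_2009_RANGES_LB = [
    (18.5, (28, 40)),          # underweight: BMI < 18.5
    (25.0, (25, 35)),          # normal weight: 18.5 <= BMI < 25
    (30.0, (15, 25)),          # overweight: 25 <= BMI < 30
    (float("inf"), (11, 20)),  # obese: BMI >= 30
]


def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def classify_gwg(prepregnancy_bmi, total_gain_lb):
    """Return 'inadequate', 'adequate' or 'excessive' total weight gain."""
    for upper_bmi, (low, high) in IOM_2009_RANGES_LB:
        if prepregnancy_bmi < upper_bmi:
            if total_gain_lb < low:
                return "inadequate"
            if total_gain_lb > high:
                return "excessive"
            return "adequate"
    raise ValueError("BMI must be a finite number")


# Hypothetical woman: 65 kg and 1.65 m before pregnancy, gained 30 lb.
print(classify_gwg(bmi(65, 1.65), 30))
```

A trajectory-based analysis would need trimester-specific rates rather than total gain; the sketch covers only the total-gain comparison.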

Pregestational diabetes and gestational diabetes mellitus (GDM) and impaired glucose tolerance
Pregestational diabetes status is obtained from the KPNC Diabetes Registry [14] and GDM status is obtained from the KPNC pregnancy glucose tolerance and GDM Registry [15]. These registries allow GDM to be identified objectively, based on laboratory glucose values meeting the Carpenter and Coustan diagnostic criteria [16].
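For readers unfamiliar with the Carpenter and Coustan criteria, the sketch below shows the usual form of the rule: on a 100-g, 3-hour oral glucose tolerance test, GDM is diagnosed when at least two plasma glucose values meet or exceed the thresholds of 95 (fasting), 180 (1 h), 155 (2 h) and 140 (3 h) mg/dL. The dictionary keys and function name are hypothetical, and the registry's actual algorithm (for example, how the preceding screening test or missing values are handled) may differ.

```python
# Carpenter and Coustan thresholds for the 100-g, 3-hour OGTT (mg/dL).
CARPENTER_COUSTAN_MG_DL = {"fasting": 95, "1h": 180, "2h": 155, "3h": 140}


def meets_carpenter_coustan(ogtt_mg_dl):
    """Return True if two or more OGTT values meet or exceed their threshold.

    `ogtt_mg_dl` maps time points ('fasting', '1h', '2h', '3h') to plasma
    glucose in mg/dL; missing time points are simply not counted.
    """
    abnormal = 0
    for time_point, cutoff in CARPENTER_COUSTAN_MG_DL.items():
        value = ogtt_mg_dl.get(time_point)
        if value is not None and value >= cutoff:
            abnormal += 1
    return abnormal >= 2


# Hypothetical test result with elevated fasting and 1-hour values.
print(meets_carpenter_coustan({"fasting": 96, "1h": 185, "2h": 150, "3h": 130}))
```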

Preeclampsia/Hypertensive disorder of pregnancy
Preeclampsia and hypertensive disorders of pregnancy were also obtained from the EHR and were defined according to the following ICD-9 codes: pre-existing hypertension 642.0-642.2, gestational hypertension 642.3, preeclampsia or eclampsia 642.4-642.7. The validity of these ICD-9 codes to diagnose hypertensive disorders of pregnancy has previously been reported [17].
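The code ranges above translate directly into a lookup on the fourth digit of the ICD-9 code. The following is a minimal sketch of that mapping; the function name and the assumption that codes arrive as dotted strings (for example "642.33") are illustrative, and the sketch does not address how multiple codes within one pregnancy are prioritized.

```python
def classify_hypertensive_code(icd9_code):
    """Map an ICD-9 642.x code to the category used in the text, else None."""
    code = icd9_code.strip()
    if not code.startswith("642.") or len(code) < 5:
        return None
    fourth_digit = code[4]
    if fourth_digit in "012":
        return "pre-existing hypertension"      # 642.0-642.2
    if fourth_digit == "3":
        return "gestational hypertension"       # 642.3
    if fourth_digit in "4567":
        return "preeclampsia or eclampsia"      # 642.4-642.7
    return None                                  # other 642.x codes are not mapped


# Hypothetical encounter codes.
for code in ["642.33", "642.51", "250.00"]:
    print(code, "->", classify_hypertensive_code(code))
```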

Clinical outcomes at birth

Preterm birth
Gestational age is based on the estimated date of delivery recorded in the EHR, which is determined by the woman's self-reported last menstrual period (LMP), or by first trimester ultrasound if different from the LMP-based calculation by more than 1 week. Preterm birth was defined as birth at <37 weeks' gestation. We also examined the degree of preterm birth using the following definitions: extreme preterm (<28 weeks' completed gestation), severe preterm (28-31 weeks' completed gestation), moderate preterm (32-33 weeks' completed gestation) and late preterm (34-36 weeks' completed gestation) [18].
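The dating rule and the preterm categories can be sketched as follows. The sketch assumes that the LMP-based estimated date of delivery is LMP plus 280 days and that "more than 1 week" means a difference of more than 7 days; both are assumptions, as are the function names and the "term or later" label.

```python
from datetime import date, timedelta


def estimated_due_date(lmp, first_trimester_ultrasound_edd=None):
    """Use the LMP-based date unless the ultrasound date differs by > 7 days."""
    lmp_edd = lmp + timedelta(days=280)  # assumed LMP-based calculation
    if first_trimester_ultrasound_edd is not None:
        if abs((first_trimester_ultrasound_edd - lmp_edd).days) > 7:
            return first_trimester_ultrasound_edd
    return lmp_edd


def completed_weeks_at_birth(edd, birth_date):
    """Completed weeks of gestation at birth, counted back from the due date."""
    gestational_days = 280 - (edd - birth_date).days
    return gestational_days // 7


def classify_preterm(completed_weeks):
    """Apply the preterm categories defined in the text."""
    if completed_weeks < 28:
        return "extreme preterm"
    if completed_weeks < 32:
        return "severe preterm"
    if completed_weeks < 34:
        return "moderate preterm"
    if completed_weeks < 37:
        return "late preterm"
    return "term or later"


# Hypothetical pregnancy: LMP 1 January 2014, birth 1 September 2014.
edd = estimated_due_date(date(2014, 1, 1))
weeks = completed_weeks_at_birth(edd, date(2014, 9, 1))
print(edd, weeks, classify_preterm(weeks))
```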

Infant size for gestational age
Infant birthweight was obtained from the EHR. Large for gestational age was defined as birthweight >90th percentile and small for gestational age was defined as birthweight <10th percentile for the underlying KPNC population's race-ethnicity and gestational age-specific birthweight distribution [19].
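A minimal sketch of this classification is shown below. It assumes that, for each race-ethnicity and gestational-age stratum, a list of reference birthweights is available from which the 10th and 90th percentiles can be computed; the reference values in the example are arbitrary placeholders, not KPNC data, and the function names are hypothetical.

```python
from statistics import quantiles


def percentile_cutoffs(reference_weights_g):
    """Return the (10th, 90th) percentile birthweights for one stratum."""
    deciles = quantiles(reference_weights_g, n=10)  # nine cut points
    return deciles[0], deciles[-1]


def classify_size_for_gestational_age(birthweight_g, p10_g, p90_g):
    """Classify birthweight as SGA (<10th), LGA (>90th) or AGA (otherwise)."""
    if birthweight_g < p10_g:
        return "SGA"
    if birthweight_g > p90_g:
        return "LGA"
    return "AGA"


# Placeholder reference distribution for a single stratum (grams).
reference = list(range(2500, 4501, 100))
p10, p90 = percentile_cutoffs(reference)
print(classify_size_for_gestational_age(3500, p10, p90))
```

In practice the cutoffs would be looked up by stratum, so each infant is compared only with infants of the same race-ethnicity and gestational age.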

Cesarean delivery
Cesarean delivery information was obtained from the KPNC neonatal and infant cohort [20] and is defined according to ICD-9 codes 654.2x for delivery mode recorded in the EHR.

Recruitment to date and prevalence of outcomes of interest
Between February 2010 and October 2014, pregnant members of KPNC aged 18 or older who initiated prenatal care at a KPNC medical facility participating in the pregnancy cohort were invited to participate in the RPGEH pregnancy cohort. Among the 93,409 pregnancies occurring at medical facilities participating in the RPGEH pregnancy cohort during this initial recruitment period, 16,977 RPGEH pregnancy cohort consent forms were received, which represents a participation rate of 18.2 %. Compared to non-participants, women who participated in the pregnancy cohort were similar in age, but were more likely to have initiated prenatal care in the first trimester and to be non-Hispanic white and were slightly less likely to be Asian (Table 1). RPGEH pregnancy cohort participants were KPNC members for an average of 10 years before their pregnancy (Table 1). Among the 16,977 RPGEH pregnancy cohort participants who delivered a liveborn infant at the time of this writing, 93.2 % had a first trimester blood draw (mean gestational age: 9.1 weeks ± 4.2 SD), 80.5 % had a second trimester blood draw (mean gestational age: 18.1 weeks ± 5.5 SD) and 77.6 % had blood drawn in both trimesters.
Of the 93,409 pregnancies initially identified, 80,086 (86 %) delivered an infant in Kaiser Permanente Northern California. The remaining pregnancies ended in pregnancy loss (5.7 %), involved women who no longer had Kaiser medical coverage (4.6 %), or were delivered outside of Kaiser (4.0 %). Among the deliveries in Kaiser Permanente Northern California, the prevalence of preterm birth (<37 weeks), cesarean delivery, small for gestational age, large for gestational age, macrosomia, preeclampsia, GDM and NICU admissions was similar between RPGEH pregnancy cohort participants and non-participants (none of these outcomes differed by more than 1.2 %; see Table 2).
Participants were slightly more likely to be screened for GDM (95.6 % versus 93.1 %). Overall, participants and non-participants were very similar in their behavioral risk factors assessed on the Early Start Questionnaire at the first prenatal visit (Table 3). Participants and non-participants did not differ in terms of smoking during the 12 months before pregnancy or during pregnancy. However, participants were slightly more likely to report drinking alcohol both before pregnancy (Table 3).
The use of RPGEH Pregnancy Cohort specimens and data is governed by the guiding principles of use and access established by the RPGEH. These principles are to: 1) promote good science for the benefit of the public; 2) protect participant confidentiality and privacy, honor commitments made to participants and act within the scope of their consent, and preserve the trust that KPNC members have in KPNC; 3) comply with applicable legal and regulatory requirements; 4) consider whether the Resource is the best or only resource to address proposed research questions; 5) conserve limited materials or resources, such as biospecimens that can be exhausted or that are rare or of higher value because of the data associated with them, for high-value research; and 6) ensure that an investigator at the KPNC Division of Research (DOR) is involved in the research question and the conduct of the study to ensure the right and appropriate use of the resources.
Applications for use of RPGEH Pregnancy Cohort samples and data are submitted to and reviewed by the RPGEH Access Review Committee (ARC). The ARC meets three times a year to review applications for use of RPGEH data and specimens. The ARC includes DOR investigators, plus external stakeholders and investigators to address specific content and methodological issues as required by the projects under consideration. The ARC governs access to and use of all RPGEH data and specimens by requestors. Applications for access are subject to three phases of review, and the ARC's decisions are based on a formalized set of criteria that can be reviewed.

Statistical analyses, power and sample size considerations
Based on our expected cohort size of 25,000 women we computed power for hypothetical case-control studies. We assumed all available cases will be included and controls will be sampled at a ratio of 5:1. We computed the minimum detectable odds ratio (OR) for a two-sided test at level 0.05 and 80 % power for several outcomes with different prevalences. For the outcome of small for gestational age (prevalence 9.3 %) a case-control study will be powered to detect an OR of 1.15. For the outcome of gestational hypertension (prevalence 4.1 %) a case-control study will be powered to detect an OR of 1.22.
For the rare outcome of very low birthweight (prevalence 0.7 %) a case-control study will be powered to detect an OR of 1.57.
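The following sketch shows one common way such a minimum detectable OR could be computed: a normal-approximation power formula for comparing exposure proportions between cases and controls, combined with a bisection search over the OR. It is not necessarily the method used for the figures above. In particular, the exposure prevalence among controls is not stated in the text, so the value of 0.25 used here is purely an assumption, as is the reading of 5:1 as five controls per case; the resulting ORs therefore need not match the reported values.

```python
from statistics import NormalDist


def case_control_power(odds_ratio, n_cases, n_controls,
                       p_exposed_controls, alpha=0.05):
    """Approximate power of a two-sided test comparing exposure proportions."""
    nd = NormalDist()
    p0 = p_exposed_controls
    # Exposure prevalence among cases implied by the odds ratio.
    p1 = odds_ratio * p0 / (1 - p0 + odds_ratio * p0)
    p_bar = (n_cases * p1 + n_controls * p0) / (n_cases + n_controls)
    se_null = (p_bar * (1 - p_bar) * (1 / n_cases + 1 / n_controls)) ** 0.5
    se_alt = (p1 * (1 - p1) / n_cases + p0 * (1 - p0) / n_controls) ** 0.5
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return nd.cdf((abs(p1 - p0) - z_crit * se_null) / se_alt)


def min_detectable_or(prevalence, cohort_size=25_000, controls_per_case=5,
                      p_exposed_controls=0.25, target_power=0.80, alpha=0.05):
    """Smallest OR > 1 reaching the target power, found by bisection."""
    n_cases = round(prevalence * cohort_size)
    # Controls cannot exceed the number of non-cases in the cohort.
    n_controls = min(controls_per_case * n_cases, cohort_size - n_cases)
    low, high = 1.0, 50.0
    for _ in range(60):
        mid = (low + high) / 2
        power = case_control_power(mid, n_cases, n_controls,
                                   p_exposed_controls, alpha)
        if power < target_power:
            low = mid
        else:
            high = mid
    return high


# Outcome prevalences taken from the text; exposure prevalence is assumed.
for name, prev in [("small for gestational age", 0.093),
                   ("gestational hypertension", 0.041),
                   ("very low birthweight", 0.007)]:
    print(f"{name}: minimum detectable OR ~ {min_detectable_or(prev):.2f}")
```

Varying `p_exposed_controls` shows how strongly the minimum detectable OR depends on how common the exposure is, which is why the assumed exposure prevalence should be reported alongside any power statement.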

Discussion
This report provides a brief overview of the establishment of the KPNC RPGEH Pregnancy Cohort and its biorepository, which were created to provide a resource for research on women's and children's health. The establishment of this valuable resource has the potential to address many key questions related to women's and children's health and is particularly timely in light of the recent dissolution of the National Children's Study (NCS). The NCS was developed after a 1990s White House Task Force highlighted the paucity of evidence evaluating the links between environmental exposures, development, and health outcomes in children and adults. The Children's Health Act of 2000 initiated the conduct of a national longitudinal study of environmental influences (including physical, chemical, biological, and psychosocial) during pregnancy on child health and development. A recent report explains that this study was dissolved due to feasibility and oversight issues [21,22] and suggests that funding agencies support smaller focused studies designed as tailored explorations as well as cohorts to facilitate longitudinal biospecimen collection and banking.
This large pregnancy cohort, derived from a diverse base population, can be used to generate sets of cases and controls for future clinical research studies, as demonstrated by our preliminary data. The availability of rich clinical data from the EHR, the questionnaire data, and existing perinatal research programs provides detailed phenotypic information that will further facilitate the conduct of perinatal epidemiology and translational studies. The RPGEH Pregnancy Cohort, coupled with the state-of-the-art KPNC Biorepository for long-term storage of serum, plasma and DNA samples and an ability to follow both women and their children long term for future health outcomes in the EHR, provides a truly unique and valuable resource for improving our understanding of women's and children's health.
Our preliminary data on the RPGEH Pregnancy Cohort demonstrate that at least 18.2 % of pregnant women participated, and the cohort is highly representative of the underlying KPNC pregnant population in terms of both maternal demographics and key perinatal outcomes. Pregnancy cohort participants were KPNC members on average 10 years before their index pregnancy and remained members on average 2.7 years after pregnancy to date, and most are still currently KPNC members. Thus, there is a unique ability to examine exposures even years before pregnancy and to follow women and their infants for years after delivery. While participating women were slightly more likely to be non-Hispanic white and less likely to be Asian, this pattern is frequently observed in cohort studies with multiethnic populations such as KPNC women of reproductive age. Overall, the RPGEH Pregnancy Cohort is extremely diverse, with 53 % of participants from racial-ethnic minority groups, and Asian women comprise 23 % of the cohort. This is especially significant as Asian women have previously been reported as less likely to participate in reproductive and biospecimen research [22,23].
The racial-ethnic diversity of this population provides important potential for studies examining racial-ethnic disparities in diseases and health care delivery. Because recruitment is integrated within clinical care, it is possible that not all pregnant women at participating medical facilities were invited to participate in the pregnancy cohort; therefore, 18.2 % is likely an underestimate of the participation rate among women who were actually invited.
The prevalence of several perinatal complications was similar between RPGEH cohort participants and the underlying population of women delivering in KPNC. Cohort participants were slightly less likely to have gestational diabetes mellitus (GDM) and infants of participants were slightly more likely to be macrosomic relative to non-participants. The lower prevalence of GDM among RPGEH participants is probably due in part to the fact that participants were less likely to be Asian and more likely to be non-Hispanic white; in this setting, Asian women have the highest prevalence of GDM [15,24] and non-Hispanic white women have the lowest prevalence of GDM.
The fetal origins of adult disease hypothesis posits that "fetal programming" occurs when the maternal metabolic, nutritional, environmental and hormonal milieu during development permanently programs the structure and physiology of organs and hence the future health of the offspring [25]. While there is some epidemiologic evidence supporting the "fetal programming" hypothesis, more longitudinal, observational studies examining the effects of a broad range of environmental and biological factors assessed in utero are needed to clarify the extent to which fetal programming contributes to adult diseases. In addition, a woman's health status during pregnancy may also influence her future health [26]. For example, women with pregnancy-related hypertension and/or preeclampsia, gestational diabetes, or preterm birth are at higher risk for hypertension, diabetes and cardiovascular disease later in life [7]. Therefore, given the rich health data in the KPNC EHR, the RPGEH Pregnancy Cohort will also allow for a lifecourse research approach [27].
The resource is available to Kaiser Permanente researchers as well as outside investigators who wish to collaborate with a Kaiser Permanente researcher to conduct biomarker, genetic, environmental and gene-environment interaction studies. The RPGEH Pregnancy Cohort has the unique ability to connect biospecimens collected at two time points during pregnancy with detailed short- and long-term environmental and clinical data on both women and their children, enabling research on immediate perinatal complications as well as longer-term maternal, child, and adult outcomes.