Thinking outside the curve, part I: modeling birthweight distribution

Background Greater epidemiologic understanding of the relationships among fetal-infant mortality and its prognostic factors, including birthweight, could have vast public health implications. A key step toward that understanding is a realistic and tractable framework for analyzing birthweight distributions and fetal-infant mortality. The present paper is the first of a two-part series that introduces such a framework. Methods We propose describing a birthweight distribution via a normal mixture model in which the number of components is determined from the data using a model selection criterion rather than fixed a priori. Results We address a number of methodological issues, including how the number of components selected depends on the sample size, how the choice of model selection criterion influences the results, and how estimates of mixture model parameters based on multiple samples from the same population can be combined to produce confidence intervals. As an illustration, we find that a 4-component normal mixture model reasonably describes the birthweight distribution for a population of white singleton infants born to heavily smoking mothers. We also compare this 4-component normal mixture model to two competitors from the existing literature: a contaminated normal model and a 2-component normal mixture model. In a second illustration, we discover that a 6-component normal mixture model may be more appropriate than a 4-component normal mixture model for a general population of black singletons. Conclusions The framework developed in this paper avoids assuming the existence of an interval of birthweights over which there are no compromised pregnancies and does not constrain birthweights within compromised pregnancies to be normally distributed. Thus, the present framework can reveal heterogeneity in birthweight that is undetectable via a contaminated normal model or a 2-component normal mixture model.

Simple bell curves are inadequate characterizations of birthweight distributions [17,11,[18][19][20]. Wilcox and Russell proposed a contaminated normal model, in which a predominant normal distribution accounts for most birthweights while a contaminating residual distribution yields most VLBW and ELBW cases [21]. The residual distribution does not have a specific structure and, in particular, is not normal. The contaminated normal model was later extended by Umbach and Wilcox to accommodate two residual distributions, one yielding excess births in the left tail and the other in the right tail [22]. Gage and Therriault took a different approach, employing a 2-component normal mixture model [23]. A primary normal distribution accounts for most birthweights, while a secondary normal distribution is linked not only to most VLBW and ELBW cases but also to many HBW cases. The 2-component normal mixture (resp., contaminated normal model) dichotomizes birthweights: those arising from the primary distribution (resp., predominant distribution) are conceptualized as reflecting ordinary fetal development, while the rest are considered to signify compromised fetal development [24]. Gage also formulated a parametric mixtures of logistic regressions (PMLR) technique to evaluate heterogeneity in mortality associated with this dichotomy [24].
While the aforementioned works demonstrate great insight, their statistical models have some limitations. In particular, the number of constituent distributions (predominant, residual, primary, secondary) is fixed a priori. If a constituent distribution can signify compromised fetal development [24], perhaps different biological mechanisms for compromised fetal development warrant a model with more than two or three constituent distributions. Likewise, perhaps more than two or three birthweight-specific mortality curves are needed to describe heterogeneity in mortality.
The present paper is the first in a two-part series that introduces a new framework for modeling birthweight distribution and fetal-infant mortality. We propose a normal mixture model for birthweight distribution in which the number of components is not fixed a priori but rather determined from the data using the Flexible Information Criterion (FLIC) (Pilla and Charnigo, Consistent estimation and model selection in semiparametric mixtures, submitted) or another model selection technique [25,26]. In the companion paper, we show how to estimate birthweight-specific mortality within each component using a generalization of PMLR [24] and how to compare mortality across components within a single population or across populations within a single component. In both papers, we seek statistical models that provide an empirically reasonable fit to the data. However, the goal is not to find good fitting models for their own sake. Rather, such models may lead to better assessments of mortality.

Pragmatics for mixture modeling a. Finite normal mixture models
Many phenomena cannot be accurately described via a normal distribution. When no other commonly used probability distribution seems appropriate, a finite normal mixture model is often reasonable. We now briefly describe the model. Readers interested in theoretical developments may consult references [27][28][29][30] and works cited therein.
Let f(x; μ, s) denote the probability density for the normal distribution with mean μ and standard deviation s. A finite normal mixture model with k components has probability density A common way to interpret Equation (1) is to imagine that the full population consists of k subpopulations. The proportion of individuals in the full population belonging to subpopulation j is p j . In subpopulation j, measurements are normally distributed with mean μ j and standard deviation s j .
The mixture components may or may not represent subpopulations with obvious biological definitions outside the statistical model. For example, in a 2-component normal mixture describing birthweights for white singletons in the United States, there is not an obvious biological characterization for the two components: we may say that the component with the smaller mean reflects compromised pregnancies, but we cannot immediately attribute the compromised pregnancies to a specific biological mechanism.
Ideally, modeling with finite normal mixtures may lead to discoveries of subpopulations with biological definitions that were not immediately obvious, although the mixture components themselves may still only be approximations to such subpopulations.

b. Order selection and the flexible information criterion
Equation (1) may be an imperfect description of real data regardless of k, but with k sufficiently large the description may be adequate to address a problem of scientific interest. Conversely, if k is too large, the model may become unwieldy. Hence, a researcher with real data must confront the problem of "order selection" (i.e., choosing an appropriate number of components).
Let M denote the maximum number of components that a researcher is willing to accept. For 1 ≤ m ≤ M, let L m denote the maximum value of the likelihood attainable by an m-component normal mixture. The Akaike Information Criterion (AIC) [25], Bayesian Information Criterion (BIC) [26], and Flexible Information Criterion (FLIC) (Pilla and Charnigo, Consistent estimation and model selection in semiparametric mixtures, submitted) are Above, (3m -1) is the number of free parameters in an m-component normal mixture. Also, n denotes the sample size, δ the average fraction of within-component variability to total variability over the M normal mixtures fitted by maximum likelihood, and B(n,δ) a bivariate function taking values between 0 and 1 (Pilla and Charnigo, Consistent estimation and model selection in semiparametric mixtures, submitted). The criteria balance fidelity to the observed data against model complexity; models are preferred for which the criteria are smaller. Note that m indexes normal mixtures being judged by the criteria, while k pertains to a normal mixture that has been adopted for data analysis.
The FLIC is distinguished from the AIC and BIC in that its penalty term 2 3 1 (log ) ( ) ( , ) n m B n  − is determined not only by the sample size but also by the configuration of data points: a configuration suggesting greater heterogeneity allows a model with more components to be selected. The penalty term of the FLIC also depends on M, so that a researcher must specify M. In analyzing birthweight data, we fix M = 7 since having too many components would impede inference about mortality risk. The FLIC and AIC perform well for small samples, while the FLIC and BIC are better for large samples, so we prefer to rely on the FLIC (Pilla and Charnigo, Consistent estimation and model selection in semiparametric mixtures, submitted).

c. Computational procedures
To employ the FLIC, we must obtain maximum likelihood estimates of the proportions, means, and standard deviations in all finite normal mixture models under consideration. For models with more than one component, numerical optimization procedures must be used. We apply the expectation maximization (EM) algorithm to obtain preliminary estimates [31], followed by the optimization (optim) procedure in version 2.3.1 of R (R Foundation for Statistical Computing, Vienna, Austria, 2006) to acquire final estimates. Our R code is available upon written request to the corresponding author. See Section I of [Additional file 1] for details on using EM and optim, including initial value specification.

Analyzing birthweight data with the FLIC a. A FLIC-selected model and competitors
To exemplify use of the FLIC, we draw a random sample of size 50,000 from the 202,849 white singletons who were born (or experienced fetal death) from 2000 to 2002 and whose mothers smoked heavily (at least twenty cigarettes per day). Since records with birthweights less than 500 grams or gestational ages less than 22 weeks were not consistently documented [32], we require infants in our sample to have known gestational ages of at least 22 weeks and birthweights between 500 and 5500 grams. The data source is the National Center for Health Statistics (NCHS) Public-Use Perinatal Mortality Data Files.
The FLIC selects a 4-component model (Figure 1a), Component 3 is loosely analogous to the predominant distribution in the contaminated normal model [22] and the primary distribution in the 2-component model [23]. Component 1 in the 4-component model includes ELBW and VLBW cases, component 2 contains mostly MLBW and NBW cases but also some VLBW and HBW cases, and component 4 comprises NBW and HBW cases.
Next we fit the contaminated normal and 2-component models to the same data set. For the contaminated normal model, we take the bin width to be 200 grams and use the BIC to select the number of contaminated bins [22]. Approximately 2.5% of cases are assigned to the lower residual distribution (threshold: 1700 grams), 97.5% to the predominant distribution (estimated mean and standard deviation, 3168 and 488 grams), and less than 1 in 8700 to the upper residual distribution (threshold: 5300 grams). Regarding the 2-component model, approximately 88.0% of cases are assigned to the primary distribution (estimated mean and standard deviation, 3186 and 458 grams) and 12.0% to the secondary distribution (estimated mean and standard deviation, 2617 and 951 grams).
The fitted contaminated normal, 2-component, and 4component models are compared in Figure 2. The contaminated normal model fits the ELBW and VLBW data nicely but exhibits artifacts at the thresholds of 1700 and 5300 grams; the contaminated normal model also understates the HBW data. The 2-component model provides a good fit at most birthweights but severely understates the ELBW data. The 4-component model avoids these weaknesses but has an exaggerated peak near the component 1 mean.

b. Reproducibility of order selection
In the preceding example, the selection of a 4-component model was based on a specific sample of 50,000 white singletons whose mothers smoked heavily. If we draw another sample of size 50,000, will the FLIC express the same preference?
We can address this question by drawing N rep samples of size 50,000 with replacement and applying the FLIC to each sample. Here "with replacement" means that an infant can appear in more than one sample, not that an infant can appear twice in the same sample. The frequency with which the FLIC prefers a 4-component model indicates the reproducibility of order selection. Table 1 shows the verdicts of the FLIC and other criteria for N rep = 25 samples of size 50,000. The FLIC prefers a 4-component model for 22 out of 25 samples; for the other three samples, the FLIC narrowly prefers a 6component model. The verdicts of the BIC match those of the FLIC. The AIC is equivocal between 6-component and 7-component models. Table 1 also identifies the preferences of the FLIC for sample sizes smaller than 50,000. The tendency to favor simpler models at smaller sample sizes can be understood by analogy to a hypothesis test. Imagine testing a null hypothesis that there are two components against an alternative hypothesis that there are more than two components: as the sample size decreases, the power to reject a false null hypothesis also decreases.

c. Uncertainty in parameter estimation
Although we may be comfortable using a 4-component model for the birthweights of white singletons whose mothers smoked heavily, Equation (5) does not convey the uncertainty in the parameter estimates for that model.
To assess uncertainty in parameter estimation, we fit k-component models using each of N rep samples of equal size; in our example, k = 4 and there are N rep = 25 samples of size 50,000. Let θ represent a parameter of interest, such as μ 3 , and let   , , ,  N rep and serving as an overall estimate of θ, and with S ∧  denoting the corresponding standard deviation, we can define a confidence interval via , , ,  N rep were normally distributed with expected value θ, then for 95% confidence we should choose C as the upper .025 quantile of the standard normal distribution (or of a T distribution); in the absence of normality, to be conservative we could choose C = = 1 05 447 / .
. based on Chebychev's inequality [33]. However, not even C = 5.0 yields a coverage probability of 95% (see Section 3b of Results , , ,  N rep are not independent due to the large overlaps among the N rep samples. The first problem can be addressed by modifying Equation (6) to where B ∧  denotes the estimated absolute value of the bias [34]. Our approach to acquiring B ∧  is simulationbased. We simulate a birthweight data set from  population that each of the N rep samples constitutes. Let C 0 denote the value of C that would be chosen if this fraction were negligibly small, and let C denote the value that would be chosen if this fraction were equal to , a positive number less than 1. In Section II of [Additional file 1], we show that Section II of [Additional file 1] also explains why we sample with replacement, why we sample instead of using the full population, and how to compare parameters within and between populations. Table 2 lists overall estimates and confidence intervals for parameters in a 4-component model for the birthweights of white singletons born to heavily-smoking mothers, using Equations (7) and (8) with the same N rep = 25 samples of size 50,000 in Table 1, C 0 = 2.5 (see Section 3b of Results), and = .2465 = 50,000/202,849. Figure 1b displays the mixture model implied by the overall estimates in Table 2. Section III of [Additional file 1] examines how the overall estimates and confidence intervals change when the sample size is less than 50,000.

Further illustrations a. Simulation study on model selection
For our first simulation study we generated 25 nonoverlapping data sets of size 5000 from designs A through E in Table 3; see also Figure 3. Designs A through E represent the fitted 2-through 6-component models derived from the 25 samples of size 50,000 in Table 1. Values in the data sets less than 500 or greater than 5500 were discarded since the 2-through 6-component models were meant to mimic a birthweight distribution; new values were drawn as needed to complete the data sets. We assessed how often the FLIC, BIC, and AIC recovered the correct number of components. This was repeated for data sets of different sizes up to 100,000.  The columns "FLIC 5000", "BIC 5000", and "AIC 5000" contain the preferences of the three model selection criteria for the number of components in a normal mixture model for birthweight distribution, based on 25 samples of size 5000 from the population of white singletons born to heavily smoking mothers. The next nine columns correspond to sample sizes of 10000, 25000, and 50000.  (7) and (8) with C 0 = 2.5 and = .2465. Table 4, the FLIC and BIC consistently returned the correct answer with the 2-component model at a sample size of 5000, the 3-component model at a sample size of 10,000, and the 4-component model at a sample size of 25,000. The FLIC and BIC did not consistently return the correct answer for the 5-component or 6-component model at any sample size, although they occasionally detected components 5 and 6 at a sample size of 100,000. The AIC was erratic.

As shown in
At larger sample sizes, the FLIC and BIC routinely claimed a third (non-existent) component for the 2component model. We attribute this to the removal of values less than 500 or greater than 5500, after which the 2-component model was, strictly speaking, no longer a normal mixture but rather a truncated normal mixture.

b. Simulation study on calibrating confidence intervals
For our second simulation study we generated 25 overlapping data sets of size 50,000 from design C in Table  3, the degree of overlap consistent with a population of 200,000. For each of various C between 2.0 and 5.0, we used Equation (7) to form confidence intervals for the mixture parameters p 1 , p 2 , p 3 , p 4 , μ 1 , μ 2 , μ 3 , μ 4 , s 1 , s 2 , s 3 , s 4 . We recorded how many of the mixture parameters were contained in their respective confidence intervals. This was repeated nine more times, and we tabulated how many of the 120 = 12 × 10 confidence intervals contained their targets. Confidence intervals were also formed using Equation (6) for comparative  purposes. The above steps were repeated with overlapping data sets consistent with a population of 1,000,000 and with nonoverlapping data sets consistent with an effectively infinite population.
The results are summarized in Table 5. With an effectively infinite population, only 81.7% of the confidence intervals formed using Equation (6) contained their targets at C = 5.0. The confidence intervals formed using Equation (7) contained their targets 95.0% of the time at C = 2.5. The adjustment suggested by Equation (8) appears reasonable: = .05 = 50,000/1,000,000 and N rep = 25 yield C = 1.315 C 0 , which accords with the 95.8% capture of mixture parameters at C = 3.5 ≈ 1.315 × 2.5 with a population of 1,000,000.

c. Another example with real data
We also drew 25 samples of size 50,000 from the 1,749,827 black singletons who were born (or experienced fetal death) from 2000 to 2002, regardless of maternal smoking status. Table 6 records the frequencies with which the FLIC selected the 2-through 7-component models as well as the overall estimates of component proportions, means, and standard deviations for each of these models. The 6-component model was overwhelmingly preferred by the FLIC. Figure 4 juxtaposes the fitted 4-component and 6-component models implied by the overall estimates. The four components in the 4-component model are loosely analogous to the second through fifth components in the 6-component model, so that the main rationale for adding two more components appears to be providing a more elaborate description of the far left and right tails of the birthweight distribution.

Discussion
Our approach to modeling birthweight distribution is distinguished from previous proposals in that the data determine the number of components in the normal mixture model. We have seen that data sets of size  2  3  4  5  6  7  2  3  4  5  6  7  2  3  4  5  6  7   2  5000  25  0  0  0  0  0  25  0  0  0  0  0  7  9  1  3  4  50,000 for white singletons born to heavily-smoking mothers typically warrant 4 components, while data sets of size 50,000 for black singletons usually demand 6 components. These results underscore the idea that a one size fits all paradigmwhether that be a 2-component normal mixture model or even the across the board use of a 4-component normal mixture model may lead to unreasonable representations of birthweight distribution for some populations. Our approach, on the other hand, allows birthweight distribution to be   described differently for different populations. We also note here that, although results have not been presented in this paper for a full spectrum of populations, our experience has been that data sets of size 50,000 usually call for between 3 and 6 components. The second paper in our two-part series will elucidate the main advantage of our approach over the contaminated normal model [21,22] and the 2-component model [23], namely its greater potential to expose heterogeneity in mortality risk. By this we mean that, even at a fixed birthweight, some infants may be at higher risk than others. While such heterogeneity seems plausible, if not altogether obvious, it may not be adequately expressed by either the contaminated normal model or the 2-component model. Hence, allowing a model to have more than 2 components is not an intellectual exercise or fitting the data for the sake of fitting the data but rather a way to improve assessment of mortality.
Since gestational age is sometimes considered in tandem with birthweight [19,20], we now comment on its relation to the methodology in this two-part series.
Our approach to modeling birthweight distribution does not explicitly consider gestational age. However, our experience is that the first component typically captures most very preterm births. For instance, the birthweight distribution for white singletons with gestational ages > 37 weeks is well approximated by a 3-component model whose components resemble the second through fourth components of a 4-component model for white singletons in general.
Even so, one may be interested in extending our methodology to explicitly consider gestational age and/ or other covariates. We envisage at least two possible extensions. The first would generalize the work of Fang, Stratton, and Gage [19] in which the number of components had been constrained a priori to two, while the second would be novel.  The first extension would be to model the joint probability density of birthweight and gestational age as a bivariate normal mixture, with the number of components determined from the data using the FLIC rather than being constrained a priori to two. Then, instead of estimating the mortality risk within each component as a function of birthweight only, one could estimate the mortality risk within each component as a function of both birthweight and gestational age.
The second extension would be to retain the univariate normal mixture model for birthweight distribution but create auxiliary models to relate covariates, such as gestational age, to mixture components. The appeal of this extension is that it could allow some mixture components to be placed in approximate correspondence with identifiable subpopulations.

Conclusions
The present paper, the first in a two-part series, develops a new and flexible approach to modeling a birthweight distribution using a normal mixture model with the number of components determined from the data rather than fixed a priori. This approach allows the detection of heterogeneity in birthweight that cannot be found with a contaminated normal model or a 2-component normal mixture model. Unlike a contaminated normal model, our approach does not assume the existence of an interval of birthweights over which there are no compromised pregnancies. Unlike a 2-component normal mixture model, our approach does not constrain birthweights within compromised pregnancies to be normally distributed. Yet, better modeling of birthweight distribution is a means to an end, namely a greater understanding of fetal-infant mortality. The second paper in our two-part series reveals that, when coupled with methodology for estimating birthweight-specific mortality curves within each component, this paper's approach to describing a birthweight distribution can also reveal heterogeneity in mortality.

Methods
[Additional file 1] presents technical details on our methodology and its implementation.

Additional material
Additional file 1: Technical Appendix. Additional file 1 presents technical details on our methodology and its implementation.