Venue: The Fuqua School of Business, Duke University, 1 Towerview Drive, Durham, NC 27708-0120
Presentation
When Analyzing Health Expenditure Data, Is One-Part Enough?
Many outcomes (y) studied empirically in health economics have two fundamental statistical properties: (a) y≥0; and (b) the outcome y=0 is observed sufficiently frequently that the zeros cannot be ignored econometrically. Such data structures are observed in many health care data sets but are especially prevalent and troublesome in health care expenditure data, where the distribution of y for y>0 is often highly skewed because a few patients have high health care utilization. Given exogenous covariates x, econometric applications in which such data structures are encountered have often relied on the two-part model (2PM). The 2PM assumes that Pr(y>0 | x) is governed by a parametric binary probability model such as logit or probit (part one), and that either E(ln(y) | y>0, x) = xb (e.g., an Ordinary Least Squares (OLS) model with log-transformed y) or E(y | y>0, x) = exp(xb) (e.g., a Generalized Linear Model (GLM) with a log link and Gamma distribution) holds as part two. Although the log transformation improves precision and reduces the influence of outliers, results on the log scale are, per se, of little interest. The complications of retransformation to the original scale are well documented in the literature, especially when there is evidence of heteroskedasticity due to continuous covariates. Manning presented strong evidence that failure to account for heteroskedasticity can lead to misleading policy implications. For these reasons, a GLM with a log link has been proposed for estimating E(y | y>0, x). In analyzing health care expenditure data, the analyst must decide whether to use a one-part model (1PM) or a 2PM. While the choice among competing 2PM estimation strategies has been addressed in the literature, the 1PM is attractive because it is fit to all observations, regardless of whether the individual used any services. Despite the attractiveness of such a simple model, questions remain about its performance when the proportion of observations with y=0 is large and the data are highly skewed.
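As a rough illustration of how the two parts combine, the sketch below computes the unconditional mean implied by a 2PM, E(y | x) = Pr(y>0 | x) * E(y | y>0, x), using a logit for part one and a log link for part two, alongside the 1PM analogue E(y | x) = exp(xd). The function names and all coefficient values (gamma, beta, delta) are hypothetical and chosen only for illustration; they are not estimates from the paper.

```python
import math


def logistic(z):
    """Inverse logit: maps a linear index to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def linear_index(coefs, x):
    """Inner product of a coefficient vector and a covariate vector."""
    return sum(c * xi for c, xi in zip(coefs, x))


def two_part_mean(x, gamma, beta):
    """E(y | x) under a 2PM with a logit part one and a log-link part two.

    gamma: hypothetical part-one (logit) coefficients
    beta:  hypothetical part-two (log-link) coefficients
    """
    prob_positive = logistic(linear_index(gamma, x))    # Pr(y>0 | x)
    mean_if_positive = math.exp(linear_index(beta, x))  # E(y | y>0, x)
    return prob_positive * mean_if_positive


def one_part_mean(x, delta):
    """E(y | x) under a 1PM with a log link, fit to zeros and positives alike."""
    return math.exp(linear_index(delta, x))


# Hypothetical covariate vector (intercept plus one covariate) and coefficients.
x = [1.0, 0.5]
gamma = [0.0, 0.0]               # implies Pr(y>0 | x) = 0.5
beta = [math.log(1000.0), 0.0]   # implies E(y | y>0, x) of about 1000
delta = [math.log(500.0), 0.0]   # a 1PM that implies the same overall mean

print(two_part_mean(x, gamma, beta))
print(one_part_mean(x, delta))
```

In this toy setting both models imply roughly the same overall mean. The question raised in the abstract is how well each specification recovers that mean when it is estimated from data with many zeros and a skewed positive part.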
One of the main objectives of this paper is to assess the influence of the proportion of zeros on model choice. In particular, what proportion of zeros warrants a 2PM rather than a 1PM? This choice may also be influenced by the distribution of the data. Through a series of simulations based on observational data, this paper compares the performance of the 1PM and the 2PM (both based on a GLM with a log link) across two dimensions: 1) the proportion of observations with y=0 and 2) the distribution of y for y>0.
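The two simulation dimensions can be sketched as follows. This is a minimal, purely synthetic illustration and not the paper's design, which is based on observational data: it assumes that zeros occur independently with probability p_zero and that positive expenditures follow a Gamma distribution whose shape parameter controls skewness (for a Gamma distribution, skewness equals 2 / sqrt(shape)). The function and parameter names are hypothetical.

```python
import random


def simulate_expenditures(n, p_zero, shape, mean_positive, seed=0):
    """Draw n synthetic expenditures: 0 with probability p_zero, else Gamma.

    shape:         Gamma shape parameter; smaller values give heavier skew
    mean_positive: target mean of y among observations with y>0
    """
    rng = random.Random(seed)
    scale = mean_positive / shape  # Gamma mean equals shape * scale
    return [
        0.0 if rng.random() < p_zero else rng.gammavariate(shape, scale)
        for _ in range(n)
    ]


def share_of_zeros(y):
    """Fraction of observations with y equal to zero."""
    return sum(1 for v in y if v == 0.0) / len(y)


# Vary dimension 1 (proportion of zeros) while holding dimension 2 (shape) fixed.
for p_zero in (0.1, 0.5, 0.9):
    y = simulate_expenditures(10000, p_zero, shape=0.5, mean_positive=1000.0)
    print(p_zero, round(share_of_zeros(y), 3), round(sum(y) / len(y), 1))
```

Repeating the loop over several shape values would cover the second dimension. In a fuller study, each simulated data set would then be fit with both the 1PM and the 2PM and the resulting predictions compared.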