Venue: The Fuqua School of Business, Duke University, 1 Towerview Drive, Durham, NC 27708-0120
Presentation
Estimation Comparisons of Linear Cost Models with Skew
Total healthcare cost for an individual patient equals the sum of the resources used by the patient multiplied by the unit cost for each resource. Frequently, the distribution of cost across patients is skewed because a few patients in the population have much-larger-than-average cost (e.g., treatment of rare clinical conditions). If the circumstances associated with the skewed cost can be identified and controlled for, the regression residuals may still be normally distributed, and cost can then be modeled using ordinary least squares (OLS) to generate unbiased and efficient parameter estimates. If the source of the skewed cost cannot be identified and the regression residuals are skewed, OLS estimates remain consistent and are normally distributed in large samples. In small samples, however, skewed residuals yield inefficient OLS estimates, and standard statistical tests may be misleading.
To deal with skewed dependent variables, researchers often transform the dependent variable using a natural logarithm. Researchers have also used generalized linear models (GLM) to provide more general dependent variable transformations and error distributions (Manning and Mullahy, 2001). Subsequent research suggested using extended estimating equations (EEE) to choose the best GLM specification (Basu and Rathouz, 2005). While these transformation methods are more general than those previously used, they assume a multiplicative relationship between the dependent and independent variables: the effect of a unit increase in an independent variable varies across the range of the variables specified in the model. These models may therefore produce misleading results if the underlying cost model is linear in the independent variables and the cost skew results from unmeasured rare conditions in a small percentage of patients.
Previous research compared the properties of log-transformed OLS models to various GLM specifications using underlying non-linear models as the source of their comparisons (Manning and Mullahy, 2001). To the best of our knowledge, no one has assessed the properties of these estimators when the underlying cost model is linear. Here we generated linear cost data in which a portion of the population has excessive cost. In the simulations, we varied (1) the size of the excessive cost; (2) the percentage of the population with excessive cost; (3) the sample size; and (4) the presence of heteroskedasticity. We examined seven frequently used models: OLS with a non-transformed dependent variable, OLS with a log-transformed dependent variable, four GLM specifications (gamma or Poisson family, each with a log or identity link), and the EEE model.
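The data-generating process described above can be sketched as follows. This is a minimal illustration, not the authors' simulation code: the function names, the coefficient values, the uniform covariate, and the particular form of heteroskedasticity are all assumptions chosen only to make the four varied factors concrete.

```python
import random
import statistics


def simulate_costs(n, excess_cost, excess_share, heteroskedastic=False,
                   beta0=1000.0, beta1=50.0, sigma=100.0, seed=0):
    """Draw (x, cost) pairs from a linear cost model with rare excess cost.

    All parameter values are illustrative assumptions, not values from the study.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        # Assumed form of heteroskedasticity: error spread grows with x.
        sd = sigma * (1.0 + x / 10.0) if heteroskedastic else sigma
        cost = beta0 + beta1 * x + rng.gauss(0.0, sd)
        # A small share of patients incurs an additional, unexplained cost.
        if rng.random() < excess_share:
            cost += excess_cost
        rows.append((x, cost))
    return rows


def ols_slope(rows):
    """Slope from a simple OLS regression of cost on x."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def skewness(values):
    """Sample skewness (third standardized moment)."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - m) / sd) ** 3 for v in values])


rows = simulate_costs(n=20000, excess_cost=5000.0, excess_share=0.02)
slope = ols_slope(rows)
cost_skew = skewness([cost for _, cost in rows])
print(f"OLS slope: {slope:.2f}, cost skewness: {cost_skew:.2f}")
```

Because the excess cost here is assigned independently of x, it skews the cost distribution without changing the true slope; varying `excess_cost`, `excess_share`, `n`, and `heteroskedastic` corresponds to the four factors listed above.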
All models produced unbiased estimates of the parameter value associated with an independent variable at its mean, although some were slightly more efficient. However, differences appeared when marginal effects were estimated at values of the independent variable other than the mean. Because the data-generating process in this case was a linear relationship between the dependent and independent variables, the marginal effects should be constant. For non-linear models, statistically significant but incorrect marginal effects were estimated at values of the independent variable other than the mean. Thus, researchers need to consider whether the variability in the marginal effects generated by non-linear models is appropriate for the cost data under consideration.
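To illustrate why marginal effects diverge away from the mean, the sketch below fits an untransformed OLS model and a log-transformed OLS model to the same linearly generated data and evaluates each model's implied marginal effect at several values of x. It is a hedged illustration only: the data-generating values and helper names are assumptions, and the retransformation uses a simple smearing factor (the mean of the exponentiated log-scale residuals), which may differ from what the study used.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Return (intercept, slope) from a simple OLS regression of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


# Assumed linear data-generating process with rare excess cost (illustrative values).
rng = random.Random(1)
xs = [rng.uniform(0.0, 10.0) for _ in range(5000)]
ys = [1000.0 + 50.0 * x + rng.gauss(0.0, 50.0)
      + (5000.0 if rng.random() < 0.02 else 0.0) for x in xs]

# Untransformed OLS: E[y|x] = a + b*x, so the marginal effect is b everywhere.
a_lin, b_lin = fit_line(xs, ys)

# Log-transformed OLS: E[log y|x] = a + b*x.
log_ys = [math.log(y) for y in ys]
a_log, b_log = fit_line(xs, log_ys)

# Smearing factor for retransformation back to the cost scale.
smear = statistics.fmean(
    [math.exp(ly - (a_log + b_log * x)) for x, ly in zip(xs, log_ys)]
)


def me_linear(x):
    """Marginal effect under the linear model: constant in x."""
    return b_lin


def me_log(x):
    """Marginal effect under the log model: b * exp(a + b*x) * smear."""
    return b_log * math.exp(a_log + b_log * x) * smear


for x in (1.0, 5.0, 9.0):
    print(f"x={x:.0f}  linear ME={me_linear(x):.1f}  log-model ME={me_log(x):.1f}")
```

Under the log specification the marginal effect is proportional to the predicted cost, so it necessarily changes with x even though the true effect in this simulated setting is the same at every x.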