 Research
 Open Access
 Published:
Statistical models for the analysis of skewed healthcare cost data: a simulation study
Health Economics Review volume 5, Article number: 11 (2015)
Abstract
Skewed data is the main issue in statistical models in healthcare costs. Data transformation is a conventional method to decrease skewness, but there are some disadvantages. Some recent studies have employed generalized linear models (GLMs) and Cox proportional hazard regression as alternative estimators.
The aim of this study was to investigate how well these alternative estimators perform in terms of bias and precision when the data are skewed. The primary outcome was an estimation of population means of healthcare costs and the secondary outcome was the impact of a covariate on healthcare cost. Alternative estimators, such as ordinary least squares (OLS) for Ln(y) or Log(y), Gamma, Weibull and Cox proportional hazard regression models, were compared using Monte Carlo simulation under different situations, which were generated from skewed distributions.
We found that there was not one best model across all generated conditions. However, GLMs, especially the Gamma regression model, behaved well in the estimation of population means of healthcare costs. The results showed that the Cox proportional hazard model exhibited a poor estimation of population means of healthcare costs and the β_{1} even under proportional hazard data. Approximately results are consistent by increasing the sample size. However, increasing the sample size could improve the performance of the OLSbased model.
Background
Statistical models are often used in many healthcare economics and policy studies. The main issues in such studies are the estimation of mean population healthcare costs and finding the best relationship between costs and covariates through regression modeling [1]. However, these cannot be implemented by simple statistical models as the healthcare costs data have specific characterizations [2]. Healthcare costs data demonstrate the substantial positive skewness and are sometimes characterized by the use of large resources with zero cost [3]. These specifications of data impose a number of difficulties in using standard statistical analysis, such as implementing linear regression causes unreliable results [2].
Twopart models based on mixture models are performed when excess zeroes are present in data [3]. Further, logarithmic (or other) transformations are commonly used to decrease the skewness and drive them close to normal distribution, in order to implement linear regression models. The logarithmic transformation with ordinary least squares (OLS) regression is a very common approach in applied economics. However, it also presents several drawbacks. One of these drawbacks is that the predictions are not robust enough to detect the heteroscedasticity in the transformed scale [1,4]. The general consensus is that estimating the mean cost using a logarithmic regression model leads to biased estimation [2,46].
An alternative approach is using nonlinear regression models, of which exponential conditional mean (ECM) models in generalized linear models (GLMs) are examples [7]. Generally, GLMs extend the linear modeling framework to allow response variables that are not normally distributed. In healthcare studies, generalized linear modeling through loglink function avoids the weakness and problems of OLS regression. In addition, the Cox proportional hazards model has been a controversial issue for healthcare data modeling. It has been used as a special flexible model for skewed healthcare data in many studies [8,9].
In recent years, extensive research efforts have been done to propose suitable regression methods for the analysis of skewed healthcare data [1,3,10,11]. Many studies also set out a clear framework for comparing these methods from a variety of aspects [5,6,12,13]. Moreover, a few have provided prominent reviews of the statistical methods for analyzing healthcare data [2,7].
However, there is no comparative study that investigates the different methods using different sample sizes. This paper was conducted to compare the proposed statistical models in the available literature using different sample sizes. We specifically focused on comparing proposed statistical models for positive skewed healthcare costs, but not zero mass problems. It was developed based on a Monte Carlo simulation to find appropriate methods to get the unbiased and precise estimates of the mean costs. This aspect is particularly important in the literature [5,13]. Furthermore, in this paper, the coefficient estimations of covariates are also evaluated in our simulations using different sample sizes.
Methods
Let y _{ i } denote healthcare expenditures for person i, and x _{ i } denote the set of covariates, including the intercept. We estimated the following models.
Ordinary least square based on log transformation
It is common to use linear regression models for the log of costs in healthcare expenditures. Logarithmic transformation is most commonly used to decrease skewness and to make the distribution more symmetric and closer to normality. The log regression model is as follows:
Where it was assumed that E(xε) = 0 and E(ε) = 0, since predicting costs on the original scale is primary objective so:
If the error term is \( N\left(0,{\sigma}_{\upvarepsilon}^2\right) \) distribution, it is a lognormal case, and then:
However, if the error term is not normally distributed, but is homoscedastic, then the smearing estimator is applied.
Generalized linear models
GLMs are a broad class of statistical models for relating nonnormal dependent variables to linear combinations of predictor variables. An invertible link function (g (.)) converts the expectation of the response variable, E (Y _{ i } ), to the linear predictor:
The ECM model is a special type of GLM with loglink function, and can be viewed as a nonlinear regression model:
Weibull and Gamma regression models are assumed as two special types of ECM model; β values were estimated here using quasimaximum likelihood estimation. The exponential distribution was considered to be a special case of the Weibull and Gamma regression models when the shape parameter was equal to 1.
Cox proportional hazard model
The Cox proportional hazard model is based on hazard and survival functions, instead of ECM or direct estimation of E (yx). It is a semiparametric model because it does not specify the baseline hazard function:
Where h _{0}(y) is the baseline hazard, estimated using the Breslow method. The main issue in this model, which should be considered, is the proportional hazard assumption. This means that the hazard ratio of two individuals is independent of time [14]. Note that the interpretation of the estimated coefficients in this model is based on hazard ratio rather than the covariate effect on the mean.
Comparing model performance
The interested estimations are evaluated as follows:
Two statistics were calculated to evaluate the quality of cost predictions using above mentioned models. The first was the mean prediction error (MPE), which measures the bias and predictive accuracy, and the second was the mean absolute prediction error (MAPE):
Actually, MPE indicates how the mean of predicted healthcare expenditures from a particular model compares with the mean of healthcare costs. Models with lower values of MPE have smaller biases than models with higher values. However, MAPE indicates how values of individual predicted healthcare expenditures from a particular model compare with values of actual healthcare expenditures in the sample [6].
Mean square of error (MSE) and 95% confidence interval of the estimate of β_{1} coefficient were calculated to evaluate the accuracy and precision of the estimated parameter. A more precise estimator should be closer to the true value. A Goodness of fit test provided by HosmerLemeshow test and the Akaike information criteria (AIC) used as an aid to choosing between competing models. Lower values of the AIC indicate the preferred model criterion were also used to evaluate. The mean of the residuals across deciles of x was also plotted in order to assess a systematic bias in the predictions.
Simulation study
To compare the performance of the alternative models, a Monte Carlo simulation was used to show how each estimator behaves under different conditions of skewness that are common in healthcare expenditure studies.
To determine the effect of the level of skewness on the estimated outcome, some skewed probability density function (pdf), such as lognormal, Gamma and Weibull distribution, was used as a datagenerating mechanism. To assess how the sample size affects the estimations, 10,000 times batch samples (n = 25, 50, 100, 500 and 1,000) were examined by comparing all models mentioned in the previous section. All generated data were standardized according to Basu et al., in which β _{0} was considered as intercept, estimated assuming E(y) = 1.
Lognormal data generation
The true model assumed is as follows:
Where x is uniform (0, 1), ε ∼ N(0, σ ^{2}), in which σ ^{2} = 0.5, 1.0, 1.5, and β _{1} = 1 were used. β _{0} was estimated based on E(y) =1:
The skewness of lognormal models is an increasing function of the variance as follows:
We considered σ ^{2} = 0.5, 1, 1.5 and 2.
Gamma data generation
The pdf of Gamma distribution is:
Where b = exp(β _{0} + β _{1} x) and α are the scale and shape parameters, respectively. The mean is equal to αb and the skewness is a decreasing function of the shape parameter, as follows:
Where x is uniform (0, 1), β _{1} = 1 and β _{0} was estimated so that E(y) = 1. Also, we used the assumption that α = 0.5, 1, 2 and 4 in the data generating process.
Weibull data generation
Weibull data generation is considered as a function of the datagenerating mechanism, which has proportional hazard properties. It was used to generate proportional hazard data, since the Cox proportional hazards model is based on this assumption:
Where b = exp(β _{0} + β _{1} x) and α are the scale and shape parameters, respectively. The mean is equal to \( b\varGamma \left(1+\frac{1}{\alpha}\right) \) and the skewness is also a decreasing function of the shape parameter, as follows:
Shape parameter was considered as 0.5, 1 and 5 in this scenario. The proportional hazards assumption was evaluated in all of the simulations.
Results
Mean, standard deviation, skewness and kurtosis for the y in various datagenerating mechanisms are presented in Table 1. Based on this result, the lognormal and Weibull models provided greater skewness than the Gamma model. It should be noted that the skewness of data from the lognormal and Gamma models increased monotonically as the sample size increased.
The results in Tables 2, 3, 4, 5 and 6 were based on 10,000 times batch replication, in sample sizes of 25, 50, 100, 500 and 1,000, respectively. These tables show the results of the estimates of population means and β_{1} for each model under the various datagenerating processes. Minimum deviance (MPE) and absolute deviance (MAPE) of predicting the value of the response variable (healthcare costs) considered as adequacy of methods.
Generally, entire models exhibited lower MPE by declining skewness and increasing sample size. However, the Gamma regression model had the smallest biases across all datagenerating processes. Moreover, our results indicated that its ability to predict the expenditures in a small sample size was as good as for large sample sizes. Furthermore, OLS for Ln(y) and Weibull regression models showed a lower bias than the Cox proportional hazard model, even in proportional hazard datagenerating process (Figure 1).
In addition, evaluating MAPE as an accuracy measure showed that Gamma and Weibull regression models have almost equal MAPE values. In many conditions, such as the lognormal model with σ^{2} = 1.5, 2, the Gamma model with shape equal to 0.5 (monotonically declining pdf) and the Weibull model with shape equal to 0.5 (linearly increasing hazard), as higher skewness datagenerating mechanisms, the MAPE from Weibull regression model was always lower than Gamma regression model.
Interestingly, as the sample size increased, the MAPE of OLS for Ln(y) became very similar to that of the Gamma regression model. However, the MAPE of all models had an insignificant upward trend as the sample size increased.
Since there was also a concern about consistency and precision in the estimates of β_{1} coefficients, MSE and 95% simulation intervals were investigated. All three regression Gamma and Weibull and OLS for Ln(y) models provided approximately similar MSEs of β_{1} as data generated using log normal. However, the Gamma regression model showed minimum MSE values. We also found that MSE decreased by reducing skewness and increasing sample size. For the Weibullgenerated data, Gamma and Weibull regression models exhibited similar and minimum values of MSE. Under all datagenerating mechanisms, 95% simulation intervals were closer to true values in all three regression models. Surprisingly, the Cox proportional hazard model revealed maximum MSE and less accurate 95% simulation intervals, even within proportional hazards datagenerating scenario.
Comparison goodness of fit tests (HosmerLemeshow test and AIC criterion) revealed that, under a different range of data conditions, Gamma and Weibull regression models were better behaved. Finally, investigation of the pattern of the residuals as a function of X, which have been implemented by the mean of the residuals across deciles of X, showed more bias for the Cox proportional hazard model across all generated data and sample sizes (see Figures 215).
Discussion
Although there are many substantial studies addressing the statistical issues in healthcare cost analysis over the last few decades, it is still an important issue that needs further evaluation. In this paper, we assessed the performance of various wellknown statistical regressionbased models in healthcare expenditure analysis, through different sample sizes and datagenerating processes, using a Monte Carlo simulation. Each model was evaluated on 10,000 batch random samples, with 25, 50, 100, 500 and 1,000 sample sizes. Other studies were performed using just one large sample size (such as 10,000) [5,10], while we know the sample size is an important issue in healthcare studies and in precision of modelbased estimators.
We found that, by considering the different interest points of various research and various data conditions, different modelbased estimators could be used. Indeed, no estimator is considered to be the best across all ranges of datagenerating processes. In addition, the accuracy of the results was almost the same in all sample sizes.
However, the GLMs estimated population means more precisely in all datagenerating processes and sample sizes. In this respect, our results are consistent with other studies [2,5,6,10]. Comparative studies between log models were evaluated on 1,000 random samples, with a sample size of 10,000. They found almost identical results in estimating the slope β_{1}, but the GLMs were substantially more precise than OLSbased model [5]. In this paper, as the sample size increased, the precision of estimating the mean population and the β_{1} using an OLSbased model became closer to that of GLMs.
Based on our result, the Gamma regression model provided more accurate estimates of population mean. In other studies, which compare log and Cox proportional hazard models, the Gamma regression model was introduced as the reasonable model across all of the simulation processes [13]. They have also found that the Cox proportional hazard model exhibited good performance when data were generated by distribution with a proportional hazards assumption [13]. In this paper, a Weibull distribution was selected as the proportional hazard datagenerating mechanism. In addition, investigating proportional hazards assumption detected that gamma generation process also has produced data with proportional hazard properties but the Cox proportional hazard model showed a poor result within these data generation process. We also found that the Cox proportional hazard model behaved poorly in other data generation scenarios.
Our study has some limitations, including the fact that our focus was on generating skewed data, while kurtosis may have affected the results. Furthermore, the study was limited to fixed covariates.
Conclusions
Selecting the best model is dependent on the interest point of research, which could be the estimated mean of the population or covariate effects. There is no best model among all data conditions. It seems that the GLMs, especially the Gamma regression model, behave well regarding the estimation of population means of healthcare costs in most of the conditions. The results are consistent among all sample sizes; however, increasing sample size leads to improvement in the performance of the OLSbased model.
Based on estimation of the β_{1}, GLMs seems to provide plausible estimations and as the sample size increased, estimated the β1 more precisely in all datagenerating processes. Under all data generation, process even proportional hazard data generation scenarios the Cox proportional hazard model provided a poor estimation of mean population and the β_{1}.
Abbreviations
 Prob. H.L.:

Hosmer–Lemeshow test
 signif.:

At the 5% level
References
 1.
Gregori D, Petrinco M, Bo S, Desideri A, Merletti F, Pagano E. Regression models for analyzing costs and their determinants in health care: an introductory review. Int J Qual Health Care. 2011;23(3):331–41.
 2.
Mihaylova B, Briggs A, O’Hagan A, Thompson SG. Review of statistical methods for analysis healthcare resources and costs. Health Econ. 2011;20:897–916.
 3.
Gilleskie DB, Mroz TA. A flexible approach for estimating the effects of covariates on health expenditures. J Health Econ. 2004;23:391–418.
 4.
Cantoni E, Ronchetti E. A robust approach for skewed and heavytailed outcomes in the analysis of health care expenditures. J Health Econ. 2006;25(2):198–213.
 5.
Manninga WG, Mullahy J. Estimating log models: to transform or not to transform? J Health Econ. 2001;20:461–94.
 6.
Deb P, Burgess JF. A quasiexperimental comparison of econometric models for health care expenditures. New York: Hunter College Department of Economics; 2003. Report No.: 212.
 7.
Basu A, Manning WG. Issues for the next generation of health care cost analyses. Med Care. 2009;47(7):S109–14.
 8.
Ravangard R, Arab M, Rashidian A, Akbarisari A, Zare A, Zeraati H. Comparison of the results of cox proportional hazards model and parametric models in the study of length of stay in a tertiary teaching hospital in Tehran. Iran Acta Med Iran. 2011;49(10):650–8.
 9.
Jain AK, Strawderman RL. Flexible hazard regression modeling for medical cost data. Biostatistics. 2002;3(1):101–18.
 10.
Manning WG, Basu A, Mullahy J. Generalized modeling approaches to risk adjustment of skewed outcomes data. J Health Econ. 2005;24:465–88.
 11.
Faddy M, Graves N, Pettitt A. Modeling length of stay in hospital and other right skewed data: comparison of phasetype, gamma and lognormal distributions. Value Health. 2009;12(2):309–14.
 12.
Griswold M, Parmigiani G, Potosky A, Lipscomb J. Analyzing health care costs: a comparison of statistical methods motivated by Medicare colorectal cancer charges. Biostatistics. 2004;1(1):1–23.
 13.
Basu A, Manning WG, Mullahy J. Comparing alternative models: log vs Cox proportional hazard? Health Econ. 2004;13:749–65.
 14.
Lee ET, Wang JW. Statistical methods for survival data analysis. 3rd ed. New Jersey: John Wiley & Sons; 2003.
Acknowledgements
This study is part of biostatistics MS degree thesis of Fatemeh Pourmotahari and it was supported by the Ahvaz Jundishapur University of Medical Sciences.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
ASM contributed to the study design, wrote and revised the manuscript. FP analyzed the data and drafted the manuscript. KAA revised the manuscript. All authors read and approved the final manuscript.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Malehi, A.S., Pourmotahari, F. & Angali, K.A. Statistical models for the analysis of skewed healthcare cost data: a simulation study. Health Econ Rev 5, 11 (2015). https://doi.org/10.1186/s1356101500457
Received:
Accepted:
Published:
Keywords
 Skewed data
 Generalized linear models (GLMs)
 Cox proportional hazard regression
 Ordinary least squares (OLS) model
 Transformation
 Healthcare cost
 Monte Carlo simulation