Confounding and missing data in cost-effectiveness analysis: comparing different methods

Introduction: Common approaches to cost-effectiveness analysis do not adjust for confounders. In nonrandomized studies this can bias the results. Parametric models such as regression models are commonly applied to adjust for confounding, but several issues need to be accounted for. The distribution of costs is often skewed, and a considerable proportion of observations can have zero costs, which simple linear models cannot handle well. Associations between costs and effectiveness usually cannot be explained by the observed background information alone, which also requires special attention in parametric modelling. Furthermore, in longitudinal panel data, missing observations are a problem even for nonparametric methods when cumulative outcome measures are used.

Methods: We compare two methods that can handle the aforementioned issues, in addition to the standard unadjusted bootstrap techniques, for assessing cost-effectiveness in the Helsinki Psychotherapy Study, based on five repeated measurements of the Global Severity Index (SCL-90-GSI) and direct costs during one year of follow-up in two groups defined by the Defence Style Questionnaire (DSQ) at baseline. The first method models cumulative costs and effectiveness using generalized linear models, multiple imputation and bootstrap techniques. The second method models the repeated measurements directly, using a hierarchical two-part logistic and gamma regression model for costs, a hierarchical linear model for effectiveness, and Bayesian inference.

Results: Adjustment for confounders attenuated the differences between the DSQ groups. The method based on Bayesian inference revealed an association between costs and effectiveness that was not explained by the observed covariates. Furthermore, the method demonstrated strong heteroscedasticity in the positive costs.

Conclusions: Confounders should be accounted for in cost-effectiveness analyses when the comparison groups are not randomized.
JEL classification C1; C3; I1

http://www.healtheconomicsreview.com/content/3/1/8

Background
Cost-effectiveness analyses are often based on comparing average costs with the effectiveness of the treatments, and on bootstrap methods [1]. Bootstrap methods have been shown to have good properties when compared with parametric methods [2][3][4]. A standard application of this approach does not adjust for possible confounding effects, which can have a considerable influence on the results not only in observational but also in randomized studies [5]. In randomized controlled trials, spurious differences in the distributions of background factors can emerge due to random variation, especially with small sample sizes. This can induce a need to adjust for confounding. Some statistical methods have been developed to estimate adjusted means using (generalized) linear models, but these methods seem to have seldom been applied in cost-effectiveness analyses. In the predictive margins approach [6,7], the mean of the individual predictions from a generalized linear regression model is calculated, which allows different scenarios to be compared by modifying the covariate values. In Bayesian inference, predictive distributions (e.g. [8]) can be applied for calculating adjusted means.
Longitudinal data have often been compressed by using different cumulative measures of the cost and effectiveness outcomes (e.g. the area under the curve, AUC). Using cumulative costs can also reduce both the number of zero-cost observations in repeated measurement data and the skewness, thereby reducing the problems related to the application of linear models.
A drawback of cumulative measures is that a missing value in the outcome variable at a given measurement point will result in missing values in the cumulative outcome values of the subsequent measurement points. Multiple imputation (MI) [9] and data augmentation [10] can be applied to deal with missing outcome values at single measurement points, so that all of the information for the observed outcome values can be utilised for the cumulative outcomes.
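This propagation of a single missing measurement into all subsequent cumulative values, and how imputing that one point repairs it, can be illustrated with a small sketch (all values hypothetical):

```python
import numpy as np

# Hypothetical GSI-like outcome for one subject over five measurement points;
# the value at the third point is missing.
y = np.array([1.2, 1.0, np.nan, 0.8, 0.7])

# A cumulative outcome propagates the single missing value forward:
cum = np.cumsum(y)
n_missing_cum = int(np.isnan(cum).sum())       # points 3, 4 and 5 are all lost

# After imputing the single missing point (a simple linear interpolation here,
# purely for illustration), every cumulative value is recovered.
y_imp = y.copy()
y_imp[2] = (y[1] + y[3]) / 2
cum_imp = np.cumsum(y_imp)
n_missing_cum_imp = int(np.isnan(cum_imp).sum())
```

One observed value is missing, yet three of the five cumulative values are lost; imputing the single point restores all of them.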
The distribution of costs is generally skewed, with a proportion of observations having zero costs and the remaining portion having positive costs. Direct application of linear regression models is not suitable for these kinds of data, because if the model is used to predict costs, the predictions could suggest negative costs or other unrealistic results in some scenarios. The usual method of reducing skewness using logarithmic transformation is not sensible when some of the study subjects have zero costs. Two-part models [11], for example, have been proposed for solving this problem.
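The logic of a two-part model can be sketched on simulated data: one part models the probability of incurring any cost, the other the mean of the positive costs, and their product gives the overall expected cost. All parameter values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Simulated costs: a point mass at zero plus a skewed (gamma) positive part.
positive = rng.random(n) < 0.7                        # Pr(C > 0) = 0.7
costs = np.where(positive, rng.gamma(shape=2.0, scale=500.0, size=n), 0.0)

# Part 1: probability of any cost; Part 2: mean of the positive costs.
p_pos = (costs > 0).mean()
mean_pos = costs[costs > 0].mean()

# The two parts combine to the overall expected cost:
expected = p_pos * mean_pos
```

The decomposition E[C] = Pr(C > 0) · E[C | C > 0] holds exactly in the sample, which is what makes the two parts estimable separately.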
Missing values may be present not only in the dependent variable but also in the independent variables in the regression model. These variables can be, for example, skewed or categorical. Multiple imputation of missing costs with possibly zero values is more complicated, and two-part models can be more useful. In these cases, methods which assume a multinormal distribution for the variables are not ideal. More suitable methods, which can handle the aforementioned complications, can be based on data augmentation and Bayesian inference and implemented, for example, by using the OpenBUGS software [12,13].
In the case of an observational study with nonintervention comparison groups based on, for example, a patient characteristic, the cost-effectiveness analysis describes the change in the average costs relative to the change in the average effectiveness. This information allows a decision-maker to plan more cost-effective interventions, as not only different patient-specific characteristics but also other baseline factors can be compared in terms of the standardized average cost differences.
The Helsinki Psychotherapy Study (HPS) is a randomized clinical trial comparing three therapy treatments [14]. In addition to comparisons of the randomized therapy groups, the data set can be utilized to conduct cost-effectiveness analyses by comparing groups defined by nonrandomized baseline factors, in which case the randomization no longer plays any part and confounding factors generally need to be adjusted for. In the present work, we compared two groups based on the Defence Style Questionnaire (DSQ) [15], resulting in the identification of several potential confounders.
The aims of the study were as follows: a) We handle confounding by applying predictive margins in frequentist inference and predictive distributions in Bayesian inference to produce adjusted means of costs and effectiveness, and their differences. b) We address missing data at single measurement points of a repeated measurement study by using multiple imputation and data augmentation techniques. c) We assess the unexplained associations between costs and effectiveness by applying Bayesian hierarchical models. d) We handle nonnegative costs by applying a two-part model, with a logistic regression model for the indicator of zero versus positive costs and a hierarchical gamma regression model for the positive costs, to avoid the unrealistic negative cost predictions that simple linear models can produce. Effectiveness is analysed using a hierarchical linear model.

Data
The Helsinki Psychotherapy Study recruited 326 outpatients, aged 20-46 years, from the Helsinki region over the period 1994-2000 [14]. A statement describing explicitly the ethical background to this study and the approval by the Helsinki University Central Hospital's ethics council can be found in [14] (p. 31). Of these patients, 101 were randomly assigned to short-term psychodynamic psychotherapy, 97 to solution-focused therapy and 128 to long-term psychodynamic psychotherapy. In the present study, we restrict our analyses to the former two groups, containing 198 patients who received short-term therapies. Of these patients, 7 refused to participate after being assigned to the treatment group and 21 discontinued their treatment. The cost-effectiveness analysis of the intervention groups has been reported elsewhere [16].
The measurement points (MP) at which the patients were measured were the baseline, and 3, 7, 9 and 12 months after the start of therapy. The Defence Style Questionnaire (DSQ) was dichotomized using the median value of 4.0 as the threshold. A total of 93 patients had DSQ < 4, 100 had DSQ ≥ 4, and 5 had a missing DSQ value.
The effectiveness measure was the Global Severity Index, SCL-90-GSI, psychiatric symptoms, hereafter abbreviated as GSI [17]. The cost variable of the incremental cost-effectiveness analysis was the direct costs resulting from psychiatric health problems (DCP) during the 12-month follow-up period. The DCP was the sum of seven cost items in euros (€), which were: the costs accruing from 1) the protocol-driven study of psychotherapy treatments, 2) auxiliary study treatment visits, 3) other psychotherapy sessions, 4) outpatient visits to physicians and to other health care personnel concerning mental health problems, 5) inpatient care in psychiatric hospitals or with psychiatric diagnosis, 6) psychotropic medication, and 7) travel costs due to therapy visits. All costs were included in the analysis regardless of the payer. Effectiveness was assessed at baseline and at 3, 7, 9 and 12 months after baseline. Cost data based on information obtained from patients by questionnaires covered periods 0-7 months and 8-12 months whereas cost data based on patient level registers covered periods 0-3, 4-7, 8-9 and 10-12 months. Table 1 presents descriptive statistics of the outcome variables.
The DSQ was associated with several baseline variables, which were also predictors of the effectiveness measure or the costs. The potential confounders in modelling both effectiveness and costs were gender, 'psychiatric diagnostic category on Axis I' (DSM-IV) [18], 'IIP-C, total score' [19], 'SOC, Sense of Coherence scale' [20] and 'SAS-SR, work subscale' [21] (Table 2). These potential confounders were chosen based both on a priori psychological judgement and on Kendall's τ with the DSQ (p-value smaller than 0.1) and with the AUC or the DCP (p-value smaller than 0.2) (data not shown). We also applied baseline adjustment of the effectiveness by adding the baseline GSI value to the effectiveness model [22]. Due to randomization, there was no association between the therapy group (the intervention) and the DSQ, and the therapy group was therefore not included in the models.

Models
The basic bootstrap method, which did not adjust for confounders, was compared with two model-based approaches, which adjust for confounders using regression models. The first model-based method was based on cumulative measures of costs and effectiveness as outcomes, multiple imputation, the bootstrap, generalized linear models and frequentist inference. The other method was based on hierarchical regression modelling of GSI and DCP at each MP using Bayesian inference. These methods are described below and referred to as the frequentist model and the Bayesian model, although neither is restricted to a particular inferential paradigm.

Multiple imputation and bootstrap methods
Multiple imputation [9] of the missing data was performed using the MI procedure of the SAS System 9.2 [23]. The numerical Markov chain Monte Carlo (MCMC) method of this procedure was chosen because the missing data pattern was not monotone. The MCMC method, in which the variables of the imputation model were assumed to be multinormally distributed, was applied separately in the two DSQ groups. Effectiveness (GSI) was imputed in the repeated measurement data, and after imputation the AUC was constructed from the imputed values using equation (4) given in the Appendix. The DCP were log-transformed in the MI. Auxiliary variables were not included in the imputation model because of convergence problems in the EM algorithm that the MI procedure uses to estimate initial values for the imputation model parameters.
In order to calculate the confidence intervals, the bootstrap method [1] was applied with 500 bootstrap samples. Multiple imputation was performed as a single imputation separately for each bootstrap sample [24].
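A minimal sketch of the percentile bootstrap for a group difference (the cost distributions and group sizes here are hypothetical; in the analysis itself a single imputation was run inside each bootstrap replicate):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical cumulative costs for two DSQ-defined groups.
group_lo = rng.gamma(2.0, 400.0, size=90)    # stands in for DSQ < 4
group_hi = rng.gamma(2.0, 550.0, size=100)   # stands in for DSQ >= 4

B = 500  # number of bootstrap samples, as in the paper
diffs = np.empty(B)
for b in range(B):
    lo = rng.choice(group_lo, size=group_lo.size, replace=True)
    hi = rng.choice(group_hi, size=group_hi.size, replace=True)
    # In the paper, a single multiple-imputation pass would be run here
    # on each bootstrap sample before computing the statistic.
    diffs[b] = hi.mean() - lo.mean()

ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
```

The 2.5th and 97.5th percentiles of the resampled differences give the 95% bootstrap confidence interval for the group difference.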

Models for cumulative outcomes with frequentist inference
The observed values of the cumulative costs variable were all positive in this case, and a gamma regression model was therefore applied as the analysis model for the cumulative costs. The AUC was modelled using a linear regression model.

Bayesian model for repeated measurement data
The Bayesian method [8] for costs was based on a two-part hierarchical model because there were a considerable number of zero costs in the MP-specific cost data. The costs of patient i at MP k were factored as a product of two terms: C_ik = C^Z_ik C^P_ik. The first term in the product had a value of one if the costs were positive between MPs k − 1 and k, and zero if no costs had been incurred. A logistic regression model for the binary outcome C^Z_ik was defined as

logit Pr(C^Z_ik = 1) = X_ik β^Z,     (1)

where X_ik denoted the row vector of the intercept, the confounders, MP k and the binary DSQ group indicator DSQ_i, and β^Z the corresponding regression coefficients, so that exp{β^Z} were the corresponding odds ratios (OR). The second term, the positive costs C^P_ik, was modelled in a similar fashion. Because the distribution of the positive costs was skewed, the gamma distribution was applied:

C^P_ik ∼ Gamma(τ_{DSQ_i,k} μ^P_ik, τ_{DSQ_i,k}),   log μ^P_ik = X_ik β^P + U^P_i,     (2)

with the Gamma(shape, rate) parameterization, so that E[C^P_ik] = μ^P_ik and Var[C^P_ik] = μ^P_ik / τ_{DSQ_i,k}.
The random effect U^P_i was individual-specific. The regression coefficients represented proportional changes in the expected value of the positive outcomes, i.e. a one-unit increase in a covariate corresponded to an exp{β}-fold change in the expected value. Possible heteroscedasticity was accounted for by allowing the dispersion parameter τ_{DSQ_i,k} (the inverse of the variance-to-mean ratio) to vary over the DSQ groups and the MPs. Note that if τ_{DSQ_i,k} = 1, the expected value of the positive costs equalled the variance, and if τ_{DSQ_i,k} > 1, the variance was smaller than the mean.
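The role of the dispersion parameter as the inverse of the variance-to-mean ratio can be checked numerically. The sketch below (arbitrary illustrative values) uses NumPy's shape-scale gamma parameterization, in which the variance-to-mean ratio equals the scale and the dispersion is therefore its reciprocal:

```python
import numpy as np

rng = np.random.default_rng(2)

# Gamma(shape=a, scale=s): mean = a*s, variance = a*s**2, so the
# variance-to-mean ratio is s and the dispersion tau = 1/s.
a, s = 4.0, 0.5
tau = 1.0 / s                 # tau = 2 > 1: variance smaller than the mean

x = rng.gamma(a, s, size=200_000)
vmr = x.var() / x.mean()      # empirical variance-to-mean ratio, approx s
```

With τ > 1 the simulated variance is indeed smaller than the mean, matching the interpretation given above.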
Effectiveness E_ik was modelled using a hierarchical linear model:

E_ik = X_ik β^E + U^E_i1 + U^E_i2 t(k) + ε_ik,     (3)

where t(k) is the follow-up time at MP k. An individual linear trend was assumed, modelled by the random-effects part U^E_i1 + U^E_i2 t(k). The prior distributions of the regression coefficients β and of the residual error terms ε_ik were N(0, 100) and N(0, σ²_{E,DSQ_i,k}), respectively. The variance parameters σ²_{E,·,·} and the dispersion parameters τ_{·,·} were distributed a priori as InverseGamma(3, 1) and Gamma(3, 1), respectively.
The possible associations between the costs and the effects across measurement points were accounted for using the hierarchical models. The prior distribution of the three random effects was

(U^P_i, U^E_i1, U^E_i2)' ∼ N(0, Σ),   Σ ∼ InvW_10(I_3),

where InvW_10 is the inverse Wishart distribution with 10 degrees of freedom and I_3 the 3 × 3 identity matrix. Thus the correlations between the random effects were zero a priori, and the costs and effectiveness were therefore assumed to be independent given the observed background information and the random effects. The correlation coefficients of the random-effects distribution were defined as ρ_jl = Σ_jl / (Σ_jj Σ_ll)^{1/2}. The details of the model specification, estimation, data augmentation and model assessment are described below.

Mathematical description of the model Notations
Let i ∈ {1, 2, . . . , n} =: I index the n = 198 subjects and k = 1, 2, . . . , K := 5 the measurement points. Let time t(1) ≡ 0 denote the baseline and t(K) = 12 the duration of the follow-up in months. As there was some individual variation in the times when the actual measurements were made, let t*_i(k) denote the actual kth measurement time, at which the outcomes C_ik (DCP during (t*_i(k − 1), t*_i(k)]) and E_ik (GSI) of subject i were recorded.
The cumulative cost variable was defined as the sum C^Cum_i := Σ_{k=2}^{K} C_ik. The effectiveness AUC was defined using the trapezoidal rule:

E^AUC_i := Σ_{k=2}^{K} (t*_i(k) − t*_i(k − 1)) (E_ik + E_{i,k−1}) / 2.     (4)

The incremental cost-effectiveness ratio was defined as

ICER := (C̄^Cum_{DSQ≥4} − C̄^Cum_{DSQ<4}) / (Ē^AUC_{DSQ≥4} − Ē^AUC_{DSQ<4}),     (5)

where C̄^Cum_· and Ē^AUC_· are the DSQ-group-specific means of the cumulative costs and the effectiveness AUC, respectively.
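As a numerical illustration of these definitions (all values hypothetical; the trapezoidal rule is used for the AUC):

```python
import numpy as np

# Measurement times (months) and hypothetical GSI values for one subject.
t = np.array([0.0, 3.0, 7.0, 9.0, 12.0])
gsi = np.array([1.1, 0.9, 0.8, 0.7, 0.6])

# Effectiveness AUC via the trapezoidal rule over the follow-up.
auc = float(np.sum(np.diff(t) * (gsi[1:] + gsi[:-1]) / 2.0))

# ICER from (hypothetical) group means of cumulative costs and AUCs.
mean_cost = {"lo": 5200.0, "hi": 5500.0}
mean_auc = {"lo": 8.0, "hi": 9.5}
icer = (mean_cost["hi"] - mean_cost["lo"]) / (mean_auc["hi"] - mean_auc["lo"])
```

The ratio is only meaningful when the AUC difference in the denominator is well away from zero, a point that matters for the results reported later.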

Cumulative outcomes
The model adjustment for controlling confounding was based on the ideas of Lee [6]. The individual predictions, based on the parameter values, the covariate values and the expected value of the gamma regression model, were

Ĉ_i = exp{X^C_i β^C},     (6)

where X^C_i denotes the row vector of the covariates and β^C the corresponding column vector of the regression coefficients.
The analysis model for the effectiveness contained the group DSQ_i and the confounders. A linear regression model was applied, and the individual predictions were

Ê_i = X^E_i β^E.     (7)

The predicted margin [7] was the average of the predictions (6) with the DSQ indicator of every subject set to the value x_DSQ:

PM_C(x_DSQ) = (1/n) Σ_{i∈I} Ĉ_i(x_DSQ).     (8)

PM_E(x_DSQ) was defined as in (8), but using the individual predictions defined in (7). The adjusted ICER was calculated using the predictive margins PM_E(·) and PM_C(·):

ICER_PM = (PM_C(DSQ≥4) − PM_C(DSQ<4)) / (PM_E(DSQ≥4) − PM_E(DSQ<4)).     (9)
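A minimal sketch of the predictive-margins calculation for a linear model, using simulated data with a hypothetical binary group and a single confounder (none of these variables come from the study itself):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400

# Simulated data: group membership depends on a confounder z, and the
# outcome depends on z but NOT on the group (true group effect = 0).
z = rng.normal(size=n)
group = (rng.random(n) < 1.0 / (1.0 + np.exp(-z))).astype(float)
y = 0.5 + 0.0 * group + 0.8 * z + rng.normal(scale=0.5, size=n)

# Fit a linear model with intercept, group indicator and confounder.
X = np.column_stack([np.ones(n), group, z])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predictive margins: average predictions over ALL subjects with the group
# indicator set to 0 and then to 1, keeping each subject's own z.
X0, X1 = X.copy(), X.copy()
X0[:, 1], X1[:, 1] = 0.0, 1.0
pm0, pm1 = (X0 @ beta).mean(), (X1 @ beta).mean()

unadjusted = y[group == 1].mean() - y[group == 0].mean()  # confounded by z
adjusted = pm1 - pm0                                      # near zero
```

For a linear model with an identity link the margin difference coincides with the group coefficient; with a log link (as in the gamma model above) the margin must be computed by averaging the exponentiated predictions, which is precisely why the predictive-margins machinery is needed.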

Repeated measurements
In the Bayesian model, the posterior distribution was proportional to the product of the likelihood terms based on equations (1), (2) and (3), the joint density of the random effects and the joint density of all model parameters (denoted here by θ):

p(θ, U | data) ∝ L^Z L^P L^E p(U | Σ) p(θ).     (10)

The predictive distribution of the outcomes for a set I* of hypothetical subjects i* was defined as

p(C_{i*k}, E_{i*k} | data) = ∫ p(C_{i*k}, E_{i*k} | θ, U) p(θ, U | data) dθ dU.     (11)

Predictive distributions (11) were applied to calculate posterior predictive expectations and quantile points of functionals such as PM_C(x_DSQ) in (8) or the ICER_PM in (9).

Estimation
The frequentist regression analyses were performed using the SAS System 9.2 [23] procedures Genmod and Mixed for the cumulative costs and the AUC, respectively. The Bayesian analyses were conducted using the OpenBUGS software [13], which applies MCMC methods. The data management and the predictive distributions were handled using the R software [25].

Data augmentation to handle missing data
Predictive distributions (11) were also applied in the data augmentation procedures to handle missing outcome values. The corresponding predictive distribution for a missing baseline covariate value X_ij, which belongs to at least one of the covariate vectors X^Z_ik, X^P_ik or X^E_ik, was defined analogously. The prior distribution for the dichotomized baseline DSQ was defined as Bernoulli(p_DSQ), where the hyperprior for p_DSQ was chosen to be Uniform(0, 1). The other baseline covariates X_ij with missing values were continuous, and they were assigned N(μ_j, 1/τ_j) priors with hyperpriors μ_j ∼ N(0, 1000) and τ_j ∼ Gamma(2, 1). As the number of missing values in these covariates was small, we chose not to elaborate the prior distributions further.

Convergence checks of MCMC and assessment of model assumptions
Two parallel chains were simulated with 60,000 iterations in addition to 10,000 burn-in iterations in both chains. The chains were thinned by a factor of 15, resulting in 4,000 sampled values per chain. Autocorrelations vanished quickly, which suggests good convergence (data not shown). The Bayesian model was assessed using Q-Q plots of standardised predictive errors (SPE). A total of 40 subjects, denoted by the set I*, were excluded from the data, and the predictions for these 40 subjects were calculated based on their baseline information and all observed information on the remaining subjects in I \ I*. The SPE was defined as

SPE_{i*k} = (E^Obs_{i*k} − E[E_{i*k} | data]) / SD(E_{i*k} | data),

where E^Obs_{i*k} denotes the observed value of the effectiveness GSI. A similar approach was applied for the positive costs C^P_{i*k}.
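The SPE calculation can be sketched as follows, with simulated posterior-predictive draws standing in for the MCMC output (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)

# Posterior-predictive draws for 40 held-out observations:
# rows = MCMC draws, columns = held-out subjects.
draws = rng.normal(loc=1.0, scale=0.3, size=(4000, 40))
observed = rng.normal(loc=1.0, scale=0.3, size=40)

# Standardized predictive error: (observed - predictive mean) / predictive SD.
spe = (observed - draws.mean(axis=0)) / draws.std(axis=0)

# For a well-calibrated model the SPE should look standard normal; the paper
# inspects this via Q-Q plots, here we just check the mean and SD.
spe_mean, spe_sd = spe.mean(), spe.std()
```

Systematic departures of the SPE from the standard normal (e.g. all values negative at a measurement point) indicate the kind of misfit reported for the later follow-up intervals.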

Results
The adjusted absolute differences of the cumulative costs (-161 and -166 based on the frequentist and Bayesian methods, respectively) appeared to be smaller than the unadjusted difference of -309 based on the bootstrap method (Table 3). The unadjusted difference was weakly significantly different from zero (P-value 0.094), whereas the corresponding adjusted difference based on the frequentist model was not significant (P-value 0.432). The Bayesian method, which modelled the measurement points separately, appeared to produce slightly lower predictive cumulative cost means than the frequentist predictive margins method. The estimates of the difference were close to each other.
The AUC differences for effectiveness were highly significant for the unadjusted bootstrapped difference (-0.38), but the adjusted mean -0.04 based on predictive margins was not significant and close to zero (P-values 0.000 and 0.566, respectively), whereas the predictive mean difference based on the Bayesian inference produced a much smaller difference of -0.03 with credible interval (CrI) (-0.27, 0.20) ( Table 3).
The unadjusted ICER estimate was slightly higher than the adjusted estimate (Table 3). The standard errors, however, were large. The unadjusted ICER was weakly significant (P-value 0.098), but the adjusted ICER was not significant (P-value 0.772). The adjusted ICER and the posterior expectation of the ICER were not applicable, because the denominator of the ICER (the effectiveness difference) was near zero, in which case the ICER was undefined [2].

[Table 3. The observed and model-based estimates of cumulative costs and effectiveness measured by the area under the curve (AUC). a The observed group means, their standard deviations (SD) and the group differences are not adjusted for confounders; 95% confidence intervals (CI) and tests were based on the t-test. b The bootstrapped group means and differences, and their standard errors (SE), are based on multiply imputed and bootstrapped data but unadjusted for confounding. c The adjusted group means and their differences are based on regression modelling; the rows marked adjusted mean and SE are based on predictive margins, frequentist inference, bootstrap and multiple imputation. d The rows denoted predictive means and SD are based on the hierarchical Bayesian model: posterior predictive group means and their differences, the corresponding SDs and 95% credible intervals (CrI). e NA corresponds to not applicable results. f P-values correspond to the null hypothesis "no difference between groups" or "ICER equals zero". g P-values are generally not sensible in Bayesian inference, thus they are not reported.]
The cost-effectiveness plane shows that in the group where DSQ ≥ 4 both the costs and the symptoms were on a higher level than in the group where DSQ < 4 ( Figure 1).
The regression models defined above contain a large number of parameters, but they were merely a technical utility, as the objective was to adjust the average costs and effectiveness for possible confounding. In the following, we therefore focus on a limited number of parameters of the Bayesian model.
There were significant correlations between the average levels of individual costs and effectiveness over time (posterior expectation E[ρ_{1,2} | data] = 0.32 with CrI (0.00, 0.57)) and between the average level and the individual slope of effectiveness (E[ρ_{2,3} | data] = −0.17, CrI (−0.33, −0.00)), which cannot be explained by the background information included in the regression models. The correlation between the intercept of the positive costs and the slope of effectiveness, however, was practically zero (E[ρ_{1,3} | data] = −0.03).
The dispersions of the positive costs showed considerable variation over the measurement points, but the residual variances of effectiveness were close to each other (Table 4). During the first measurement interval the positive costs varied relatively little (E[τ_{·,2} | data] were large) because the majority of patients took the study treatments, each of which lasted approximately six months and was covered by the first two measurement intervals. Some patients took auxiliary therapies or other psychiatric treatments both during and after the study therapies, which resulted in higher costs, whereas other patients took no auxiliary treatments, which resulted in lower costs. This resulted in greater individual variation during the last measurement interval (E[τ_{·,5} | data] were small).
The posterior predictive checks based on the Q-Q plots showed that the model for costs fitted well for the intervals 0 to 3 months and 3 to 7 months, but after that the predictions were too low. The model for effectiveness did not fit well at the last two measurement points (Figures 2 and 3).

Discussion
This paper compared the standard bootstrap-based method in a cost-effectiveness analysis with two model-based methods based on frequentist and Bayesian inference. The skewed, non-negative distribution of the cost variables was analysed using gamma regression and two-part models, and model adjustment for potential confounding was performed using the predictive margins and predictive distributions in the frequentist and the Bayesian inferences, respectively.

[Table 4. The parameters controlling the variance of effectiveness and the dispersion of the positive costs in the Bayesian model. a Parameters σ²_{E,·,·} control the variance of the effectiveness measure, and parameters τ_{·,·} (inverse of variance-to-mean ratio) the dispersion of the positive costs in the Bayesian model. Parameter θ is a generic notation representing one of the parameters σ²_{E,·,·} or τ_{·,·}. b The point estimates are posterior expectations E[θ | data]; 95% credible intervals (CrI) are also presented.]
The benefit of using the predictive margins method or the predictive distributions of Bayesian inference to calculate the adjusted averages of costs and effectiveness is that link functions other than the identity can be used in the regression modelling. For example, Nixon and Thompson [5] applied identity link functions both for the gamma distribution of the costs and for the normal distribution of the effectiveness. The averages of the covariates over all individuals, multiplied by the corresponding regression coefficients, were restricted to zero so that the intercept terms could be interpreted as the group averages. There is no need to make such restrictions, or to stay with the identity link function, when using predictive margins or Bayesian predictive distributions. Other link functions commonly used in generalized linear models can be applied with these methods as well, thus avoiding the possibility of negative linear predictors (and predicted values) in the case of the gamma distribution.
Both methods based on the regression models could have been applied to the repeated measurements using either frequentist or Bayesian inference. However, the prevalence of zero costs during the first two follow-up periods was very low, so fitting a logistic regression model with frequentist inference in the two-part model was not feasible. Furthermore, application of a joint hierarchical model would have been difficult because effectiveness and positive costs were modelled using different families of distributions, the normal and gamma distributions, respectively. The Bayesian method avoided the numerical instabilities by using informative prior distributions for the regression coefficients, and the missing data values were augmented during the MCMC simulation in a straightforward manner.
Bayesian model averaging has been proposed to handle the skewness of cost variables [26] and to select important predictors [27], but simultaneous handling of skewness, heteroscedasticity and adjustment for confounding factors can be challenging. We modelled the heteroscedastic residual variance of the cost variables in the case of the repeated measurement data, which allows the model to adapt to the distribution of the costs. The posterior predictive checks suggested that the model fit was good for the early follow-up intervals.
The predictive margins approach for the cumulative outcomes was flexible, and the procedure described in this paper can be extended to the case of zero costs and two-part models.
Olsen and Schafer [28] introduced a model with random effects for both parts of the two-part model for costs. They reported that the random effects were usually correlated. In our study, however, the sample size was considerably smaller and the proportion of zero costs was very low during the first seven months of follow-up. Therefore, separate random effects for the zero costs (and a possible positive correlation with the random effect(s) of the positive costs) were not applied.
Potential extensions of the model include, for example, using a higher order random effects model for positive costs and adding a random effect to the logistic model of zero costs. These extensions would, however, require more measurements or more individuals. Another extension of the model would be to handle the skewed distribution of the effectiveness outcome, especially at the later stages of the follow-up by using, for example, the skewed normal distribution [29].
The patient groups were not defined by (randomized) intervention groups, as is the case in most cost-effectiveness analyses, but by the DSQ, which complicates the interpretation of the ICER statistic. The DSQ is a rather stable characteristic of a patient, which cannot be altered by a researcher, whereas interventions in general can be altered. The interpretation of the ICER statistic is the change in average costs per one-unit change in the average effectiveness measured in terms of the AUC. It is also important to bear in mind that in this work large values of the effectiveness outcome GSI correspond to more severe symptoms (fewer benefits), whereas in commonly used effectiveness outcomes such as quality-adjusted life years (QALY) large values correspond to greater benefits. Therefore the usual interpretation of the quadrants of the cost-effectiveness plane [30] is reversed with respect to the vertical axis.

[Figure: panels of costs over the measurement intervals 0-3, 3-7, 7-9 and 9-12 months. The measurements with zero costs were excluded.]

In this case the unadjusted ICER estimates were positive, indicating that in the group DSQ < 4 the effectiveness was better and the costs were lower. Our results allow a decision-maker to assess the importance of patient characteristics such as the DSQ in this study. If a patient in the group DSQ ≥ 4 became similar to a patient in the group DSQ < 4, the costs would decrease by 816 euros (the bootstrapped point estimate) per one-unit decrease in the AUC based on the GSI, on average. However, the adjusted estimates indicate that not only the effectiveness difference but also the cost difference vanished, thus it is not reasonable to present such a standardized estimate of the change in costs as the ICER.
Limitations of the proposed methods are mainly related to the various modelling assumptions, but there are few alternatives to parametric models for adjusting for confounders. For example, the effectiveness measure was nonnegative, but the model was based on normality assumptions, so predicted effectiveness values can be negative. The poor fit of the effectiveness model was likely due to the skewed distribution, which a model based on the normal distribution does not account for. SCL-90-GSI can take only positive values, and the reduction in symptoms caused a considerable proportion of observations to lie close to zero. Linearity assumptions and possible interactions were not tested, but could be examined in future work.

Conclusion
Our paper demonstrates how to combine several methods for performing cost-effectiveness analyses in observational studies, which are often subject to the effects of confounding and missing data. Our results based on regression modelling confirmed that there was a need to adjust for the confounders in this study; thus the standard unadjusted methods based on the bootstrap were not adequate. The unadjusted methods showed significant differences between the groups, but the adjustment for confounders removed the significant differences, thus yielding different conclusions. Not all associations between the costs and effectiveness could, however, be explained by the observed confounders alone; correspondingly, the hierarchical model showed clear non-zero correlations between the random effects. The OpenBUGS code is available from the corresponding author upon request.