Do health economic evaluations using observational data provide reliable assessment of treatment effects?

Economic evaluation in modern health care systems is seen as a transparent scientific framework that can be used to advance progress towards improvements in population health at the best possible value. Despite the perceived superiority that trial-based studies have in terms of internal validity, economic evaluations often employ observational data. In this review, the interface between econometrics and economic evaluation is explored, with emphasis placed on highlighting methodological issues relating to the evaluation of cost-effectiveness within a bivariate framework. Studies that satisfied the eligibility criteria exemplified the use of matching, regression analysis, propensity scores, instrumental variables, as well as difference-in-differences approaches. All studies were reviewed and critically appraised using a structured template. The findings suggest that although state-of-the-art econometric methods have the potential to provide evidence on the causal effects of clinical and policy interventions, their application in economic evaluation is subject to a number of limitations. These range from no credible assessment of key assumptions and scarce evidence regarding the relative performance of different methods, to lack of reporting of important study elements, such as a summary outcome measure and its associated sampling uncertainty. Further research is required to better understand the ways in which observational data should be analysed in the context of the economic evaluation framework.

A potential selection bias may also have been introduced by limiting the sample to only those patients who did not start another treatment within the first 60 days after initial treatment, which may account for sicker, less-stable patients in the sample.
Coyte et al. Propensity Score Analysis

Costs and Effects
Multiple regression analysis for some costs. Logistic regression with two-way interaction terms employed to evaluate propensity scores. Stratification by propensity scores followed. Individual pairwise comparisons for multiple treatments.
Not stated. SAS 6.11. None reported. Authors stated that their results complement a recent national study.
While several alternative analyses were conducted to control for potential bias in the assignment of patients to various discharge destinations, the possibility that the adjustments were deficient in some respects could not be ruled out. A strong point of the study was the use of propensity score matching. Propensity score models cannot adjust for inadequately measured or unmeasured covariates. It is possible that unmeasured factors were the actual cause of the mortality benefit and not the ICDs themselves. The method of selecting controls was biased, by design, toward inclusion of patients who were "healthier" than typical device recipients. Qualitative, with systematic reviews, meta-analyses and trials for costs and health outcomes. The natural experiment with patient matching, but without patient choice, addresses the important problem of selection bias. Use of time series data and fixed effects multiple regression allowed for correction for time trends between the groups and for unmeasured differences between the individuals in the two groups. Unadjusted and propensity-score-based regression-adjusted, matched and stratified estimates. All four approaches led to the same conclusion. However, the estimates obtained after adjustment were considerably different from those from the unadjusted analysis. Acknowledgement of limitations of propensity score analysis based on administrative data and the selection on observables assumption.
followed by (1) stratification in 5 strata based on propensity score, (2) nearest neighbour 1:1 matching within a calliper of 0.25 standard deviations of the propensity score, or (3)  Effort was made in the analysis to take into account all three mechanisms operating in a person who defers HAART in an observational setting, with selection bias potentially being one of those. Propensity score analysis eliminated imbalances.
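The 1:1 nearest-neighbour matching within a calliper mentioned above can be illustrated with a minimal sketch. It is not the reviewed authors' implementation: the function name `caliper_match`, the greedy matching order, and the choice to measure the calliper on the raw propensity score pooled across both groups are all assumptions made for illustration.

```python
from statistics import stdev


def caliper_match(treated, controls, caliper_sd=0.25):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    treated, controls: dicts mapping a unit id to its propensity score.
    caliper_sd: calliper width in standard deviations of the pooled scores
                (assumption: raw score, not its logit).
    Returns a list of (treated_id, control_id) pairs.
    """
    pooled = list(treated.values()) + list(controls.values())
    caliper = caliper_sd * stdev(pooled)
    available = dict(controls)  # controls not yet used
    pairs = []
    # Assumption: treated units are processed in ascending score order.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control on the propensity score.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control is matched at most once
    return pairs


pairs = caliper_match({"t1": 0.30, "t2": 0.80},
                      {"c1": 0.32, "c2": 0.50, "c3": 0.10})
```

In this toy input the treated unit with score 0.80 has no control within the calliper and is left unmatched, which shows how a calliper trades sample size for closer matches.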
Separate logit models for propensity score. Some variables were transformed by taking, respectively, the square root and the log base 10 to correct the skewed distributions of these variables.
Stratification based on propensity scores in 4-5 strata. For consequences: Cox proportional hazards models stratified by propensity score blocks. Costs were computed as weighted sums of the differences between sample mean annual costs by treatment status within each propensity score block, with weights equal to the proportion of observations falling in each block. There is considerable selection bias in the observational data that diminishes as the selection correction methods are applied. Results using regression and propensity score analysis were similar but there were large differences with the instrumental variable approach. Either hidden bias is very important or the instruments used were weak. There is potential for bias in the estimates of treatment effects because of endogeneity in treatment selection. The use of a polychotomous selection model to explore the issue of endogeneity in the frequentist framework found evidence of positive sample selection bias.
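The block-weighted cost calculation described above (differences in mean costs within each propensity score block, weighted by the share of observations in that block) can be sketched as follows. The function name, the equal-sized blocks, and the decision to drop and renormalise over blocks that lack either treated or untreated observations are assumptions for illustration, not details reported by the studies.

```python
from statistics import mean


def stratified_cost_difference(records, n_strata=5):
    """Weighted sum of within-block mean cost differences.

    records: list of (propensity_score, treated_flag, cost) tuples.
    Blocks are equal-sized groups of observations ordered by propensity score.
    """
    ordered = sorted(records, key=lambda r: r[0])
    n = len(ordered)
    blocks = []
    for k in range(n_strata):
        block = ordered[k * n // n_strata:(k + 1) * n // n_strata]
        treated = [cost for _, flag, cost in block if flag]
        control = [cost for _, flag, cost in block if not flag]
        if not treated or not control:
            continue  # assumption: skip blocks with no within-block contrast
        blocks.append((len(block), mean(treated) - mean(control)))
    if not blocks:
        raise ValueError("no block contains both treated and untreated units")
    used = sum(size for size, _ in blocks)
    # Weight each block difference by its share of the retained observations.
    return sum(size / used * diff for size, diff in blocks)


example = [
    (0.10, 1, 100), (0.15, 0, 80), (0.20, 1, 120), (0.25, 0, 90),
    (0.60, 1, 300), (0.70, 0, 200), (0.80, 1, 320), (0.90, 0, 220),
]
incremental_cost = stratified_cost_difference(example, n_strata=2)
```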
Regression analysis with polychotomous sample selection was used. Two-stage estimation procedure: multinomial logit model for factors associated with selection and a linear regression with the Mills ratio in the net benefit regression. Time periods were taken into consideration in the analysis through interactions. Evaluation at three levels of willingness to pay values.
40 Shireman, Braman Propensity Score Analysis

Costs and Consequences
Logistic regression for the propensity score.

Not stated
Confidence interval for odds of hospital admission, p-values for length of stay and costs differences.
Results concur with clinical trials for hospitalisations. Results also in line with most modelling studies. For the latter ranges provided.
Propensity score matching eliminated most of the differences. Authors acknowledge limitations of this approach with respect to unobserved bias.
Stratification of treated cases in 5 groups based on the propensity score. 1:1 matching within groups followed. Logistic regression for probability of any RSV admission (controlling for the predicted propensity score). Multivariate regression for difference between the treated and untreated groups' RSV inpatient lengths of stay and costs, controlling for the predicted propensity score. Traditional methods of analysis are not adequate when it comes to assigning treatment effects to the drugs taken by patients when there is a tendency for them to switch their medication frequently. Epoch analysis addresses this issue and is flexible enough to incorporate current methods to address the modelling of skewed cost data, selection bias and sampling and decision-making uncertainty.
NMB: Net monetary benefit, ICER: Incremental cost-effectiveness ratio, CEAC: Cost-effectiveness acceptability curve, ATE: Average treatment effect, DiD: Difference-in-differences
Authors note that it is difficult to compare their results with those of other studies. They also note limitations with regard to the use of claims data, particularly the use of proxy measures that can cause bias due to misclassification of the explanatory variables.
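As a reminder of how the summary measures abbreviated above relate to each other, the following sketch computes an ICER, an incremental NMB at a given willingness to pay, and the points of a CEAC from a set of resampled incremental costs and effects. The function names and the toy numbers are hypothetical; the formulas are the standard definitions, not results from any reviewed study.

```python
def icer(delta_cost, delta_effect):
    """Incremental cost-effectiveness ratio: extra cost per unit of extra effect."""
    if delta_effect == 0:
        raise ValueError("ICER is undefined when the incremental effect is zero")
    return delta_cost / delta_effect


def nmb(delta_cost, delta_effect, wtp):
    """Incremental net monetary benefit at willingness to pay `wtp`."""
    return wtp * delta_effect - delta_cost


def ceac(draws, wtp_values):
    """Share of draws with positive incremental NMB at each willingness to pay.

    draws: list of (delta_cost, delta_effect) pairs, e.g. bootstrap replicates.
    """
    return {
        wtp: sum(nmb(dc, de, wtp) > 0 for dc, de in draws) / len(draws)
        for wtp in wtp_values
    }


# Hypothetical replicates of (incremental cost, incremental effect).
draws = [(5000, 0.25), (4000, 0.1), (6000, 0.5), (1000, -0.1)]
curve = ceac(draws, [10000, 30000, 50000])
```

Evaluating the curve at several willingness-to-pay values mirrors the practice, noted for one study above, of reporting results at three such levels.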
Descriptive. Assessment of effectiveness using different definitions. In specifications 1-6, the sample was split by diagnosis. In 2-4, larger numbers of covariates were used.
In 5-7 propensity score matching was used.

None reported.
A systems cost-effectiveness framework was used. Difference-in-difference appropriate for analysis at an aggregate level. Lagged components to account for changes in number of providers or their practices over time were not included due to lack of data, but an interaction term between data wave and managed care was included. The mean balance of the covariates, the propensity score distribution and the type of matching performed were reported. Baseline comparability of the managed care and non-managed care cohorts was reported only with respect to treatment and its success and treatment costs.
Difference-in-difference controls for baseline differences in regression analyses and exogenous changes over time. For potential imbalance in unobserved variables, propensity scores were used to match observations in the experimental and control regions on observables. The propensity score is the likelihood an observation came from an experimental region.
3 Barnette, Swindle Random-effects models treating the intercept as a random variable whose variation is explained by programme characteristics account for the correlation of patients within programmes.
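The difference-in-differences comparison described above can be illustrated in its simplest two-group, two-period form. This is a minimal sketch with hypothetical names and toy data; it omits the covariates, fixed effects and propensity score matching that the reviewed studies combined with the approach.

```python
from statistics import mean


def did_estimate(rows):
    """Two-group, two-period difference-in-differences on cell means.

    rows: list of (experimental_region, post_period, outcome) with 0/1 flags.
    Returns (change in experimental group) minus (change in control group).
    """
    def cell(group, period):
        values = [y for g, p, y in rows if g == group and p == period]
        if not values:
            raise ValueError(f"no observations for group={group}, period={period}")
        return mean(values)

    change_experimental = cell(1, 1) - cell(1, 0)
    change_control = cell(0, 1) - cell(0, 0)
    return change_experimental - change_control


rows = [
    (1, 0, 10), (1, 0, 12), (1, 1, 18), (1, 1, 20),  # experimental region
    (0, 0, 8), (0, 0, 10), (0, 1, 11), (0, 1, 13),   # control region
]
effect = did_estimate(rows)
```

The same quantity is the coefficient on the group-by-period interaction in a linear regression with group and period indicators, which is how an interaction term between data wave and managed care would enter a regression specification.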
Descriptive. Cost and effectiveness models using a different survey-based definition of staffing intensity and cost.

None reported.
Patients were shown to be comparable in terms of the severity of illness index.

#; Study; Justification for method; Justification for specification; Alternative specifications; Tests; Comments
4 Blanchette et al. (2008) The use of propensity score matching was justified on the grounds of small sample size.
Wilcoxon rank sum tests (continuous variables) and χ² tests (categorical variables) for differences in baseline characteristics.
The results based on the regression and propensity score matching were similar.
5 Cakir et al.
Matching used to make groups comparable in important characteristics without knowledge of outcomes.
Variables used for matching were based on previous literature.

None reported.
Mann-Whitney for differences in continuous variables, Fisher's exact test for categorical (two-tailed).
Groups were mostly balanced after matching was performed.
Natural, flexible way of modelling clinical progression and cost accumulation.
Choice of covariates using a backward elimination approach.
Sub-group analysis for the incremental net benefit (not presented). Regression methods combined with decision analytic modelling can lead to more robust analysis but also incorporate additional assumptions. A feature of the semi-Markov model is that it explicitly considers the time spent in each state, in contrast to the Markov model, which has a single timescale, the time from entry into the study. This assumption is relevant in the setting of cost studies. Distribution of covariates in two arms not equal. Also normality assumed for costs.
7 Chen et al.
Functional outcomes and costs among patients of different types of PAC were not directly comparable due to possible selection bias.
Qualitative discussion of the covariates included in the equations.
Ordinary least squares regressions for costs and health outcomes on identified homogenous subgroup of patients.
Scheffé and χ² tests. Several specification tests were conducted to test the instrumental variable analysis assumptions.
Authors provided a comprehensive justification regarding the outcome measure used (instead of QALYs). Specification tests provided evidence on the validity of the instruments used. Another selection adjustment technique was used to verify the results and the authors stated that the findings were consistent. Authors stated that they addressed uncertainty for both costs and consequences but the approach used is superseded by more valid methods in the current literature. Authors defended the use of calculating confidence intervals instead of traditional sensitivity analysis. For multiple comparators, the authors used the coefficients estimated from the multinomial logit equation to adjust for selection effects in the ordinary least squares regression model for functional outcomes and costs. Propensity score matching to assure similarities between demographic and prior disease characteristics.
Descriptive. None reported. Categorical variables compared using χ² analysis or the Fisher exact test. Unpaired t-test to compare continuous data.
A sample size calculation based on a trial revealed that 54 patients in each group would be required to detect differences with 80% power. Post-match balance of means was reported. The size of the groups compared provides low statistical power to detect significant differences in some of the outcomes.
Study addresses an important question which would be unethical to assess using a randomised controlled trial.
Descriptive sometimes backed up with literature references.
None reported.
Categorical variables compared using χ² analysis or the Fisher exact test. t-tests and ANOVA to compare continuous data.
To estimate the treatment costs and outcomes for the entire patient population, weighted sums of the stratum-specific results were calculated, using standard methods for stratified sampling. Multiple treatments were taken into account using a propensity score for different pairs. Authors claim that this allows different propensity score models for different comparisons. Nevertheless, results obtained may refer to different sub-populations. Groups exhibited some differences in patient characteristics, but authors note that these are unlikely to affect the final results. Authors attempted to use instrumental variable analysis but no suitable instruments were available. Non-significance of interaction effects potentially due to the small sample size. Authors noted that it was unnecessary to calculate confidence intervals for the net-benefit regression framework because the results for all parameters in the model are significant. They also noted that selection is a more important issue for effects than for costs because physicians care less about costs. Authors justified the use of EQ-5D to calculate QALYs by stating that a literature review suggests that it is sensitive in detecting changes in quality of life when considering patients with schizophrenia.
13 Dhainaut et al. (2007) Incomparability of the groups in terms of resource use and hence of costs in the initial cohort.
Descriptive. None reported. Standardized differences in each baseline variable between the two groups.
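Standardized differences, the balance diagnostic named above, compare group means in units of a pooled standard deviation and, unlike significance tests, do not depend on sample size. The sketch below uses the common formula for a continuous covariate; the function name and toy data are hypothetical, and the 0.1 threshold mentioned in the comment is a widely used rule of thumb rather than something reported by the study.

```python
from math import sqrt
from statistics import mean, variance


def standardized_difference(x_treated, x_control):
    """Standardized mean difference for one continuous baseline covariate.

    Uses the square root of the average of the two sample variances as the
    denominator. Absolute values below about 0.1 are often read as
    acceptable balance (rule of thumb, not a formal test).
    """
    pooled_sd = sqrt((variance(x_treated) + variance(x_control)) / 2)
    if pooled_sd == 0:
        raise ValueError("covariate has no variation in either group")
    return (mean(x_treated) - mean(x_control)) / pooled_sd


# Hypothetical ages in two groups before matching.
d_age = standardized_difference([60, 62, 64], [58, 60, 62])
```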
Sample size was designed for cost comparisons. As a result, the study is underpowered to deal with effectiveness issues. Post-match balance was reported. Sample size consisted of 522 patients before matching (151, 176 and 195 in the three groups) and 453 patients after matching (151 in each group). The non-parametric Kolmogorov-Smirnov (KS) test is more appropriate given the highly non-normal distribution of the cost data. Post-match covariate balance was reported. Genetic matching does not rely on parametric assumptions such as assuming that the baseline costs are normally distributed. It also allows for adjustments of baseline differences across the groups right across the distribution. The approach was used to independently match two of the intervention groups to the third.
19 Grieve et al. Descriptive. None reported. None reported. A cohort study for which 90% of unselected consecutive patients were matched to an appropriate rating. Correlation between costs and effects was taken into account using Seemingly Unrelated Regression (SUR). Missing data were imputed using ordinary least squares for length of stay and resource use and chained equations for adjusted analysis and utilities. Imputed datasets allowed for retention of between-imputation variance in estimating standard errors. Groups were comparable with respect to their characteristics.
21 Groeneveld et al. (2008) PSM approximates pseudo-randomisation of treatment and controls. It is also a simple and transparent statistical design.
Descriptive. Two different Cox proportional-hazards survival models.
Comparisons between median costs using Wilcoxon rank-sum non-parametric tests.
The initial Cox model included only ICD as a predictor of survival. A subsequent model included ICD receipt, the propensity score, and demographic/clinical characteristics that remained imperfectly balanced between groups across quintiles of propensity scores. Post-match balance of means and covariates was reported. The method of selecting controls was biased, by design, toward inclusion of patients who were "healthier" than typical device recipients. As such, survival in the control groups cannot be compared to survival in the pharmacologic arms of randomised clinical trials.
22 Heaton et al.
The use of propensity scores can reduce selection bias by 90%. Post-match balance was reported. Difference-in-difference assumption was tested indirectly by examining pre-intervention trends in outcomes for the two groups. In results, individual and quarterly fixed effects included in the regression were not reported.
27 Linden et al.
Can reduce selection bias and regression to the mean when randomisation is impractical.
Descriptive. Covariates chosen mainly because they were readily available in the data.
Both stratification and matching were used.
None reported. Authors note that the propensity score technique for DM programme evaluation requires large samples, especially when using subclassification, which was not the case in the study. Most subclasses had an extremely small number of participants. This leads to great variability in the covariate distribution. Administrative data suffer from lack of accuracy and also had limited variables. Post-match balance of means and the propensity score distributions were reported. Graphical analysis was also used.
28 Manca, Austin (2008) Propensity score analysis addresses some of the limitations of matching, stratification and regression. Unbiased estimation subject to ignorability.
Descriptive. None reported. Balance was checked with t or Wilcoxon rank sum tests for continuous variables and χ² tests for dichotomous variables. Distribution of the propensity score reported before and after matching.
Propensity score methodology could control for observable confounders but not for unobservable confounders. Costs and effects are adjusted separately but under the same model and therefore correlation is preserved in the mean estimate. It is unclear how the correlation might be taken forward to the uncertainty in the cost-effectiveness ratio. The comparison between the instrumental variables panel method and the least squares approach shows that bias does exist in the latter when estimating incremental costs and outcomes. No evidence that the instruments are not correlated to the unobserved heterogeneity in outcomes.
30 Merito, Pezzoti Propensity scores were used to account for selection bias. The propensity score methodology is one of the techniques recently introduced to address the issue of confounding in observational studies.
Descriptive. Also regressors in the logistic model chosen based on a forward stepwise procedure.
Various Cox proportional hazards models and OLS models for costs and consequences.
Goodness of fit of logit models by χ² and Hosmer-Lemeshow tests. Cox model tested using Schoenfeld residuals and graphic methods.
Tests of the balancing property for the observed covariates in the two groups were restricted to the region of common support for the propensity score. The balancing property was checked using standard statistical tests for the comparison of the difference in means between immediate and deferred patients within each propensity score stratum for continuous covariates, and of the difference in the odds ratios for categorical variables. Post-match balance for means and covariates and post-match distribution of covariates were not reported. Correlation between costs and effects was preserved in regression adjustment using seemingly unrelated regression and in propensity score analysis. A limitation in the propensity score analysis in terms of separate adjustments for each separate treatment comparison rather than comparison of all treatment options simultaneously is noted. Cost distributions in both groups were highly skewed with long tails; normality assumption for the net monetary benefit might not be appropriate. Authors note that propensity scores help make the treatment groups comparable with respect to important baseline characteristics. This in turn allows one to obtain more precise estimates of the net monetary benefit. The general linear model framework is useful in conducting subgroup net monetary benefit analysis by introducing a dummy variable for the subgroups and noting the estimate of the corresponding coefficient. Furthermore, this method provides estimates that are best linear unbiased estimates (BLUE) because they are the ordinary least squares solution to the normal regression equation.
33 Mojtabai, Zivin (2003) Propensity score analysis was used to account for selection bias.
The sociodemographic and clinical variables that had shown significant variation across modalities were included.
None reported. F-test and χ² test for continuous and categorical data comparison.
Mean balance of covariates in strata following calculation of propensity scores was reported. The cost-effectiveness analysis does not seem to be based on incremental costs and consequences but rather on average costs and consequences and their ratios.
34 Polignano et al. (2008) None provided. Descriptive. None reported. Student's t-test, χ², Fisher exact test.
The matched groups were homogeneous in terms of age, sex, coexisting morbidity, type of resection and prevalence of liver cirrhosis. The groups were matched for magnitude of resection and for tumour location and size. After selection of the case-matched controls, the intention-to-treat principle was applied. Authors acknowledge influence of social factors on length of hospital stay.