
Table 3 Sensitivity analysis for various high-cost user thresholds: predictive model performance

From: Predicting high health-cost users among people with cardiovascular disease using machine learning and nationwide linked social administrative datasets

| Prediction models | 30% high-cost user prevalence, Sensitivity^a | 30% prevalence, F1^d | 20% prevalence (the base case), Sensitivity^a | 20% prevalence (the base case), F1^d | 10% prevalence, Sensitivity^a | 10% prevalence, F1^d | 5% prevalence, Sensitivity^a | 5% prevalence, F1^d |
|---|---|---|---|---|---|---|---|---|
| Traditional regression models | | | | | | | | |
| All conventional variables (TRM1)^e | 17.9% | 26.4% | 4.9% | 9.1% | * | * | * | * |
| As per TRM1 but no ethnicity variables (TRM2) | 16.5% | 25.8% | 4.9% | 9.0% | * | * | * | * |
| As per TRM2 but no smoking variables (TRM3) | 16.3% | 25.6% | 4.6% | 8.6% | * | * | * | * |
| Machine learning models^f | | | | | | | | |
| Random forest | 45.2% | 49.3% | 37.8% | 41.2% | 29.9% | 32.6% | 25.6% | 28.5% |
| KNN | 45.7% | 46.5% | 38.0% | 39.0% | 29.2% | 30.1% | 25.2% | 26.0% |
| L1-regularised logistic regression | 75.2% | 50.9% | 78.9% | 34.5% | 72.5% | 21.0% | 76.2% | 25.0% |
| Classification trees | 46.1% | 55.3% | 19.5% | 30.6% | 11.4% | 19.8% | 10.9% | 19.5% |

  1. Note: *Results produced from the model were unstable due to a small number of CVD events in relation to the total observations
  2. a, b, c, d, e, f: see Table 2
  3. The results for the traditional regression model as per TRM3 but no chronic condition variables were not reported as this model had very poor predictive power
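
The table reports only sensitivity and F1, so the precision behind each pair is not shown. As a rough sketch, assuming F1 here is the standard harmonic mean of precision and sensitivity (footnote d is defined in Table 2, which is not reproduced here), the implied precision can be backed out as P = F1·R / (2R − F1). The function names below are illustrative, and because the reported percentages are rounded, the results are approximate.

```python
def f1_score(precision: float, sensitivity: float) -> float:
    """Standard F1: harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


def implied_precision(sensitivity: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, given R (sensitivity) and F1.

    Assumes the standard F1 definition; this is not confirmed by the table itself.
    """
    denom = 2 * sensitivity - f1
    if denom <= 0:
        raise ValueError("F1 must be smaller than 2 * sensitivity")
    return f1 * sensitivity / denom


# Base-case (20% prevalence) values copied from the table, as fractions.
base_case = {
    "Random forest": (0.378, 0.412),
    "KNN": (0.380, 0.390),
    "L1-regularised logistic regression": (0.789, 0.345),
    "Classification trees": (0.195, 0.306),
}

for model, (sens, f1) in base_case.items():
    precision = implied_precision(sens, f1)
    print(f"{model}: implied precision of about {precision:.1%}")
```

Read this way, a high sensitivity paired with a comparatively low F1 (as for the L1-regularised logistic regression) points to low precision, that is, many false positives among the predicted high-cost users.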