Table 2 Predictive performance of identifying high-cost patients with CVD for traditional regression models versus machine learning models

From: Predicting high health-cost users among people with cardiovascular disease using machine learning and nationwide linked social administrative datasets

| Prediction models | Sensitivity (a) | Specificity (b) | Positive predictive value / precision (c) | Harmony score (sensitivity and positive predictive value), F1 (d) | AUC (e) |
|---|---|---|---|---|---|
| Traditional regression models | | | | | |
| All conventional variables (TRM1) (f) | 4.9% | 99.2% | 61.2% | 9.1% | 0.53 |
| As per TRM1 but no ethnicity variables (TRM2) | 4.9% | 99.2% | 62.5% | 9.0% | 0.53 |
| As per TRM2 but no smoking variables (TRM3) | 4.6% | 99.2% | 61.8% | 8.6% | 0.53 |
| As per TRM3 but no chronic condition variables | 0.0% | 100% | Not calculable | Not calculable | 0.53 |
| Machine learning models (g) | | | | | |
| Random forest | 37.8% | 88.6% | 45.2% | 41.2% | 0.70 |
| KNN | 38.0% | 85.9% | 40.1% | 39.0% | 0.45 |
| L1-regularised logistic regression | 78.9% | 83.5% | 22.1% | 34.5% | 0.62 |
| Classification trees | 19.5% | 98.0% | 71.2% | 30.6% | 0.73 |

  1. Note: high-cost users are the top 20% of all people with CVD
  2. (a) Correctly identified people with CVD who are high-cost = recall = true positive rate
  3. (b) Correctly identified people with CVD who are not high-cost = true negative rate
  4. (c) True positives / (true positives + false positives) = true positives among people who were predicted positive
  5. (d) F1 = 2/(1/recall + 1/precision)
  6. (e) Area under the Receiver Operating Characteristic curve
  7. (f) Age, sex, ethnicity, smoking status, and other chronic conditions: diabetes, cancers and traumatic brain injuries
  8. (g) All variables as per Table 1
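
The footnote definitions can be illustrated with a minimal sketch that derives each metric from confusion-matrix counts. The function names (`classification_metrics`, `f1_from_rates`) and the example counts are hypothetical and are not taken from the study. Only the rates passed to `f1_from_rates` in the last lines come from the table. A value of `None` stands in for "Not calculable", which arises when a model predicts no positives and precision has a zero denominator.

```python
from typing import Dict, Optional


def f1_from_rates(recall: float, precision: float) -> Optional[float]:
    """Harmonic mean of recall and precision: 2 / (1/recall + 1/precision)."""
    if not recall or not precision:
        # A zero or undefined input makes the harmonic mean not calculable.
        return None
    return 2 / (1 / recall + 1 / precision)


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, Optional[float]]:
    """Compute the table's metrics from confusion-matrix counts (illustrative only)."""

    def ratio(numerator: int, denominator: int) -> Optional[float]:
        # Return None instead of dividing by zero.
        return numerator / denominator if denominator else None

    sensitivity = ratio(tp, tp + fn)  # true positive rate (recall)
    specificity = ratio(tn, tn + fp)  # true negative rate
    precision = ratio(tp, tp + fp)    # positive predictive value
    f1 = f1_from_rates(sensitivity, precision) if sensitivity and precision else None
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
example = classification_metrics(tp=40, fp=10, fn=60, tn=390)

# A model that never predicts "high-cost": precision and F1 are not calculable.
no_positives = classification_metrics(tp=0, fp=0, fn=100, tn=400)

# Cross-check against the random forest row (sensitivity 37.8%, precision 45.2%).
random_forest_f1 = f1_from_rates(0.378, 0.452)
```

Rounded to three decimals, `random_forest_f1` gives 0.412, matching the 41.2% reported for the random forest. The `no_positives` case reproduces the pattern in the last traditional regression row: sensitivity 0, specificity 1, and no calculable precision or F1.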