Predicting high health-cost users among people with cardiovascular disease using machine learning and nationwide linked social administrative datasets

Table 3 Sensitivity analysis for various high-cost user thresholds: predictive model performance

Prediction models	30% high-cost user prevalence		20% prevalence (the base case)		10% prevalence		5% prevalence
Prediction models	Sensitivity^a	F1^d	Sensitivity^a	F1^d	Sensitivity^a	F1^d	Sensitivity^a	F1^d
Traditional regression models
All conventional variables (TRM1)^e	17.9%	26.4%	4.9%	9.1%	*	*	*	*
As per TRM1 but no ethnicity variables (TRM2)	16.5%	25.8%	4.9%	9.0%	*	*	*	*
As per TRM2 but no smoking variables (TRM3)	16.3%	25.6%	4.6%	8.6%	*	*	*	*
Machine learning models^f
Random forest	45.2%	49.3%	37.8%	41.2%	29.9%	32.6%	25.6%	28.5%
KNN	45.7%	46.5%	38.0%	39.0%	29.2%	30.1%	25.2%	26.0%
L1-regularised logistic regression	75.2%	50.9%	78.9%	34.5%	72.5%	21.0%	76.2%	25.0%
Classification trees	46.1%	55.3%	19.5%	30.6%	11.4%	19.8%	10.9%	19.5%

Note: *Results produced from the model were unstable because of the small number of CVD events relative to the total number of observations
^{a, b, c, d, e, f}: see Table 2
Results for the traditional regression model specified as per TRM3 but without chronic condition variables are not reported because this model had very poor predictive power
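
The table reports sensitivity and F1 but not precision. As a reading aid, the sketch below recovers an approximate precision from each pair of reported values. It assumes that F1 is the standard harmonic mean of precision and sensitivity (the definition itself is given in Table 2, which is not reproduced here). The function name `precision_from_f1` is hypothetical, and because the table values are rounded to one decimal place, the derived figures are only approximate.

```python
def precision_from_f1(sensitivity: float, f1: float) -> float:
    """Recover precision from sensitivity (recall) and F1.

    Assumes F1 = 2 * P * R / (P + R), which rearranges to
    P = F1 * R / (2 * R - F1).
    """
    denom = 2 * sensitivity - f1
    if denom <= 0:
        raise ValueError("F1 cannot exceed 2 * sensitivity under this definition")
    return f1 * sensitivity / denom


# (sensitivity, F1) at 20% prevalence (the base case), copied from the table.
base_case = {
    "Random forest": (0.378, 0.412),
    "KNN": (0.380, 0.390),
    "L1-regularised logistic regression": (0.789, 0.345),
    "Classification trees": (0.195, 0.306),
}

for model, (sens, f1) in base_case.items():
    print(f"{model}: approximate precision = {precision_from_f1(sens, f1):.1%}")
```

Under this assumption, the L1-regularised logistic regression pairs the highest sensitivity in the base case with an implied precision of roughly 22%, which is consistent with its F1 being lower than that of the random forest.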

ISSN: 2191-1991