A Machine Learning-Based Prognostic Stratification Model for Liver Cancer: Results from Survival Studies Using SEER Database

Yuxuan Xiao; Zhuoying Li; Zhuojun Ye; Yu-Xin Zhou; Yixin Zou; Danni Yang; Yuting Tan; Qun Xu; Yongbing Xiang

doi:10.3967/bes2025.143


Primary liver cancer (PLC) is a major global health challenge, ranking as the sixth most common and third most fatal malignancy worldwide, according to GLOBOCAN 2022 estimates.^[1] This high mortality rate underscores the aggressive nature of the disease and the significant burden it places on global healthcare systems. Although primary prevention remains the cornerstone of liver cancer control, improving outcomes for patients already diagnosed is equally critical for mitigating the impact of the disease. Currently, the international five-year survival rate for PLC is < 20%,^[2] a stark figure that highlights the urgent need for more precise and reliable prognostic tools. Accurately stratifying patients by individual risk can directly guide personalized treatment decisions, allowing clinicians to tailor therapies to maximize efficacy while minimizing unnecessary toxicity. Such tools are essential for optimizing therapeutic strategies and preventing both overtreatment in low-risk patients and undertreatment in high-risk individuals, thereby helping to control healthcare costs. Furthermore, robust prognostic models can accelerate the development of innovative therapies by rapidly identifying high-risk populations that are most likely to benefit from inclusion in randomized controlled trials.

Unfortunately, many of the prognostic systems currently in clinical use, such as the Barcelona Clinic Liver Cancer (BCLC), Okuda, and tumor-node-metastasis (TNM) staging systems, are based on methodological frameworks that are now considered outdated. These systems often lack contemporary statistical validation and have significant limitations, such as poor discriminatory capacity for patients with early-stage disease or being validated primarily in specific, often advanced-stage, patient populations. Although new prognostic models have emerged, they face challenges in clinical translation, including the complexity of acquiring the necessary predictive variables, poor applicability across heterogeneous patient populations, and insufficient discriminative power. In recent years, the exponential growth in large-scale medical and health data has created opportunities to overcome these hurdles. Traditional statistical methods are increasingly struggling to meet the demands of complex data analysis. However, machine learning, with its powerful data processing and analytical capabilities, has provided a new paradigm for predicting cancer-related mortality and enabling precise, individualized prevention and control.^[3] Therefore, this study aimed to develop and internally validate prognostic models for 60-month PLC-specific mortality using both traditional statistical and machine-learning approaches.

In this context, this study leveraged comprehensive population-based data from the U.S. Surveillance, Epidemiology, and End Results (SEER) database to analyze the prognostic factors influencing survival in patients with PLC. The primary objective was to apply machine learning algorithms to construct a prognostic stratification model tailored to the diverse population of patients with liver cancer represented in this extensive database. The ultimate aim was to provide a clinically relevant tool for prognostic stratification that could serve as an evidence-based foundation for prognostic assessment, treatment selection, and follow-up management. By facilitating better-informed clinical decision-making, such a tool may contribute to improved survival outcomes in patients with liver cancer.

To achieve this, data were extracted for 57,526 patients with PLC diagnosed between 2000 and 2017, with a complete follow-up period of at least 60 months. This study first employed both Fine-Gray competing risk and Cox proportional hazards regression models to identify key prognostic factors. Subsequently, the entire patient cohort was partitioned into training and test sets, on which multiple machine-learning models were developed and rigorously compared. The best-performing model was then selected to build an individualized, user-friendly prediction tool for 60-month PLC-specific mortality risk.

The variables analyzed in the study were: age (numeric); survival months (numeric); sex (female, male); race (White and Hispanic, Black, Asian or Pacific Islander, American Indian/Alaska Native); year of diagnosis (2000-2005, 2006-2011, 2012-2017); median household income per year, inflation-adjusted to 2022 (≤ 90,000, > 90,000); residential status (counties in metropolitan areas, counties not in metropolitan areas); marital status (married categories, single categories, unknown); histologic type [hepatocellular carcinoma (HCC), intrahepatic cholangiocarcinoma (ICC), other]; stage (localized, regional, distant, unknown/unstaged); surgery (no/unknown, yes); radiotherapy (no/unknown, yes); chemotherapy (no/unknown, yes); time from diagnosis to treatment (≤ 2 months, > 2 months, no/unknown); and outcome [alive, dead (attributable to PLC), dead (attributable to other causes)]. The Appendix provides the detailed research methodology, including variable definitions, a full description of the statistical analysis, and the TRIPOD Checklist. The specific coding for each group is shown in eTable 1, where the group with the smallest code is considered the reference group.
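The three-level outcome described above (alive, death from PLC, death from other causes) is a competing-risks setting. As a minimal sketch of why that matters, the following computes a nonparametric cumulative incidence curve for one cause using the Aalen-Johansen estimator. This is not the Fine-Gray regression used in the study; the function name, the event coding (0, 1, 2), and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def cumulative_incidence(times, events, cause=1):
    """Aalen-Johansen estimate of the cumulative incidence of `cause`.

    Assumed coding (hypothetical, not taken from eTable 1):
    0 = alive/censored, 1 = death from PLC, 2 = death from other causes.
    Returns a list of (time, cumulative incidence) pairs at each event time.
    """
    # Tally how many subjects had each event code at each distinct time
    by_time = {}
    for t, e in zip(times, events):
        by_time.setdefault(t, Counter())[e] += 1

    at_risk = len(times)
    overall_survival = 1.0  # all-cause survival just before the current time
    cif = 0.0
    curve = []
    for t in sorted(by_time):
        counts = by_time[t]
        d_cause = counts[cause]
        d_any = sum(n for code, n in counts.items() if code != 0)
        if d_any > 0:
            # Probability of surviving all causes so far, then dying of `cause` now
            cif += overall_survival * d_cause / at_risk
            overall_survival *= 1.0 - d_any / at_risk
            curve.append((t, cif))
        # Everyone observed at time t (events and censored) leaves the risk set
        at_risk -= sum(counts.values())
    return curve


# Toy data: survival in months and outcome codes for six hypothetical patients
toy_times = [2, 3, 3, 5, 8, 10]
toy_events = [1, 2, 1, 0, 1, 0]
for month, incidence in cumulative_incidence(toy_times, toy_events, cause=1):
    print(f"month {month}: cumulative incidence of PLC death = {incidence:.3f}")
```

Deaths from other causes remove patients from the risk set without being treated as censored PLC deaths, which is why the estimate weights each step by all-cause survival rather than by cause-specific survival.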

Characteristics	Overall (n = 57,526)	Alive (n = 8,521)	Dead (attributable to PLC) (n = 39,994)	Dead (attributable to other causes) (n = 9,011)
Survival (months), Median [Q1, Q3]	10.00 [2.00, 41.00]	96.00 [64.00, 148.00]	6.00 [1.00, 19.00]	15.00 [3.00, 49.00]
Age (years), Median [Q1, Q3]	64.00 [56.00, 73.00]	59.00 [51.00, 66.00]	65.00 [57.00, 74.00]	65.00 [57.00, 75.00]
Diagnosis, n (%)
2000−2005	14,818 (25.76)	1,153 (13.53)	11,168 (27.92)	2,497 (27.71)
2006−2011	20,335 (35.35)	2,543 (29.84)	14,415 (36.04)	3,377 (37.48)
2012−2017	22,373 (38.89)	4,825 (56.62)	14,411 (36.03)	3,137 (34.81)
Sex, n (%)
Female	15,153 (26.34)	2,595 (30.45)	10,302 (25.76)	2,256 (25.04)
Male	42,373 (73.66)	5,926 (69.55)	29,692 (74.24)	6,755 (74.96)
Race, n (%)
White and Hispanic	40,681 (70.72)	5,646 (66.26)	28,620 (71.56)	6,415 (71.19)
Black	6,771 (11.77)	745 (8.74)	4,928 (12.32)	1,098 (12.19)
Asian or Pacific Islander	9,524 (16.56)	2,064 (24.22)	6,046 (15.12)	1,414 (15.69)
American Indian/Alaska Native	550 (0.96)	66 (0.77)	400 (1.00)	84 (0.93)
Income (RMB/year, household), n (%)
≤90,000	41,865 (72.78)	5,720 (67.13)	29,599 (74.01)	6,546 (72.64)
>90,000	15,661 (27.22)	2,801 (32.87)	10,395 (25.99)	2,465 (27.36)
Residential Status, n (%)
Counties in metropolitan areas	51,768 (89.99)	7,956 (93.37)	35,632 (89.09)	8,180 (90.78)
Counties not in metropolitan areas	5,758 (10.01)	565 (6.63)	4,362 (10.91)	831 (9.22)
Marital status, n (%)
Married categories	30,851 (53.63)	4,946 (58.04)	21,179 (52.96)	4,726 (52.45)
Single categories	24,015 (41.75)	3,201 (37.57)	17,002 (42.51)	3,812 (42.30)
Unknown	2,660 (4.62)	374 (4.39)	1,813 (4.53)	473 (5.25)
Histologic Type, n (%)
HCC	48,590 (84.47)	7,177 (84.23)	33,561 (83.92)	7,852 (87.14)
ICC	1,723 (3.00)	60 (0.70)	1,444 (3.61)	219 (2.43)
Other	7,213 (12.54)	1,284 (15.07)	4,989 (12.47)	940 (10.43)
Stage, n (%)
Localized	27,050 (47.02)	6,367 (74.72)	15,314 (38.29)	5,369 (59.58)
Regional	14,487 (25.18)	1,408 (16.52)	11,220 (28.05)	1,859 (20.63)
Distant	10,621 (18.46)	376 (4.41)	9,239 (23.10)	1,006 (11.16)
Unknown/unstaged	5,368 (9.33)	370 (4.34)	4,221 (10.55)	777 (8.62)
Surgery, n (%)
No/unknown	38,694 (67.26)	1,950 (22.88)	31,423 (78.57)	5,321 (59.05)
Yes	18,832 (32.74)	6,571 (77.12)	8,571 (21.43)	3,690 (40.95)
Radiotherapy, n (%)
No/Unknown	53,064 (92.24)	8,101 (95.07)	36,467 (91.18)	8,496 (94.28)
Yes	4,462 (7.76)	420 (4.93)	3,527 (8.82)	515 (5.72)
Chemotherapy, n (%)
No/unknown	37,794 (65.70)	5,342 (62.69)	26,045 (65.12)	6,407 (71.10)
Yes	19,732 (34.30)	3,179 (37.31)	13,949 (34.88)	2,604 (28.90)
Time from diagnosis to treatment, n (%)
≤ 2 months	21,255 (36.95)	4,747 (55.71)	13,236 (33.09)	3,272 (36.31)
> 2 months	10,601 (18.43)	2,348 (27.56)	6,320 (15.80)	1,933 (21.45)
No/Unknown	25,670 (44.62)	1,426 (16.74)	20,438 (51.10)	3,806 (42.24)
Note. PLC, primary liver cancer; HCC, hepatocellular carcinoma; ICC, intrahepatic cholangiocarcinoma.

Table 1. The baseline characteristics of SEER patients with PLC included in the study
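The percentages in Table 1 are column percentages: each count is divided by the total of its own outcome column, not by the row total. A short check using the Localized stage row is sketched below; the counts are copied from the table, while the dictionary keys and the helper name are illustrative assumptions.

```python
# Column totals and the "Localized" stage row, copied from Table 1
column_totals = {"overall": 57526, "alive": 8521, "dead_plc": 39994, "dead_other": 9011}
localized = {"overall": 27050, "alive": 6367, "dead_plc": 15314, "dead_other": 5369}


def column_percent(count, column_total):
    """Percentage of a column's patients that fall in one category."""
    return round(100.0 * count / column_total, 2)


localized_pct = {col: column_percent(n, column_totals[col]) for col, n in localized.items()}
print(localized_pct)

# The three outcome columns partition the overall column
outcome_columns = ("alive", "dead_plc", "dead_other")
assert sum(localized[c] for c in outcome_columns) == localized["overall"]
assert sum(column_totals[c] for c in outcome_columns) == column_totals["overall"]
```

Read this way, 74.72% of patients who were alive at the end of follow-up had localized disease, compared with 38.29% of those who died of PLC.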

References

[1]	Bray F, Laversanne M, Sung H, et al. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin, 2024; 74, 229−63.
[2]	Jiang YF, Li ZY, Ji XW, et al. Global pattern and trend of liver cancer survival: a systematic review of population-based studies. Hepatoma Res, 2020; 6, 52.
[3]	Chakraborty A, Pant MD. Machine learning models for pancreatic cancer survival prediction: a multi-model analysis across stages and treatments using the surveillance, epidemiology, and end results (SEER) database. J Clin Med, 2025; 14, 4686.
[4]	Rich NE, Murphy CC, Yopp AC, et al. Sex disparities in presentation and prognosis of 1110 patients with hepatocellular carcinoma. Aliment Pharmacol Ther, 2020; 52, 701−09.
[5]	Kim SY, Song HK, Lee SK, et al. Sex-biased molecular signature for overall survival of liver cancer patients. Biomol Ther (Seoul), 2020; 28, 491−502.
[6]	Iheanacho F, Tramontano A, Manz C. Racial disparities in hepatocellular carcinoma (HCC) treatment and survival: a SEER-medicare analysis. JCO Oncol Pract, 2024; 20, 144.
[7]	Rich NE, Jones PD, Zhu H, et al. Impact of racial, ethnic, and socioeconomic disparities on presentation and survival of HCC: a multicenter study. Hepatol Commun, 2024; 8, e0477.
[8]	Penzkofer L, Gröger LK, Hoppe-Lotichius M, et al. Mixed hepatocellular cholangiocarcinoma: a comparison of survival between mixed tumors, intrahepatic cholangiocarcinoma and hepatocellular carcinoma from a single center. Cancers (Basel), 2023; 15, 639.
[9]	Wei TT, Huang HB, Zhang AJ, et al. Impact of the diagnosis-to-treatment interval on the survival of patients with papillary thyroid cancer. J Invest Surg, 2025; 38, 2456463.
[10]	Gao Y, Liu J, Zhao DX, et al. A novel prognostic model for identifying the risk of hepatocellular carcinoma based on angiogenesis factors. Front Genet, 2022; 13, 857215.

Characteristics	Univariate sHR (95% CI)	P	Multivariable sHR (95% CI)	P
Age (per 1 year)	1.011 (1.011−1.011)^a	< 0.001	1.008 (1.007−1.009)^a	< 0.001
Sex
Male vs. Female	1.050 (1.030−1.080)^a	< 0.001	1.053 (1.027−1.079)^a	< 0.001
Diagnosis year
2006−2011 vs. 2000−2005	1.010 (0.990−1.030)	0.470	0.902 (0.878−0.927)^b	< 0.001
2012−2017 vs. 2000−2005	0.843 (0.822−0.854)^b	< 0.001	0.802 (0.781−0.825)^b	< 0.001
Race
Black vs. White and Hispanic	1.124 (1.092−1.166)^a	< 0.001	1.044 (1.009−1.079)^a	0.013
Asian or Pacific Islander vs. White and Hispanic	0.817 (0.795−0.839)^b	< 0.001	0.916 (0.889−0.943)^b	< 0.001
American Indian/Alaska Native vs. White and Hispanic	1.082 (0.980−1.184)	0.130	1.041 (0.942−1.150)	0.430
Income (RMB/year, household)
>90,000 vs. ≤90,000	0.873 (0.852−0.894)^b	< 0.001	0.968 (0.945−0.992)^b	0.010
Residential Status
Counties not in metropolitan areas vs. Counties in metropolitan areas	1.222 (1.182−1.252)^a	< 0.001	1.126 (1.088−1.166)^a	< 0.001
Marital status
Single categories vs. Married categories	1.102 (1.071−1.123)^a	< 0.001	1.025 (1.002−1.048)^a	0.034
Unknown vs. Married categories	0.993 (0.953−1.043)	0.770	0.875 (0.830−0.922)^b	< 0.001
Histologic Type
ICC vs. HCC	1.742 (1.652−1.842)^a	< 0.001	1.205 (1.133−1.283)^a	< 0.001
Other vs. HCC	1.051 (1.021−1.082)^a	0.001	0.996 (0.963−1.032)	0.840
Stage
Regional vs. Localized	1.342 (1.323−1.371)^a	< 0.001	1.602 (1.562−1.643)^a	< 0.001
Distant vs. Localized	2.314 (2.254−2.373)^a	< 0.001	2.123 (2.057−2.191)^a	< 0.001
Unknown/unstaged vs. Localized	1.402 (1.361−1.452)^a	< 0.001	1.294 (1.246−1.345)^a	< 0.001
Surgery
Yes vs. No/Unknown	0.303 (0.293−0.313)^b	< 0.001	0.408 (0.396−0.420)^b	< 0.001
Radiotherapy
Yes vs. No/Unknown	1.172 (1.142−1.201)^a	< 0.001	0.946 (0.914−0.978)^b	0.001
Chemotherapy
Yes vs. No/Unknown	0.896 (0.875−0.906)^b	< 0.001	0.885 (0.863−0.907)^b	< 0.001
Time from diagnosis to treatment
> 2 months vs. ≤ 2 months	0.593 (0.584−0.612)^b	< 0.001	0.805 (0.784−0.826)^b	< 0.001
No/Unknown vs. ≤ 2 months	2.131 (2.082−2.171)^a	< 0.001	1.117 (1.083−1.151)^a	< 0.001
Note. a, the 95% CI lies entirely above 1; b, the 95% CI lies entirely below 1. PLC, primary liver cancer; HCC, hepatocellular carcinoma; ICC, intrahepatic cholangiocarcinoma; sHR, sub-hazard ratio; CI, confidence interval.
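The a/b markers in the table follow directly from where the 95% confidence interval sits relative to 1, and the reported P values can be approximately recovered from the interval itself. The sketch below assumes the intervals are Wald intervals that are symmetric on the log scale; the function names are hypothetical, and the inputs are the multivariable sHR values for two race contrasts in the table.

```python
import math


def ci_marker(lower, upper):
    """Return 'a' if the whole interval is above 1, 'b' if below 1, else ''."""
    if lower > 1.0:
        return "a"
    if upper < 1.0:
        return "b"
    return ""


def wald_from_ci(estimate, lower, upper, z_crit=1.959964):
    """Back-calculate the log-scale SE, Wald z, and two-sided P from a 95% CI.

    Assumes the interval was built as exp(log(estimate) +/- z_crit * SE).
    """
    se = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = math.log(estimate) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return se, z, p


# Multivariable sHR, Black vs. White and Hispanic: 1.044 (1.009-1.079)
print(ci_marker(1.009, 1.079), wald_from_ci(1.044, 1.009, 1.079))
# Multivariable sHR, American Indian/Alaska Native: 1.041 (0.942-1.150)
print(ci_marker(0.942, 1.150), wald_from_ci(1.041, 0.942, 1.150))
```

Because the published estimates are rounded to three decimals, the recovered P values agree with the table only approximately.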

Characteristics	Univariate HR (95% CI)	P	Multivariable aHR (95% CI)	P
Age (per 1 year)	1.019 (1.018−1.019)^a	< 0.001	1.014 (1.013−1.015)^a	< 0.001
Sex
Male vs. Female	1.075 (1.051−1.099)^a	< 0.001	1.128 (1.102−1.155)^a	< 0.001
Diagnosis year
2006−2011 vs. 2000−2005	0.832 (0.812−0.853)^b	< 0.001	0.892 (0.870−0.915)^b	< 0.001
2012−2017 vs. 2000−2005	0.741 (0.723−0.760)^b	< 0.001	0.793 (0.773−0.813)^b	< 0.001
Race
Black vs. White and Hispanic	1.135 (1.101−1.169)^a	< 0.001	1.108 (1.075−1.143)^a	< 0.001
Asian or Pacific Islander vs. White and Hispanic	0.784 (0.763−0.806)^b	< 0.001	0.891 (0.866−0.917)^b	< 0.001
American Indian/Alaska Native vs. White and Hispanic	1.038 (0.940−1.146)	0.459	1.016 (0.920−1.122)	0.756
Income (RMB/year, household)
>90,000 vs. ≤90,000	0.834 (0.815−0.853)^b	< 0.001	0.941 (0.920−0.963)^b	< 0.001
Residential Status
Counties not in metropolitan areas vs. Counties in metropolitan areas	1.247 (1.209−1.287)^a	< 0.001	1.137 (1.101−1.175)^a	< 0.001
Marital status
Single categories vs. Married categories	1.153 (1.130−1.177)^a	< 0.001	1.095 (1.072−1.118)^a	< 0.001
Unknown vs. Married categories	1.103 (1.052−1.157)^a	< 0.001	0.883 (0.841−0.927)^b	< 0.001
Histologic Type
ICC vs. HCC	2.058 (1.952−2.170)^a	< 0.001	1.311 (1.243−1.383)^a	< 0.001
Other vs. HCC	1.077 (1.045−1.109)^a	< 0.001	1.045 (1.013−1.077)^a	0.005
Stage
Regional vs. Localized	2.133 (2.081−2.186)^a	< 0.001	1.807 (1.762−1.853)^a	< 0.001
Distant vs. Localized	4.053 (3.947−4.163)^a	< 0.001	2.784 (2.706−2.864)^a	< 0.001
Unknown/unstaged vs. Localized	2.480 (2.397−2.567)^a	< 0.001	1.263 (1.218−1.309)^a	< 0.001
Surgery
Yes vs. No/Unknown	0.215 (0.210−0.220)^b	< 0.001	0.299 (0.289−0.308)^b	< 0.001
Radiotherapy
Yes vs. No/Unknown	1.144 (1.105−1.184)^a	< 0.001	0.828 (0.797−0.860)^b	< 0.001
Chemotherapy
Yes vs. No/Unknown	0.808 (0.791−0.824)^b	< 0.001	0.750 (0.730−0.770)^b	< 0.001
Time from diagnosis to treatment
> 2 months vs. ≤ 2 months	0.837 (0.820−0.862)^b	< 0.001	0.747 (0.724−0.770)^b	< 0.001
No/Unknown vs. ≤ 2 months	2.622 (2.564−2.681)^a	< 0.001	1.235 (1.197−1.274)^a	< 0.001
Note. a, the 95% CI lies entirely above 1; b, the 95% CI lies entirely below 1. PLC, primary liver cancer; HCC, hepatocellular carcinoma; ICC, intrahepatic cholangiocarcinoma; HR, hazard ratio; aHR, adjusted hazard ratio; CI, confidence interval.
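Under a Cox model, covariate effects are additive on the log-hazard scale, so hazard ratios multiply. A per-year ratio can therefore be rescaled to a longer age gap, and ratios for several covariates can be combined. The sketch below uses aHR values from the table and assumes a main-effects model with no interaction terms; whether the study's model includes interactions is not stated here, and the helper names are hypothetical.

```python
import math


def rescale_hr(hr_per_unit, units):
    """Hazard ratio for a difference of `units` in a continuous covariate."""
    return hr_per_unit ** units


def combined_hr(hazard_ratios):
    """Product of hazard ratios, computed on the log scale (no interactions assumed)."""
    return math.exp(sum(math.log(hr) for hr in hazard_ratios))


# Age: aHR 1.014 per year, rescaled to a 10-year difference
age_decade = rescale_hr(1.014, 10)

# Distant stage without surgery vs. localized stage with surgery:
# aHR 2.784 for distant vs. localized, and 1 / 0.299 for no surgery vs. surgery
distant_no_surgery = combined_hr([2.784, 1.0 / 0.299])

print(f"aHR per 10 years of age: {age_decade:.3f}")
print(f"aHR, distant without surgery vs. localized with surgery: {distant_no_surgery:.2f}")
```

These combined figures are arithmetic consequences of the fitted coefficients under the stated assumption, not additional results from the study.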
