INTRODUCTION: Liver cancer is a primary worldwide public health concern, and it is critical to understand the disease's physiology and create therapies. The aim of this study is to classify open access liver cancer data and identify important risk factors with the Random Forest method.
METHODS: The open-access liver cancer dataset was used to construct a predive model in the study. Random Forest was used to classify the disease. Balanced accuracy, accuracy, sensitivity, specificity, positive/negative predictive values were evaluated for model performance. In addition, risk factors were assessed with the logistic regression model.
RESULTS: The accuracy, balanced accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and F1 score metrics obtained with the Random Forest model were 98.9%, 97.9%, 95.8%, 100%, 100%, and 98.3%, and 95.7% respectively. Also, the importance of the variables obtained, the most important risk factors for liver cancer were total proteins, albumin and globulin ratio, albumin, age, total bilirubin, aspartate aminotransferase, direct bilirubin, alamine aminotransferase, alkaline phosphatase, respectively. According to the logistic regression model results, age, direct bilirubin, and albumin variables were statistically sig-nificant.
DISCUSSION AND CONCLUSION: According to the study results, with the machine learning model Random forest used, patients with and without liver cancer were classified with high accuracy, and the importance of the variables related to cancer status was determined. Factors with high variable importance can be considered potential risk factors associated with cancer status and can play an essential role in disease diagnosis.