Institute Forum No. 21: A Novel Penalized Log-likelihood Objective Function for Class Imbalance Problem

November 19, 2019

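Talk title: A Novel Penalized Log-likelihood Objective Function for Class Imbalance Problem

Speaker: Zhang Lili (张丽丽)

Time: November 19, 2019, 10:00-11:30

Venue: Room 223, Quanxue Building (劝学楼223)

Organizer: Institute of Modern Supply Chain Management (现代供应链管理研究院)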

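【About the Speaker】

Zhang Lili holds a Ph.D. in Analytics and Data Science from Kennesaw State University, a master's degree in Industrial Engineering from the University of Tennessee, Knoxville, and a bachelor's degree in Electronic Information Science and Technology from Central South University. Current research focuses on improving standard statistical and machine learning models so that they better identify minority outcomes that are more costly when misclassified (e.g., bankruptcy, money laundering, crime, medical risk, customer response).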

【Abstract】

The log-likelihood function is the optimization objective in the maximum likelihood method for estimating model coefficients. However, its underlying assumption is that overall accuracy should be maximized, which does not suit the imbalanced data found in many real-world problems (e.g., fraud detection, defective product detection, customer conversion prediction, predictive maintenance, cybersecurity, rare disease diagnosis). The resulting models tend to be biased towards the majority class (i.e., non-event), which can cause substantial losses in practice. One strategy for mitigating this bias is to penalize the misclassifications of observations differently in the log-likelihood objective function during learning. Existing penalized log-likelihood functions either require difficult hyperparameter estimation or incur high computational complexity. In the present work, we propose a novel penalized log-likelihood function that includes penalty weights for observations in the minority class (i.e., event) as decision variables and learns them from data along with the model coefficients. In the experiments, we compared models trained with the proposed log-likelihood function and with existing ones in terms of the area under the ROC curve (AUC) over 100 runs of 10-fold stratified cross-validation on 10 public datasets, reporting the mean, standard deviation and 95% confidence interval, as well as the training time. A more detailed analysis was conducted on an imbalanced credit dataset to examine the estimated probability distributions and additional performance measures (i.e., Type I error, Type II error and accuracy). The results demonstrate that both the discrimination ability and the computational efficiency of the models are improved by using the proposed log-likelihood function as the learning objective.
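To make the general strategy mentioned in the abstract concrete (penalizing misclassified minority observations more heavily in the log-likelihood), the following is a minimal Python sketch of a logistic regression whose event terms are multiplied by a fixed weight. It is an illustration only and is not the method proposed in the talk: the talk treats the penalty weights as decision variables learned from data, whereas here `event_weight` is a hand-set constant. The function names, the toy data, and the gradient-ascent settings are all assumptions made for this example.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(beta, x):
    # Estimated event probability for one observation x.
    return sigmoid(sum(b * v for b, v in zip(beta, x)))

def weighted_log_likelihood(beta, X, y, event_weight=1.0):
    # Each event (y == 1) term is scaled by event_weight (hypothetical,
    # fixed penalty); event_weight == 1 gives the standard log-likelihood.
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict(beta, xi)
        if yi == 1:
            total += event_weight * math.log(p + eps)
        else:
            total += math.log(1.0 - p + eps)
    return total

def fit(X, y, event_weight=1.0, lr=0.1, steps=2000):
    # Plain gradient ascent on the weighted log-likelihood.
    beta = [0.0] * len(X[0])
    n = len(X)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            p = predict(beta, xi)
            w = event_weight if yi == 1 else 1.0
            for j, v in enumerate(xi):
                grad[j] += w * (yi - p) * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Toy imbalanced data (made up): intercept column plus one feature,
# 8 non-events and 2 events whose feature values overlap.
X = [[1.0, x] for x in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 2.0, 3.0)]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

beta_plain = fit(X, y, event_weight=1.0)
beta_weighted = fit(X, y, event_weight=5.0)

x_event = [1.0, 3.0]  # one of the event observations
p_plain = predict(beta_plain, x_event)
p_weighted = predict(beta_weighted, x_event)
print(f"unweighted: {p_plain:.3f}, weighted: {p_weighted:.3f}")
```

On this toy data the weighted fit assigns a higher estimated probability to the event observation than the unweighted fit, which is the bias correction the abstract describes. Choosing `event_weight` by hand is the kind of hyperparameter estimation the abstract calls difficult, and it is what the proposed approach is said to avoid by learning the weights jointly with the coefficients.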