A hybrid clustering and boosting tree feature selection (CBTFS) method for credit risk assessment with high-dimensionality
Abstract
To address the high-dimensionality problem in credit risk assessment, a hybrid clustering and boosting tree feature selection (CBTFS) method is proposed. In the hybrid methodology, an improved minimum spanning tree model is first used to remove redundant and irrelevant features. Three embedded feature selection approaches (i.e., Random Forest, XGBoost, and AdaBoost) are then used to further improve feature-ranking efficiency and to obtain better prediction performance from the optimal features. For verification, two real-world credit datasets are used to demonstrate the effectiveness of the proposed CBTFS methodology. Experimental results demonstrate that the proposed method is superior to other classic feature selection methods. This indicates that the proposed CBTFS method can be used as a promising tool for solving the high-dimensionality problem in credit risk assessment.
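To make the first stage concrete, the following is a minimal sketch of generic minimum-spanning-tree feature clustering. It is not the improved model proposed in the paper, whose details are not given in this abstract. The function names, the use of absolute Pearson correlation as both relevance and similarity measure, and the thresholds `min_relevance` and `cut` are all assumptions made for illustration.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def mst_edges(dist):
    """Prim's algorithm on a dense distance matrix; returns a list of (i, j, weight)."""
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        w, i, j = min(
            (dist[i][j], i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((i, j, w))
        in_tree.add(j)
    return edges


def mst_feature_filter(columns, target, min_relevance=0.1, cut=0.5):
    """Hypothetical stage 1: drop irrelevant features, then keep one feature per MST cluster."""
    # Irrelevant features: weak absolute correlation with the target (assumed criterion).
    relevance = [abs(pearson(c, target)) for c in columns]
    kept = [k for k, r in enumerate(relevance) if r >= min_relevance]
    if len(kept) <= 1:
        return kept

    # Distance between features: 1 - |correlation|, so redundant features are close.
    dist = [[1 - abs(pearson(columns[a], columns[b])) for b in kept] for a in kept]

    # Cutting MST edges longer than `cut` leaves connected components (clusters).
    parent = list(range(len(kept)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, w in mst_edges(dist):
        if w <= cut:
            parent[find(i)] = find(j)

    # Redundant features: keep only the most relevant feature in each cluster.
    best = {}
    for pos, k in enumerate(kept):
        root = find(pos)
        if root not in best or relevance[k] > relevance[best[root]]:
            best[root] = k
    return sorted(best.values())


# Synthetic demonstration data (not from the paper).
rng = random.Random(0)
x0 = [rng.gauss(0, 1) for _ in range(200)]
x1 = [v + rng.gauss(0, 0.01) for v in x0]   # near-duplicate of x0 (redundant)
x2 = [rng.gauss(0, 1) for _ in range(200)]  # independent informative feature
x3 = [1.0] * 200                            # constant feature (irrelevant)
y = [1 if a + b > 0 else 0 for a, b in zip(x0, x2)]

selected = mst_feature_filter([x0, x1, x2, x3], y)
print(selected)
```

In a full pipeline of the kind the abstract describes, the surviving feature indices would then be ranked by the embedded tree models (Random Forest, XGBoost, AdaBoost); that second stage is not shown here.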
First published online 12 February 2025
Keywords: feature selection, high-dimensionality, credit risk, minimum spanning tree
This work is licensed under a Creative Commons Attribution 4.0 International License.