A hybrid clustering and boosting tree feature selection (CBTFS) method for credit risk assessment with high-dimensionality

Jianxin Zhu; Xiong Wu; Lean Yu; Xiaoming Zhang

doi:10.3846/tede.2025.23060

Abstract

To address the high-dimensionality problem in credit risk assessment, a hybrid clustering and boosting tree feature selection (CBTFS) method is proposed. In this method, an improved minimum spanning tree model is first used to remove redundant and irrelevant features. Three embedded feature selection approaches (i.e., Random Forest, XGBoost, and AdaBoost) are then used to further enhance feature-ranking efficiency and to obtain better prediction performance by applying the optimal features. For verification purposes, two real-world credit datasets are used to demonstrate the effectiveness of the proposed CBTFS method. Experimental results demonstrate that the proposed method is superior to other classic feature selection methods. This indicates that the proposed CBTFS method can be used as a promising tool for solving the high-dimensionality problem in credit risk assessment.
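The first stage described above can be illustrated with a minimal sketch. The abstract does not specify how the improved minimum spanning tree model works, so the code below is only a generic MST-based redundancy filter under stated assumptions: features are nodes, the edge weight is 1 - |Pearson correlation|, long MST edges are cut to form clusters, and each cluster keeps the feature most correlated with the label. The function names, the `cut` and `min_relevance` thresholds, and the toy data are all hypothetical and are not taken from the paper. The second stage (ranking the surviving features with Random Forest, XGBoost, and AdaBoost) is not shown.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def _find(parent, n):
    """Union-find root lookup with path halving."""
    while parent[n] != n:
        parent[n] = parent[parent[n]]
        n = parent[n]
    return n


def mst_feature_filter(features, label, cut=0.3, min_relevance=0.1):
    """Hypothetical stage-1 filter (an assumption, not the paper's model).

    features:      dict mapping feature name -> list of numeric values
    label:         list of numeric class labels
    cut:           MST edges with distance 1 - |r| above this are removed
    min_relevance: representatives with |r(feature, label)| below this are dropped
    """
    names = list(features)
    # Complete graph over features; a small distance means redundant features.
    edges = sorted(
        (1.0 - abs(pearson(features[a], features[b])), a, b)
        for a, b in combinations(names, 2)
    )

    # Kruskal's algorithm: take the shortest edges that do not close a cycle.
    parent = {n: n for n in names}
    mst = []
    for dist, a, b in edges:
        ra, rb = _find(parent, a), _find(parent, b)
        if ra != rb:
            parent[ra] = rb
            mst.append((dist, a, b))

    # Cut long MST edges; the remaining connected components are the clusters.
    parent = {n: n for n in names}
    for dist, a, b in mst:
        if dist <= cut:
            parent[_find(parent, a)] = _find(parent, b)
    clusters = {}
    for n in names:
        clusters.setdefault(_find(parent, n), []).append(n)

    # Keep one representative per cluster: the feature most correlated with the label.
    relevance = {n: abs(pearson(features[n], label)) for n in names}
    selected = set()
    for members in clusters.values():
        best = max(members, key=relevance.get)
        if relevance[best] >= min_relevance:
            selected.add(best)
    return [n for n in names if n in selected]


# Toy data (made up for illustration): x2 is nearly a copy of x1,
# x3 is unrelated to the label, x4 is weakly related and not redundant.
demo = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [1, 2, 4, 3, 5, 6],
    "x3": [3, 1, 2, 3, 1, 2],
    "x4": [0, 1, 0, 1, 0, 1],
}
default_flag = [0, 0, 0, 1, 1, 1]
selected = mst_feature_filter(demo, default_flag)
print(selected)
```

On this toy data the filter keeps `x1` and `x4`: `x2` is dropped as redundant with `x1`, and `x3` is dropped as irrelevant to the label. In a full pipeline, the surviving columns would then be passed to the tree-based rankers named in the abstract.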

First published online 12 February 2025

Keywords: feature selection, high-dimensionality, credit risk, minimum spanning tree

How to Cite

Zhu, J., Wu, X., Yu, L., & Zhang, X. (2025). A hybrid clustering and boosting tree feature selection (CBTFS) method for credit risk assessment with high-dimensionality. Technological and Economic Development of Economy, 1-33. https://doi.org/10.3846/tede.2025.23060

Published in Issue

Feb 12, 2025

This work is licensed under a Creative Commons Attribution 4.0 International License.
