Evaluating embedding models for text classification in apartment management
Abstract
The recent proliferation of embedding models has made text classification far more accessible. The crucial challenge, however, is evaluating and selecting the most effective embedding model for a specific domain from a vast number of options. In this study, we address this challenge by assessing embedding models according to their effectiveness in downstream tasks. We analyze consultation records maintained by an apartment management body in South Korea and convert this textual data into numerical representations using various embedding models. The vectorized text is then grouped with a k-means clustering algorithm. The downstream task, namely the classification of consultation records, is evaluated using a quantitative metric (the Silhouette score) and qualitative approaches (domain-specific knowledge and visual inspection). The qualitative approaches yield more reliable results than the quantitative metric. These findings are expected to be valuable for the various stakeholders in property management.
Keywords: embedding model, text data, clustering, domain-specific knowledge, apartment management
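
To make the pipeline described in the abstract concrete, the following minimal sketch (not the authors' code) embeds a few example consultation texts, clusters the vectors with k-means, and computes the Silhouette score. The library choices (sentence-transformers, scikit-learn), the model name, the number of clusters, and the sample texts are illustrative assumptions only.

    # Minimal sketch of the embed -> cluster -> evaluate pipeline (illustrative only).
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    texts = [
        "Noise complaint between floors in building 103",       # sample texts are
        "Question about long-term repair plan contributions",   # invented examples,
        "Request to review last month's management expenses",   # not real records
        "Elevator maintenance schedule inquiry",
    ]

    # 1) Convert text to dense vectors with an embedding model (model name is a
    #    hypothetical choice, not one reported in the paper).
    model = SentenceTransformer("all-mpnet-base-v2")
    embeddings = model.encode(texts)

    # 2) Group the vectors with k-means; k is chosen here for illustration only.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

    # 3) Quantitative check: Silhouette score lies in [-1, 1], higher is better.
    #    The paper argues this metric should be complemented by qualitative review.
    print(silhouette_score(embeddings, labels))
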

This work is licensed under a Creative Commons Attribution 4.0 International License.