Comparative Analysis of Embedding Models for Hindi-English Code-Mixed University-Related Queries
DOI: https://doi.org/10.53032/tvcr/2025.v7n2.44

Keywords: Embedding Models, Code-Mixing, Hindi-English, Natural Language Processing

Abstract
This study presents a comparative analysis of open-source embedding models for understanding Hindi-English code-mixed university-related queries. With the increasing adoption of conversational agents in Indian higher education institutions, there is a need for systems that can effectively process queries containing mixed Hindi and English language elements. This research evaluates the performance of four state-of-the-art embedding models (MuRIL, IndicBERT, XLM-RoBERTa, and mBERT) on a custom dataset of university-related Hindi-English code-mixed queries. The models were assessed on key metrics including intent classification accuracy, entity recognition performance, and computational efficiency. The results indicate that MuRIL consistently outperforms the other models, achieving 87.3% intent classification accuracy and an 84.2% entity recognition F1-score, a 12.8% improvement over the other models. Analysis across varying code-mixing levels reveals that MuRIL maintains robust performance even at high mixing indices, while the other models degrade significantly. This research provides practical insights for educational institutions seeking to implement linguistically inclusive chatbot systems and contributes to the growing body of knowledge on multilingual NLP applications in educational contexts.
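The "mixing indices" referred to above are typically computed with the Code-Mixing Index (CMI), often attributed to Das and Gambäck, which measures how far an utterance departs from being monolingual. As a hedged illustration (the paper's exact index and tagging scheme are not given here), a minimal sketch over per-token language tags might look like this, where the tag `"other"` marks language-independent tokens such as named entities and numbers:

```python
def code_mixing_index(tags):
    """Compute CMI = 100 * (1 - max_i(w_i) / (n - u)), where n is the token
    count, u the count of language-independent tokens, and w_i the count of
    tokens in language i. Returns 0.0 for monolingual or untagged input.

    tags: list of per-token language tags, e.g. "hi", "en", or "other".
    """
    n = len(tags)
    lang_tags = [t for t in tags if t != "other"]
    u = n - len(lang_tags)  # language-independent tokens
    if n == u:              # no language-tagged tokens: treat as unmixed
        return 0.0
    max_w = max(lang_tags.count(t) for t in set(lang_tags))
    return 100.0 * (1.0 - max_w / (n - u))

# Illustrative query "exam kab hai next week", tagged token by token.
# A fully monolingual sentence would score 0.0.
print(code_mixing_index(["en", "hi", "hi", "en", "en"]))  # 40.0
```

Queries can then be bucketed by CMI (e.g. low, medium, high mixing) to study how model accuracy varies with the degree of code-mixing, as the abstract describes.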
References
Altrabsheh, N., Cocea, M., & Fallahkhair, S. (2019). Smart Learning Environments for Higher Education: Chatbots for Student Support. IEEE Access, 7, 177387-177395.
Bali, K., Sharma, J., Choudhury, M., & Vyas, Y. (2014). "I am borrowing ya mixing?" An Analysis of English-Hindi Code Mixing in Facebook. Proceedings of the First Workshop on Computational Approaches to Code Switching, 116-126.
Bhat, I., Bhat, R. A., Shrivastava, M., & Sharma, D. (2018). Universal Dependency Parsing for Hindi-English Code-Switching. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 987-998.
Chand, R. R., & Sharma, N. A. (2023). Development of Bilingual Chatbot for University Related FAQs Using Natural Language Processing and Deep Learning. In C.-H. Hsu, M. Xu, H. Cao, H. Baghban, & A. B. M. Shawkat Ali (Eds.), Big Data Intelligence and Computing. DataCom 2022. Lecture Notes in Computer Science, vol. 13864. Springer, Singapore. https://doi.org/10.1007/978-981-99-2233-8_6
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2020). Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 8440-8451.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171-4186.
Feng, F., Yang, Y., Cer, D., Arivazhagan, N., & Wang, W. (2020). Language-agnostic BERT Sentence Embedding. arXiv preprint arXiv:2007.01852.
Guzmán, F., Bouamor, H., Baly, R., & Habash, N. (2016). Machine Translation Evaluation for Arabic using Morphologically-enriched Embeddings. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 1398-1408.
Khanuja, S., Bansal, S., Mehtani, P., Khosla, S., Dey, A., Gopalan, B., Margam, D. K., Aggarwal, P., Nagipogu, R. T., Dave, S., Gupta, S., Khanna, S. C., Kumar, V., & Talukdar, P. (2021). MuRIL: Multilingual Representations for Indian Languages. arXiv preprint arXiv:2103.10730.
Khanuja, S., Dandapat, S., Srinivasan, A., Sitaram, S., & Choudhury, M. (2020). GLUECoS: An Evaluation Benchmark for Code-Switched NLP. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 3575-3585.
Kumar, N., & Bhattacharyya, P. (2021). Adaptive Pre-training for Effective Code-Mixed Natural Language Understanding. Proceedings of the 5th Workshop on Computational Approaches to Linguistic Code-Switching, 29-40.
License
Copyright (c) 2025 The Voice of Creative Research

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.