Comparative Analysis of Embedding Models for Hindi-English Code-Mixed University-Related Queries
DOI: https://doi.org/10.53032/tvcr/2025.v7n2.44

Keywords: Embedding Models, Code-Mixing, Hindi-English, Natural Language Processing

Abstract
This study presents a comparative analysis of open-source embedding models for understanding Hindi-English code-mixed university-related queries. With the increasing adoption of conversational agents in Indian higher education institutions, there is a need for systems that can effectively process queries containing mixed Hindi and English language elements. This research evaluates the performance of four state-of-the-art embedding models (MuRIL, IndicBERT, XLM-RoBERTa, and mBERT) on a custom dataset of university-related Hindi-English code-mixed queries. The models were assessed on key metrics including intent classification accuracy, entity recognition performance, and computational efficiency. The results indicate that MuRIL consistently outperforms the other models, achieving 87.3% intent classification accuracy and an 84.2% entity recognition F1-score, a 12.8% improvement over the other models. Analysis across varying code-mixing levels reveals that MuRIL maintains robust performance even at high mixing indices, while the other models degrade significantly. This research provides practical insights for educational institutions seeking to implement linguistically inclusive chatbot systems and contributes to the growing body of knowledge on multilingual NLP applications in educational contexts.
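The "mixing indices" referred to above are typically computed with the Code-Mixing Index (CMI), often attributed to Das and Gambäck, which measures how far an utterance departs from being monolingual. As a hedged illustration (the paper's exact index and tagging scheme are not given here), a minimal sketch over per-token language tags might look like this, where the tag `"other"` marks language-independent tokens such as named entities and numbers:

```python
def code_mixing_index(tags):
    """Compute CMI = 100 * (1 - max_i(w_i) / (n - u)), where n is the token
    count, u the count of language-independent tokens, and w_i the count of
    tokens in language i. Returns 0.0 for monolingual or untagged input.

    tags: list of per-token language tags, e.g. "hi", "en", or "other".
    """
    n = len(tags)
    lang_tags = [t for t in tags if t != "other"]
    u = n - len(lang_tags)  # language-independent tokens
    if n == u:              # no language-tagged tokens: treat as unmixed
        return 0.0
    max_w = max(lang_tags.count(t) for t in set(lang_tags))
    return 100.0 * (1.0 - max_w / (n - u))

# Illustrative query "exam kab hai next week", tagged token by token.
# A fully monolingual sentence would score 0.0.
print(code_mixing_index(["en", "hi", "hi", "en", "en"]))  # 40.0
```

Queries can then be bucketed by CMI (e.g. low, medium, high mixing) to study how model accuracy varies with the degree of code-mixing, as the abstract describes.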
References
Altrabsheh, N., Cocea, M., & Fallahkhair, S. (2019). Smart Learning Environments for Higher Education: Chatbots for Student Support. IEEE Access, 7, 177387-177395.
Bali, K., Sharma, J., Choudhury, M., & Vyas, Y. (2014). "I am borrowing ya mixing?" An Analysis of English-Hindi Code Mixing in Facebook. Proceedings of the First Workshop on Computational Approaches to Code Switching, 116-126.
Bhat, I., Bhat, R. A., Shrivastava, M., & Sharma, D. (2018). Universal Dependency Parsing for Hindi-English Code-Switching. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 987-998.
Chand, R. R., & Sharma, N. A. (2023). Development of Bilingual Chatbot for University Related FAQs Using Natural Language Processing and Deep Learning. In C.-H. Hsu, M. Xu, H. Cao, H. Baghban, & A. B. M. Shawkat Ali (Eds.), Big Data Intelligence and Computing. DataCom 2022. Lecture Notes in Computer Science, vol. 13864. Springer, Singapore. https://doi.org/10.1007/978-981-99-2233-8_6
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2020). Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 8440-8451.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171-4186.
Feng, F., Yang, Y., Cer, D., Arivazhagan, N., & Wang, W. (2020). Language-agnostic BERT Sentence Embedding. arXiv preprint arXiv:2007.01852.
Guzmán, F., Bouamor, H., Baly, R., & Habash, N. (2016). Machine Translation Evaluation for Arabic using Morphologically-enriched Embeddings. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 1398-1408.
Khanuja, S., Bansal, S., Mehtani, P., Khosla, S., Dey, A., Gopalan, B., Margam, D. K., Aggarwal, P., Nagipogu, R. T., Dave, S., Gupta, S., Khanna, S. C., Kumar, V., & Talukdar, P. (2021). MuRIL: Multilingual Representations for Indian Languages. arXiv preprint arXiv:2103.10730.
Khanuja, S., Dandapat, S., Srinivasan, A., Sitaram, S., & Choudhury, M. (2020). GLUECoS: An Evaluation Benchmark for Code-Switched NLP. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 3575-3585.
Kumar, N., & Bhattacharyya, P. (2021). Adaptive Pre-training for Effective Code-Mixed Natural Language Understanding. Proceedings of the 5th Workshop on Computational Approaches to Linguistic Code-Switching, 29-40.
License
Copyright (c) 2025 The Voice of Creative Research

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.