TEJAS Journal of Technologies and Humanitarian Science

ISSN : 2583-5599

Open Access | Quarterly | Peer Reviewed Journal


Advances and Challenges in Preprocessing Hindi–English Code-Mixed Text for Multilingual NLP


Shruti Gupta
Scholar, Computer Science Department, National P.G. College, Lucknow, India

Author

Lakshya Srivastava
Scholar, Computer Science Department, National P.G. College, Lucknow, India

Author

Amit Srivastava
Assistant Professor, Computer Science Department, National P.G. College, Lucknow, India

Author

Gaurvi Shukla
Assistant Professor, Computer Science Department, National P.G. College, Lucknow, India

Author


📌 DOI: https://doi.org/10.63920/tjths.44005

🔑 Keywords: Hinglish; Code-Mixed Text; Text Preprocessing; Language Identification; Transliteration

📅 Publication Date: 06 October 2025

📜 License:

  • Share — Copy and Redistribute the material
  • Adapt — Remix, Transform, and build upon the material
  • The licensor cannot revoke these freedoms as long as you follow the license terms.

Abstract:

Hinglish, a code-mixed blend of Hindi and English, is widely used on social media and in online communication in multilingual regions such as India. Its informal structure, reliance on transliteration, and frequent switching between languages pose considerable problems for natural language processing (NLP) systems, and traditional preprocessing pipelines designed for monolingual text handle it poorly. This review describes Hinglish text preprocessing and the linguistic features of the language in depth. It also covers major datasets, benchmarks, and the most frequently used preprocessing techniques, including language identification, transliteration, token normalization, and multilingual embeddings. Recent developments such as contextual and code-mixed pretrained models are discussed as well. Despite this progress, concerns remain over data sparsity, annotation inconsistency, transliteration variability, and real-time processing. The paper further discusses emerging research areas such as adaptive preprocessing systems and multiscript corpora. Overall, this survey summarizes existing developments and the prospects for robust and culturally sensitive multilingual NLP applications.
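To make the preprocessing steps named in the abstract concrete, the following is a minimal illustrative sketch of token normalization followed by token-level language identification. It is not the method of any surveyed system: the word list `ROMAN_HINDI`, the helper names, and the rule of collapsing three or more repeated characters to one are all assumptions made for illustration, and a practical pipeline would use a trained classifier and a proper transliteration model instead.

```python
import re
import string

# Hypothetical toy lexicon of romanized Hindi words (an assumption for this
# sketch); real systems learn this from annotated code-mixed corpora.
ROMAN_HINDI = {"nahi", "hai", "kya", "bahut", "accha", "yaar", "mujhe"}

# Three or more consecutive repeats of the same character, e.g. "bahuuut".
ELONGATION = re.compile(r"(.)\1{2,}")


def normalize_token(token: str) -> str:
    """Lowercase a token and collapse character elongation to one character."""
    return ELONGATION.sub(r"\1", token.lower())


def is_devanagari(token: str) -> bool:
    """Return True if any character lies in the Devanagari Unicode block."""
    return any("\u0900" <= ch <= "\u097F" for ch in token)


def tag_token(token: str) -> str:
    """Assign a coarse language tag: 'hi', 'en', or 'other'."""
    norm = normalize_token(token)
    if is_devanagari(norm) or norm in ROMAN_HINDI:
        return "hi"
    if norm.isalpha():
        return "en"  # Latin-script word not found in the Hindi lexicon
    return "other"  # digits, emoji, mixed symbols, and so on


def preprocess(text: str) -> list[tuple[str, str]]:
    """Split on whitespace, strip ASCII punctuation, normalize, and tag."""
    result = []
    for raw in text.split():
        token = raw.strip(string.punctuation)
        if token:  # skip tokens that were punctuation only
            result.append((normalize_token(token), tag_token(token)))
    return result


print(preprocess("Yaar this movie is bahuuut accha!!"))
```

Collapsing every elongated run to a single character is a deliberate simplification: it restores "bahuuut" to "bahut" but would also damage an elongated word with a genuine double letter such as "coool", which illustrates the transliteration and spelling variability the abstract lists as an open concern.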


📖 How to Cite

Shruti Gupta, Lakshya Srivastava, Amit Srivastava and Gaurvi Shukla (2025). Advances and Challenges in Preprocessing Hindi–English Code-Mixed Text for Multilingual NLP. TEJAS J. Technol. Humanit. Sci., Vol. 04, Issue 04. https://doi.org/10.63920/tjths.44005
