TEJAS Journal of Technologies and Humanitarian Science

ISSN : 2583-5599

Open Access | Quarterly | Peer Reviewed Journal

October, 2025 | Volume 04 | Issue 04

Paper 5: Advances and Challenges in Preprocessing Hindi–English Code-Mixed Text for Multilingual NLP

Authors: Shruti Gupta, Lakshya Srivastava, Amit Srivastava and Gaurvi Shukla

DOI: https://doi.org/10.63920/tjths.44005

Abstract

Hinglish, a code-mixed variety of Hindi and English, is widely used on social media and in online communication in multilingual regions such as India. Its informal structure, reliance on transliteration, and frequent switching between languages pose considerable problems for natural language processing (NLP) systems, and traditional preprocessing pipelines designed for monolingual text handle it poorly. This review describes the linguistic features of Hinglish and the preprocessing of Hinglish text in depth. It also covers major datasets and benchmarks and the most frequently used preprocessing techniques, including language identification, transliteration, token normalization, and multilingual embeddings. Recent developments, such as contextual and code-mixed pretrained models, are also discussed. Despite these advances, data sparsity, annotation inconsistency, transliteration variability, and real-time processing remain open concerns. The paper also discusses emerging research directions, such as adaptive preprocessing systems and multiscript corpora. Overall, the survey provides useful information on current developments and on the prospects for robust, culturally sensitive multilingual NLP applications.
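To make the preprocessing steps named above concrete, the following is a minimal Python sketch of token normalization followed by token-level language tagging. It is an illustration under stated assumptions and is not taken from the paper: the `ROMAN_HINDI` word list, the function names, and the rule that any unlisted alphabetic token counts as English are all hypothetical simplifications. A practical system would use a trained language identifier and a proper transliteration model.

```python
import re
import string
import unicodedata
from typing import List, Tuple

# Hypothetical toy lexicon of romanized Hindi words (an assumption for
# illustration only; real systems learn this from annotated data).
ROMAN_HINDI = {"hai", "nahi", "bahut", "kya", "acha", "yaar", "mujhe"}

DEVANAGARI = re.compile(r"[\u0900-\u097F]")   # Devanagari Unicode block
ELONGATION = re.compile(r"(.)\1{2,}")         # 3+ repeats of one character


def normalize_token(token: str) -> str:
    """Apply Unicode NFC, lowercase, and squeeze elongated characters.

    Collapsing to a single character is a simplification: it would
    damage an elongated word with a legitimate double letter.
    """
    token = unicodedata.normalize("NFC", token).lower()
    return ELONGATION.sub(r"\1", token)


def tag_token(token: str) -> str:
    """Tag a normalized token as 'hi', 'en', or 'other'."""
    if DEVANAGARI.search(token):
        return "hi"          # native-script Hindi
    if token in ROMAN_HINDI:
        return "hi"          # romanized Hindi found in the toy lexicon
    if token.isalpha():
        return "en"          # crude fallback assumption
    return "other"


def preprocess(text: str) -> List[Tuple[str, str]]:
    """Split on whitespace, strip ASCII punctuation, normalize, and tag."""
    result = []
    for raw in text.split():
        stripped = raw.strip(string.punctuation)
        if not stripped:
            continue
        token = normalize_token(stripped)
        result.append((token, tag_token(token)))
    return result


if __name__ == "__main__":
    for token, lang in preprocess("Movie bahuuut acha hai, really GOOD!"):
        print(token, lang)
```

The sketch normalizes before tagging so that spelling variants such as "bahuuut" reduce to a form the lexicon can match, which reflects why transliteration variability makes language identification harder for code-mixed text.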
