Reducing FastText's Limits in Romanized Language Detection
Yashi Bajpai
Student, Computer Science, National P.G. College, Lucknow, India
Author
Aditi Joshi
Student, Computer Science, National P.G. College, Lucknow, India
Author
Mr. Amit Srivastava
Assistant Professor, Computer Science, National P.G. College, Lucknow, India
Author
📌 DOI: https://doi.org/10.63920/tjths.44001
🔑 Keywords: Romanization; Romanized Text; Multilingual Processing; Transliteration; Language Identification; Tokenization in Mixed Scripts; Benchmarking in NLP
📅 Publication Date: 06 October 2025
📜 License:
This work is licensed under a Creative Commons Attribution 4.0 International License
- Share — Copy and Redistribute the material
- Adapt — Remix, Transform, and build upon the material
- The licensor cannot revoke these freedoms as long as you follow the license terms.
Abstract:
Language identification models such as FastText are widely used to identify the language of a given text. However, these models frequently have trouble accurately categorizing text written in the Roman (Latin) script for languages that have historically used non-Latin scripts, such as Hindi, Japanese and Chinese. In our research, we analyze FastText's performance on romanized inputs and find a pattern of misclassification into unrelated languages, accompanied by lower confidence scores. We address this with a score-based thresholding method, which withholds the predicted language label and classifies the input as romanized if the confidence score that FastText returns is less than the set threshold (0.5). In tests on several languages and romanized inputs, this threshold-based method increases classification reliability. This study identifies a significant weakness in existing language identification systems and suggests a simple, adjustable modification to improve their effectiveness in multilingual, real-world situations.
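The thresholding rule described in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the names `classify_with_threshold`, `fasttext_predictor`, `stub_predict` and the output label `"romanized"` are assumptions made here, and the stub scores are invented for demonstration. Only the 0.5 threshold and the "below threshold means romanized" rule come from the abstract.

```python
from typing import Callable, Tuple

ROMANIZED_LABEL = "romanized"  # hypothetical label name, not taken from the paper
THRESHOLD = 0.5                # threshold value stated in the abstract


def classify_with_threshold(
    text: str,
    predict: Callable[[str], Tuple[str, float]],
    threshold: float = THRESHOLD,
) -> Tuple[str, float]:
    """Return (label, score); replace the label when the score is below threshold."""
    label, score = predict(text)
    if score < threshold:
        # Low confidence: withhold the predicted language and flag as romanized.
        return ROMANIZED_LABEL, score
    return label, score


def fasttext_predictor(model) -> Callable[[str], Tuple[str, float]]:
    """Wrap a loaded fastText model (e.g. fasttext.load_model("lid.176.bin")).

    Assumes the model's predict(text, k=1) returns (labels, probabilities)
    with labels shaped like "__label__en".
    """
    def predict(text: str) -> Tuple[str, float]:
        # fastText predicts one line at a time, so newlines are flattened.
        labels, probs = model.predict(text.replace("\n", " "), k=1)
        return labels[0].replace("__label__", ""), float(probs[0])
    return predict


def stub_predict(text: str) -> Tuple[str, float]:
    """Stand-in for a real model; labels and scores are made up for illustration."""
    table = {
        "hello how are you": ("en", 0.98),
        "aap kaise hain": ("id", 0.31),
    }
    return table.get(text, ("und", 0.0))


if __name__ == "__main__":
    for sample in ("hello how are you", "aap kaise hain"):
        print(sample, "->", classify_with_threshold(sample, stub_predict))
```

Passing the predictor in as a callable keeps the rule independent of the model, and the `threshold` argument reflects the abstract's point that the cutoff is adjustable.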
📖 How to Cite
Yashi Bajpai, Aditi Joshi and Mr. Amit Srivastava (2025). Reducing FastText's Limits in Romanized Language Detection. TEJAS J. Technol. Humanit. Sci., Vol. 04, Issue 04. https://doi.org/10.63920/tjths.44001
