TEJAS Journal of Technologies and Humanitarian Science

ISSN : 2583-5599

Open Access | Quarterly | Peer Reviewed Journal


Reducing FastText's Limits in Romanized Language Detection


Yashi Bajpai
Student, Computer Science, National P.G. College, Lucknow, India

Author

Aditi Joshi
Student, Computer Science, National P.G. College, Lucknow, India

Author

Mr. Amit Srivastava
Assistant Professor, Computer Science, National P.G. College, Lucknow, India

Author


📌 DOI: https://doi.org/10.63920/tjths.44001

🔑 Keywords: Romanization; Romanized Text; Multilingual Processing; Transliteration; Language Identification; Tokenization in Mixed Scripts; Benchmarking in NLP

📅 Publication Date: 06 October 2025

📜 License:

  • Share — Copy and Redistribute the material
  • Adapt — Remix, Transform, and build upon the material
  • The licensor cannot revoke these freedoms as long as you follow the license terms.

Abstract:

Language identification models such as FastText are often used to identify the language of a given text. However, these models frequently struggle to classify text that is written in the Roman (Latin) script but belongs to a language that has historically used a non-Latin script, such as Hindi, Japanese, or Chinese. In our research, we analyze FastText's performance on romanized inputs and find a pattern of misclassification into unrelated languages, accompanied by lower confidence scores. We address this with a score-based thresholding method: if the confidence score that FastText returns is below a set threshold (0.5), the predicted language label is withheld and the input is classified as romanized. In testing on several languages and romanized inputs, this threshold-based method increases classification reliability. This study identifies a significant weakness in existing language identification systems and suggests a simple, adjustable modification to improve their effectiveness in multilingual, real-world situations.
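The thresholding rule described in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `classify_with_threshold`, the output label `"romanized"`, and the stub predictor are assumptions introduced here. The sketch relies only on the documented behavior of fastText's Python `predict` method, which returns a tuple of labels (prefixed with `__label__`) and a matching sequence of probabilities.

```python
from typing import Callable, Sequence, Tuple

# Shape of a fastText-style prediction: (labels, probabilities).
Prediction = Tuple[Sequence[str], Sequence[float]]

ROMANIZED_LABEL = "romanized"  # hypothetical output label, not from the paper
DEFAULT_THRESHOLD = 0.5        # threshold value stated in the abstract


def classify_with_threshold(
    text: str,
    predict: Callable[[str], Prediction],
    threshold: float = DEFAULT_THRESHOLD,
) -> Tuple[str, float]:
    """Return (label, score); the label becomes 'romanized' if score < threshold."""
    # fastText's predict handles one line at a time, so newlines are flattened.
    labels, scores = predict(text.replace("\n", " "))
    top_label = labels[0].replace("__label__", "")
    top_score = float(scores[0])
    if top_score < threshold:
        # Low confidence: withhold the predicted language and flag as romanized.
        return ROMANIZED_LABEL, top_score
    return top_label, top_score


def stub_predict(text: str) -> Prediction:
    """Stand-in predictor with made-up scores, used only to exercise the rule."""
    if "namaste" in text.lower():
        return (("__label__id",), (0.31,))  # low-confidence, unrelated language
    return (("__label__en",), (0.97,))


if __name__ == "__main__":
    print(classify_with_threshold("Namaste, aap kaise hain?", stub_predict))
    print(classify_with_threshold("How are you today?", stub_predict))
```

With a real model, the stub could be replaced by something like `lambda t: model.predict(t, k=1)`, where `model` is loaded with `fasttext.load_model(...)`. The specific model file the authors used is not stated in the abstract, so that choice is left open here. Because the comparison is strict, a score of exactly 0.5 keeps its predicted label.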



📖 How to Cite

Yashi Bajpai, Aditi Joshi and Mr. Amit Srivastava (2025). Reducing FastText's Limits in Romanized Language Detection. TEJAS J. Technol. Humanit. Sci., Vol. 04, Issue 04. https://doi.org/10.63920/tjths.44001
