Raviraj Joshi


2023

pdf bib
My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks
Tanmay Chavan | Omkar Gokhale | Aditya Kane | Shantanu Patankar | Raviraj Joshi
Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings)

pdf bib
mahaNLP: A Marathi Natural Language Processing Library
Vidula Magdum | Omkar Jayant Dhekane | Sharayu Sandeep Hiwarkhedkar | Saloni Sunil Mittal | Raviraj Joshi
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations

pdf bib
L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT
Samruddhi Deode | Janhavi Gadre | Aditi Kajale | Ananya Joshi | Raviraj Joshi
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

pdf bib
Rapid Speaker Adaptation in Low Resource Text to Speech Systems using Synthetic Data and Transfer learning
Raviraj Joshi | Nikesh Garera
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

pdf bib
L3Cube-MahaSent-MD: A Multi-domain Marathi Sentiment Analysis Dataset and Transformer Models
Aabha Pingle | Aditya Vyawahare | Isha Joshi | Rahul Tangsali | Raviraj Joshi
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

pdf bib
L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic languages
Aishwarya Mirashi | Srushti Sonavane | Purva Lingayat | Tejas Padhiyar | Raviraj Joshi
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

In this work, we introduce L3Cube-IndicNews, a multilingual text classification corpus aimed at curating a high-quality dataset for Indian regional languages, with a specific focus on news headlines and articles. We have centered our work on 10 prominent Indic languages, including Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada, Odia, Malayalam, and Punjabi. Each of these news datasets comprises 10 or more classes of news articles. L3CubeIndicNews offers 3 distinct datasets tailored to handle different document lengths that are classified as: Short Headlines Classification (SHC) dataset containing the news headline and news category, Long Document Classification (LDC) dataset containing the whole news article and the news category, and Long Paragraph Classification (LPC) containing sub-articles of the news and the news category. We maintain consistent labeling across all 3 datasets for in-depth length-based analysis. We evaluate each of these Indic language datasets using 4 different models including monolingual BERT, multilingual Indic Sentence BERT (IndicSBERT), and IndicBERT. This research contributes significantly to expanding the pool of available text classification datasets and also makes it possible to develop topic classification models for Indian regional languages. This also serves as an excellent resource for cross-lingual analysis owing to the high overlap of labels among languages. The datasets and models are shared publicly at https://github.com/ l3cube-pune/indic-nlp.

pdf bib
MahaSQuAD: Bridging Linguistic Divides in Marathi Question-Answering
Ruturaj Ghatage | Aditya Ashutosh Kulkarni | Rajlaxmi Patil | Sharvi Endait | Raviraj Joshi
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

Question-answering systems have revolutionized information retrieval, but linguistic and cultural boundaries limit their widespread accessibility. This research endeavors to bridge the gap of the absence of efficient QnA datasets in low-resource languages by translating the English Question Answering Dataset (SQuAD) using a robust data curation approach. We introduce MahaSQuAD, the first-ever full SQuAD dataset for the Indic language Marathi, consisting of 118,516 training, 11,873 validation, and 11,803 test samples. Challenges in maintaining context and handling linguistic nuances are addressed, ensuring accurate translations. Moreover, as a QnA dataset cannot be simply converted into any low-resource language using translation, we need a robust method to map the answer translation to its span in the translated passage. Hence, to address this challenge, we also present a generic approach for translating SQuAD into any low-resource language. Thus, we offer a scalable approach to bridge linguistic and cultural gaps present in low-resource languages, in the realm of question-answering systems. The datasets and models are shared publicly at https://github.com/l3cube-pune/MarathiNLP.

2022

pdf bib
Towards Simple and Efficient Task-Adaptive Pre-training for Text Classification
Arnav Ladkat | Aamir Miyajiwala | Samiksha Jagadale | Rekha A. Kulkarni | Raviraj Joshi
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Language models are pre-trained using large corpora of generic data like book corpus, com- mon crawl and Wikipedia, which is essential for the model to understand the linguistic characteristics of the language. New studies suggest using Domain Adaptive Pre-training (DAPT) and Task-Adaptive Pre-training (TAPT) as an intermediate step before the final finetuning task. This step helps cover the target domain vocabulary and improves the model performance on the downstream task. In this work, we study the impact of training only the embedding layer on the model’s performance during TAPT and task-specific finetuning. Based on our study, we propose a simple approach to make the in- termediate step of TAPT for BERT-based mod- els more efficient by performing selective pre-training of BERT layers. We show that training only the BERT embedding layer during TAPT is sufficient to adapt to the vocabulary of the target domain and achieve comparable performance. Our approach is computationally efficient, with 78% fewer parameters trained during TAPT. The proposed embedding layer finetuning approach can also be an efficient domain adaptation technique.

pdf bib
A Simple Baseline for Domain Adaptation in End to End ASR Systems Using Synthetic Data
Raviraj Joshi | Anupam Singh
Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5)

Automatic Speech Recognition(ASR) has been dominated by deep learning-based end-to-end speech recognition models. These approaches require large amounts of labeled data in the form of audio-text pairs. Moreover, these models are more susceptible to domain shift as compared to traditional models. It is common practice to train generic ASR models and then adapt them to target domains using comparatively smaller data sets. We consider a more extreme case of domain adaptation where text-only corpus is available. In this work, we propose a simple baseline technique for domain adaptation in end-to-end speech recognition models. We convert the text-only corpus to audio data using single speaker Text to Speech (TTS) engine. The parallel data in the target domain is then used to fine-tune the final dense layer of generic ASR models. We show that single speaker synthetic TTS data coupled with final dense layer only fine-tuning provides reasonable improvements in word error rates. We use text data from address and e-commerce search domains to show the effectiveness of our low-cost baseline approach on CTC and attention-based models.

pdf bib
L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models
Ravindra Nayak | Raviraj Joshi
Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference

Code-switching occurs when more than one language is mixed in a given sentence or a conversation. This phenomenon is more prominent on social media platforms and its adoption is increasing over time. Therefore code-mixed NLP has been extensively studied in the literature. As pre-trained transformer-based architectures are gaining popularity, we observe that real code-mixing data are scarce to pre-train large language models. We present L3Cube-HingCorpus, the first large-scale real Hindi-English code mixed data in a Roman script. It consists of 52.93M sentences and 1.04B tokens, scraped from Twitter. We further present HingBERT, HingMBERT, HingRoBERTa, and HingGPT. The BERT models have been pre-trained on codemixed HingCorpus using masked language modelling objectives. We show the effectiveness of these BERT models on the subsequent downstream tasks like code-mixed sentiment analysis, POS tagging, NER, and LID from the GLUECoS benchmark. The HingGPT is a GPT2 based generative transformer model capable of generating full tweets. Our models show significant improvements over currently available models pre-trained on multiple languages and synthetic code-mixed datasets. We also release L3Cube-HingLID Corpus, the largest code-mixed Hindi-English language identification(LID) dataset and HingBERT-LID, a production-quality LID model to facilitate capturing of more code-mixed data using the process outlined in this work. The dataset and models are available at https://github.com/l3cube-pune/code-mixed-nlp.

pdf bib
L3Cube-MahaNER: A Marathi Named Entity Recognition Dataset and BERT models
Onkar Litake | Maithili Ravindra Sabane | Parth Sachin Patil | Aparna Abhijeet Ranade | Raviraj Joshi
Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference

Named Entity Recognition (NER) is a basic NLP task and finds major applications in conversational and search systems. It helps us identify key entities in a sentence used for the downstream application. NER or similar slot filling systems for popular languages have been heavily used in commercial applications. In this work, we focus on Marathi, an Indian language, spoken prominently by the people of Maharashtra state. Marathi is a low resource language and still lacks useful NER resources. We present L3Cube-MahaNER, the first major gold standard named entity recognition dataset in Marathi. We also describe the manual annotation guidelines followed during the process. In the end, we benchmark the dataset on different CNN, LSTM, and Transformer based models like mBERT, XLM-RoBERTa, IndicBERT, MahaBERT, etc. The MahaBERT provides the best performance among all the models. The data and models are available at https://github.com/l3cube-pune/MarathiNLP .

pdf bib
L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources
Raviraj Joshi
Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference

We present L3Cube-MahaCorpus a Marathi monolingual data set scraped from different internet sources. We expand the existing Marathi monolingual corpus with 24.8M sentences and 289M tokens. We further present, MahaBERT, MahaAlBERT, and MahaRoBerta all BERT-based masked language models, and MahaFT, the fast text word embeddings both trained on full Marathi corpus with 752M tokens. We show the effectiveness of these resources on downstream Marathi sentiment analysis, text classification, and named entity recognition (NER) tasks. We also release MahaGPT, a generative Marathi GPT model trained on Marathi corpus. Marathi is a popular language in India but still lacks these resources. This work is a step forward in building open resources for the Marathi language. The data and models are available at https://github.com/l3cube-pune/MarathiNLP .

pdf bib
L3Cube-MahaHate: A Tweet-based Marathi Hate Speech Detection Dataset and BERT Models
Hrushikesh Patil | Abhishek Velankar | Raviraj Joshi
Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022)

Social media platforms are used by a large number of people prominently to express their thoughts and opinions. However, these platforms have contributed to a sub stantial amount of hateful and abusive content as well. Therefore, it is impor tant to curb the spread of hate speech on these platforms. In India, Marathi is one of the most popular languages used by a wide audience. In this work, we present L3Cube-MahaHate, the first ma jor Hate Speech Dataset in Marathi. The dataset is curated from Twitter, anno tated manually. Our dataset consists of over 00 distinct tweets labeled into four major classes i.e hate, offensive, pro fane, and not. We present the approaches used for collecting and annotating the data and the challenges faced during the pro cess. Finally, we present baseline classi fication results using deep learning mod els based on CNN, LSTM, and Transform ers. We explore mono-lingual and multi lingual variants of BERT like MahaBERT, IndicBERT, mBERT, and xlm-RoBERTa and show that mono-lingual models per form better than their multi-lingual coun terparts. The MahaBERT model provides the best results on L3Cube-MahaHate Corpus.

2021

pdf bib
L3CubeMahaSent: A Marathi Tweet-based Sentiment Analysis Dataset
Atharva Kulkarni | Meet Mandhane | Manali Likhitkar | Gayatri Kshirsagar | Raviraj Joshi
Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Sentiment analysis is one of the most fundamental tasks in Natural Language Processing. Popular languages like English, Arabic, Russian, Mandarin, and also Indian languages such as Hindi, Bengali, Tamil have seen a significant amount of work in this area. However, the Marathi language which is the third most popular language in India still lags behind due to the absence of proper datasets. In this paper, we present the first major publicly available Marathi Sentiment Analysis Dataset - L3CubeMahaSent. It is curated using tweets extracted from various Maharashtrian personalities’ Twitter accounts. Our dataset consists of ~16,000 distinct tweets classified in three broad classes viz. positive, negative, and neutral. We also present the guidelines using which we annotated the tweets. Finally, we present the statistics of our dataset and baseline classification results using CNN, LSTM, ULMFiT, and BERT based models.

2020

pdf bib
Domain Adaptation of NMT models for English-Hindi Machine Translation Task : AdapMT Shared Task ICON 2020
Ramchandra Joshi | Rusbabh Karnavat | Kaustubh Jirapure | Raviraj Joshi
Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task

Recent advancements in Neural Machine Translation (NMT) models have proved to produce a state of the art results on machine translation for low resource Indian languages. This paper describes the neural machine translation systems for the English-Hindi language presented in AdapMT Shared Task ICON 2020. The shared task aims to build a translation system for Indian languages in specific domains like Artificial Intelligence (AI) and Chemistry using a small in-domain parallel corpus. We evaluated the effectiveness of two popular NMT models i.e, LSTM, and Transformer architectures for the English-Hindi machine translation task based on BLEU scores. We train these models primarily using the out of domain data and employ simple domain adaptation techniques based on the characteristics of the in-domain dataset. The fine-tuning and mixed-domain data approaches are used for domain adaptation. The system achieved the second-highest score on chemistry and general domain En-Hi translation task and the third-highest score on the AI domain En-Hi translation task.