Towards Disfluency Annotated Corpora for Indian Languages

Chayan Kochar, Vandan Vasantlal Mujadia, Pruthwik Mishra, Dipti Misra Sharma


Abstract
In the natural course of spoken language, individuals often engage in thinking and self-correction during speech production. These instances of interruption or correction are commonly referred to as disfluencies. When preparing data for downstream NLP tasks, these linguistic elements can be systematically removed, or handled as required, to enhance data quality. In this study, we present comprehensive research on disfluencies in Indian languages. Our approach involves not only annotating real-world conversation transcripts but also conducting a detailed analysis of the linguistic nuances inherent to Indian languages that must be considered during annotation. Additionally, we introduce a robust algorithm for the synthetic generation of disfluent data. This algorithm aims to facilitate more effective model training for the identification of disfluencies in real-world conversations, thereby contributing to the advancement of disfluency research in Indian languages.
Anthology ID:
2024.wildre-1.1
Volume:
Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Girish Nath Jha, Sobha L., Kalika Bali, Atul Kr. Ojha
Venues:
WILDRE | WS
Publisher:
ELRA and ICCL
Pages:
1–10
URL:
https://aclanthology.org/2024.wildre-1.1
Cite (ACL):
Chayan Kochar, Vandan Vasantlal Mujadia, Pruthwik Mishra, and Dipti Misra Sharma. 2024. Towards Disfluency Annotated Corpora for Indian Languages. In Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation, pages 1–10, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Towards Disfluency Annotated Corpora for Indian Languages (Kochar et al., WILDRE-WS 2024)
PDF:
https://aclanthology.org/2024.wildre-1.1.pdf