Enhancing Masked Word Prediction in Tamil Language Models: A Synergistic Approach Using BERT and SBERT

Viswadarshan R R, Viswaa Selvam S, Felicia Lilian J, Mahalakshmi S


Abstract
This work presents a novel approach to enhancing masked word prediction and sentence-level semantic analysis in Tamil language models. By synergistically combining BERT and Sentence-BERT (SBERT), we leverage the strengths of both architectures to capture both contextual understanding and semantic relationships in Tamil sentences. Our methodology incorporates sentence tokenization as a crucial pre-processing step, preserving the grammatical structure and word-level dependencies of Tamil sentences. We trained BERT and SBERT on a diverse corpus of Tamil data, including synthetic datasets, the OSCAR corpus, the AI4Bharat parallel corpus, and data extracted from Tamil Wikipedia and news websites. The combined model effectively predicts masked words while maintaining semantic coherence in the generated sentences. While traditional accuracy metrics may not fully capture the model’s performance, intrinsic and extrinsic evaluations show that the model generates contextually relevant and linguistically sound outputs. Our research highlights the importance of sentence tokenization and of the synergistic combination of BERT and SBERT for improving masked word prediction in Tamil.
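One plausible reading of the combination described in the abstract is a re-ranking scheme: BERT proposes candidate fill-ins for the masked position with masked-LM probabilities, SBERT scores how semantically coherent each filled-in sentence is with its context, and the two scores are blended. The sketch below illustrates that fusion step only; the blending weight, the function name, and the mock scores are illustrative assumptions, not the authors' exact method.

```python
def rerank_candidates(bert_probs, sbert_sims, alpha=0.6):
    """Blend BERT fill-mask probability with SBERT semantic similarity.

    bert_probs: dict of candidate word -> masked-LM probability (from BERT)
    sbert_sims: dict of candidate word -> cosine similarity of the filled
                sentence to its surrounding context (from SBERT)
    alpha:      weight on the BERT score; (1 - alpha) goes to SBERT
    Returns candidates sorted best-first by the blended score.
    """
    combined = {
        w: alpha * bert_probs[w] + (1 - alpha) * sbert_sims[w]
        for w in bert_probs
    }
    return sorted(combined, key=combined.get, reverse=True)


# Mock scores for three Tamil candidates for one masked position.
# BERT alone prefers "புத்தகம்" (book), but SBERT's sentence-level
# similarity pulls "மரம்" (tree) to the top of the blended ranking.
bert_probs = {"புத்தகம்": 0.45, "மரம்": 0.35, "வீடு": 0.20}
sbert_sims = {"புத்தகம்": 0.30, "மரம்": 0.85, "வீடு": 0.40}

print(rerank_candidates(bert_probs, sbert_sims, alpha=0.6))
```

In this toy run the blended scores are 0.39, 0.55, and 0.28 respectively, so the semantically coherent candidate outranks the one BERT alone preferred — the effect the paper attributes to adding SBERT.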
Anthology ID:
2024.icon-1.55
Volume:
Proceedings of the 21st International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2024
Address:
AU-KBC Research Centre, Chennai, India
Editors:
Sobha Lalitha Devi, Karunesh Arora
Venue:
ICON
Publisher:
NLP Association of India (NLPAI)
Pages:
472–480
URL:
https://aclanthology.org/2024.icon-1.55/
Cite (ACL):
Viswadarshan R R, Viswaa Selvam S, Felicia Lilian J, and Mahalakshmi S. 2024. Enhancing Masked Word Prediction in Tamil Language Models: A Synergistic Approach Using BERT and SBERT. In Proceedings of the 21st International Conference on Natural Language Processing (ICON), pages 472–480, AU-KBC Research Centre, Chennai, India. NLP Association of India (NLPAI).
Cite (Informal):
Enhancing Masked Word Prediction in Tamil Language Models: A Synergistic Approach Using BERT and SBERT (R R et al., ICON 2024)
PDF:
https://aclanthology.org/2024.icon-1.55.pdf