Synthetic Data Generation for Low-resource Grammatical Error Correction with Tagged Corruption Models

Felix Stahlberg, Shankar Kumar


Abstract
Tagged corruption models provide precise control over the introduction of grammatical errors into clean text. This capability has made them a powerful tool for generating pre-training data for grammatical error correction (GEC) in English. In this work, we demonstrate their application to four languages with substantially fewer GEC resources than English: German, Romanian, Russian, and Spanish. We release a new tagged-corruption dataset consisting of 2.5M examples per language that was generated by a fine-tuned PaLM 2 foundation model. Pre-training on tagged corruptions yields consistent gains across all four languages, especially for small model sizes and languages with limited human-labelled data.
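The abstract's core idea is an interface that maps (clean sentence, error-type tag) to a corrupted sentence. The paper implements this with a fine-tuned PaLM 2 model; the snippet below is only a minimal, self-contained sketch of that interface using hypothetical hand-written rules (the function name `corrupt` and the rule logic are illustrative assumptions, not the authors' method). Tag names loosely follow the ERRANT taxonomy the tagged-corruption literature builds on.

```python
# Toy illustration of the tagged-corruption interface:
# (clean sentence, error-type tag) -> sentence with one injected error.
# The paper uses a fine-tuned PaLM 2 model for this mapping; these
# rule-based corruptions are a hypothetical stand-in for exposition.

import random

def corrupt(sentence: str, tag: str, seed: int = 0) -> str:
    """Introduce one error of the requested ERRANT-style type."""
    rng = random.Random(seed)
    tokens = sentence.split()
    if tag == "M:DET":
        # Missing determiner: drop one article.
        articles = [i for i, t in enumerate(tokens)
                    if t.lower() in ("a", "an", "the")]
        if articles:
            del tokens[rng.choice(articles)]
    elif tag == "R:VERB:SVA":
        # Subject-verb agreement: strip a 3rd-person -s ending.
        for i, t in enumerate(tokens):
            if t.endswith("s") and not t.endswith("ss"):
                tokens[i] = t[:-1]
                break
    return " ".join(tokens)

print(corrupt("The cat sits on the mat.", "M:DET"))      # e.g. "cat sits on the mat."
print(corrupt("The cat sits on the mat.", "R:VERB:SVA")) # "The cat sit on the mat."
```

Conditioning on the tag is what gives the "precise control" the abstract mentions: the error-type distribution of the synthetic pre-training data can be chosen to match a target domain rather than left to chance.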
Anthology ID:
2024.bea-1.2
Volume:
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Ekaterina Kochmar, Marie Bexte, Jill Burstein, Andrea Horbach, Ronja Laarmann-Quante, Anaïs Tack, Victoria Yaneva, Zheng Yuan
Venue:
BEA
SIG:
SIGEDU
Publisher:
Association for Computational Linguistics
Pages:
11–16
URL:
https://aclanthology.org/2024.bea-1.2
Cite (ACL):
Felix Stahlberg and Shankar Kumar. 2024. Synthetic Data Generation for Low-resource Grammatical Error Correction with Tagged Corruption Models. In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), pages 11–16, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Synthetic Data Generation for Low-resource Grammatical Error Correction with Tagged Corruption Models (Stahlberg & Kumar, BEA 2024)
PDF:
https://aclanthology.org/2024.bea-1.2.pdf