Parameter-Efficient Korean Character-Level Language Modeling

Marco Cognetta, Sangwhan Moon, Lawrence Wolf-Sonkin, Naoaki Okazaki


Abstract
Character-level language modeling has been shown empirically to perform well on highly agglutinative or morphologically rich languages while using only a small fraction of the parameters required by (sub)word models. Korean fits nicely into this framework, except that, like other CJK languages, it has a very large character vocabulary of 11,172 unique syllables. However, unlike Japanese Kanji and Chinese Hanzi, each Korean syllable can be uniquely factored into a small set of subcharacters, called jamo. We explore a “three-hot” scheme that exploits this decomposability to model at the syllable level while using only jamo-level representations. We find that our three-hot embedding and decoding scheme alleviates the two major issues with prior syllable- and jamo-level models: it requires fewer than 1% of the embedding parameters of a syllable model, and it does not triple the sequence length, as jamo models do. In addition, it addresses a theoretical flaw in a prior three-hot modeling scheme. Our experiments show that, even when reducing the number of embedding parameters by 99.6% (from 11.4M to just 36k), our model suffers no loss in translation quality compared to the baseline syllable model.
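
For readers unfamiliar with the factorization the abstract relies on, below is a minimal Python sketch of the standard Unicode Hangul arithmetic (an illustration drawn from the Unicode standard, not code from the paper): every precomposed syllable decomposes uniquely into a (lead consonant, vowel, tail consonant) index triple, and 19 × 21 × 28 = 11,172 accounts for the full syllable vocabulary.

# Minimal sketch of the Unicode Hangul syllable <-> jamo bijection.
# Assumption: standard Unicode arithmetic; this is not the paper's code.

S_BASE = 0xAC00                            # first Hangul syllable, '가'
N_LEADS, N_VOWELS, N_TAILS = 19, 21, 28    # 19 * 21 * 28 = 11,172 syllables

def to_jamo_indices(syllable: str) -> tuple[int, int, int]:
    """Factor one precomposed Hangul syllable into (lead, vowel, tail) indices."""
    offset = ord(syllable) - S_BASE
    assert 0 <= offset < N_LEADS * N_VOWELS * N_TAILS, "not a Hangul syllable"
    lead, rest = divmod(offset, N_VOWELS * N_TAILS)
    vowel, tail = divmod(rest, N_TAILS)
    return lead, vowel, tail

def from_jamo_indices(lead: int, vowel: int, tail: int) -> str:
    """Recompose the syllable; the factorization is a bijection."""
    return chr(S_BASE + (lead * N_VOWELS + vowel) * N_TAILS + tail)

print(to_jamo_indices("한"))        # (18, 0, 4): ㅎ + ㅏ + ㄴ
print(from_jamo_indices(18, 0, 4))  # '한'

Because of this bijection, a "three-hot" model needs embeddings for only 19 + 21 + 28 = 68 jamo rather than 11,172 syllables, which is the source of the parameter reduction the abstract reports, while each syllable still occupies a single sequence position.
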
Anthology ID:
2023.eacl-main.172
Volume:
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Andreas Vlachos, Isabelle Augenstein
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
2350–2356
URL:
https://aclanthology.org/2023.eacl-main.172
DOI:
10.18653/v1/2023.eacl-main.172
Bibkey:
Cite (ACL):
Marco Cognetta, Sangwhan Moon, Lawrence Wolf-Sonkin, and Naoaki Okazaki. 2023. Parameter-Efficient Korean Character-Level Language Modeling. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2350–2356, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
Parameter-Efficient Korean Character-Level Language Modeling (Cognetta et al., EACL 2023)
PDF:
https://aclanthology.org/2023.eacl-main.172.pdf
Video:
https://aclanthology.org/2023.eacl-main.172.mp4