Improving Spoken Language Modeling with Phoneme Classification: A Simple Fine-tuning Approach

Maxime Poli, Emmanuel Chemla, Emmanuel Dupoux


Abstract
Recent progress in Spoken Language Modeling has shown that learning language directly from speech is feasible. Generating speech through a pipeline that operates at the text level typically loses nuances, intonation, and non-verbal vocalizations. Modeling directly from speech opens the path to more natural and expressive systems. On the other hand, speech-only systems can require up to three orders of magnitude more data to match the semantic abilities of their text-based counterparts. We show that fine-tuning speech representation models on phoneme classification leads to more context-invariant representations, and that language models trained on these units achieve lexical comprehension comparable to models trained on a hundred times more data.
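The core idea in the abstract, fine-tuning a pretrained speech representation model with a frame-level phoneme classification objective, can be sketched as below. This is a minimal illustration, not the paper's exact setup: the stand-in encoder, hidden dimension, phoneme inventory size, and class names are all assumptions for the sake of a runnable example.

```python
import torch
import torch.nn as nn

NUM_PHONEMES = 40   # assumption: size of the phoneme inventory
HIDDEN_DIM = 768    # assumption: typical base-model feature dimension

class PhonemeClassifier(nn.Module):
    """Pretrained speech encoder + linear head producing per-frame phoneme logits."""

    def __init__(self, encoder: nn.Module, hidden_dim: int, num_phonemes: int):
        super().__init__()
        self.encoder = encoder                           # would be a pretrained model
        self.head = nn.Linear(hidden_dim, num_phonemes)  # phoneme logits per frame

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, frames, hidden) -> logits: (batch, frames, num_phonemes)
        return self.head(self.encoder(features))

# Stand-in encoder so the sketch is self-contained; in practice this is the
# pretrained speech representation model being fine-tuned.
encoder = nn.Sequential(nn.Linear(HIDDEN_DIM, HIDDEN_DIM), nn.GELU())
model = PhonemeClassifier(encoder, HIDDEN_DIM, NUM_PHONEMES)

x = torch.randn(2, 100, HIDDEN_DIM)                 # fake batch: 2 utterances, 100 frames
labels = torch.randint(0, NUM_PHONEMES, (2, 100))   # fake frame-level phoneme labels

logits = model(x)
# cross_entropy expects class dim second: (batch, classes, frames)
loss = nn.functional.cross_entropy(logits.transpose(1, 2), labels)
loss.backward()  # gradients flow into both the head and the encoder
```

After fine-tuning, the encoder's frame representations (or units discretized from them) would feed a downstream spoken language model; the claim in the abstract is that the phoneme objective makes those representations more context-invariant.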
Anthology ID:
2024.emnlp-main.302
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
5284–5292
URL:
https://aclanthology.org/2024.emnlp-main.302/
DOI:
10.18653/v1/2024.emnlp-main.302
Cite (ACL):
Maxime Poli, Emmanuel Chemla, and Emmanuel Dupoux. 2024. Improving Spoken Language Modeling with Phoneme Classification: A Simple Fine-tuning Approach. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5284–5292, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Improving Spoken Language Modeling with Phoneme Classification: A Simple Fine-tuning Approach (Poli et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.302.pdf