SocBERT: A Pretrained Model for Social Media Text

Yuting Guo, Abeed Sarker


Abstract
Pretrained language models (PLMs) trained on domain-specific data have proven effective for in-domain natural language processing (NLP) tasks. Our work aimed to develop a language model that is effective for NLP tasks on data from diverse social media platforms. We pretrained a language model on English Twitter and Reddit posts comprising 929M sequence blocks for 112K steps. We benchmarked our model against 3 transformer-based models (BERT, BERTweet, and RoBERTa) on 40 social media text classification tasks. The results showed that although our model did not perform best on all of the tasks, it outperformed the baseline model, BERT, on most of them, which illustrates its effectiveness. Our work also provides some insights into how to improve the efficiency of training PLMs.
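
The sketch below illustrates how a social-media PLM of this kind could be loaded for one of the benchmarked text classification tasks using the Hugging Face Transformers library. The model identifier, the binary label setup, and the example posts are assumptions for illustration, not details taken from this page.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Assumed Hub identifier for illustration; substitute the actual released checkpoint.
model_name = "sarkerlab/SocBERT-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=2 is an illustrative binary classification setup, not a detail from the paper.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Example social media posts (made up for illustration).
texts = [
    "feeling great after my run today",
    "this new medication gives me terrible headaches",
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Run a forward pass without gradient tracking and take the argmax class per post.
with torch.no_grad():
    logits = model(**inputs).logits
predictions = logits.argmax(dim=-1)
print(predictions.tolist())

In practice the classification head above would be fine-tuned on labeled data for the target task before its predictions are meaningful.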
Anthology ID:
2023.insights-1.5
Volume:
Proceedings of the Fourth Workshop on Insights from Negative Results in NLP
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Shabnam Tafreshi, Arjun Akula, João Sedoc, Aleksandr Drozd, Anna Rogers, Anna Rumshisky
Venues:
insights | WS
Publisher:
Association for Computational Linguistics
Pages:
45–52
URL:
https://aclanthology.org/2023.insights-1.5
DOI:
10.18653/v1/2023.insights-1.5
Cite (ACL):
Yuting Guo and Abeed Sarker. 2023. SocBERT: A Pretrained Model for Social Media Text. In Proceedings of the Fourth Workshop on Insights from Negative Results in NLP, pages 45–52, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
SocBERT: A Pretrained Model for Social Media Text (Guo & Sarker, insights-WS 2023)
PDF:
https://aclanthology.org/2023.insights-1.5.pdf
Video:
https://aclanthology.org/2023.insights-1.5.mp4