An Online Semantic-enhanced Dirichlet Model for Short Text Stream Clustering

Jay Kumar, Junming Shao, Salah Uddin, Wazir Ali


Abstract
Clustering short text streams is a challenging task due to their unique properties: infinite length, sparse data representation, and cluster evolution. Existing approaches often process short text streams in batches. However, determining the optimal batch size is usually difficult, since there is no a priori knowledge of when topics evolve. In addition, the traditional independent word representation in graphical models tends to cause a “term ambiguity” problem in short text clustering. Therefore, in this paper, we propose an Online Semantic-enhanced Dirichlet Model for short text stream clustering, called OSDM, which integrates word-occurrence semantic information (i.e., context) into a new graphical model and clusters each arriving short text automatically in an online way. Extensive results demonstrate that OSDM performs better than many state-of-the-art algorithms on both synthetic and real-world data sets.
Anthology ID:
2020.acl-main.70
Volume:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2020
Address:
Online
Editors:
Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
766–776
URL:
https://aclanthology.org/2020.acl-main.70
DOI:
10.18653/v1/2020.acl-main.70
Cite (ACL):
Jay Kumar, Junming Shao, Salah Uddin, and Wazir Ali. 2020. An Online Semantic-enhanced Dirichlet Model for Short Text Stream Clustering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 766–776, Online. Association for Computational Linguistics.
Cite (Informal):
An Online Semantic-enhanced Dirichlet Model for Short Text Stream Clustering (Kumar et al., ACL 2020)
PDF:
https://aclanthology.org/2020.acl-main.70.pdf
Video:
http://slideslive.com/38928978