Data-Informed Global Sparseness in Attention Mechanisms for Deep Neural Networks

Ileana Rugina, Rumen Dangovski, Li Jing, Preslav Nakov, Marin Soljacic


Abstract
Attention mechanisms play a crucial role in the neural revolution of Natural Language Processing (NLP). With the growth of attention-based models, several pruning techniques have been developed to identify and exploit sparseness, making these models more efficient. Most efforts focus on hard-coding attention patterns or pruning attention weights based on training data. We propose Attention Pruning (AP), a framework that observes attention patterns in a fixed dataset and generates a global sparseness mask. AP saves 90% of attention computation for language modeling and about 50% for machine translation and GLUE tasks, maintaining result quality. Our method reveals important distinctions between self- and cross-attention patterns, guiding future NLP research. Our framework can reduce both latency and memory requirements for any attention-based model, aiding in the development of improved models for existing or new NLP applications. We have demonstrated this with encoder and autoregressive transformer models using Triton GPU kernels and make our code publicly available at https://github.com/irugina/AP
Anthology ID:
2024.lrec-main.392
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
SIG:
Publisher:
ELRA and ICCL
Note:
Pages:
4392–4403
Language:
URL:
https://aclanthology.org/2024.lrec-main.392
DOI:
Bibkey:
Cite (ACL):
Ileana Rugina, Rumen Dangovski, Li Jing, Preslav Nakov, and Marin Soljacic. 2024. Data-Informed Global Sparseness in Attention Mechanisms for Deep Neural Networks. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 4392–4403, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Data-Informed Global Sparseness in Attention Mechanisms for Deep Neural Networks (Rugina et al., LREC-COLING 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.lrec-main.392.pdf