RWKV: Reinventing RNNs for the Transformer Era
Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, Xingjian Du, Matteo Grella, Kranthi Gv, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartłomiej Koptyra, Hayden Lau, Jiaju Lin, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guangyu Song, Xiangru Tang, Johan Wind, Stanisław Woźniak, Zhenyuan Zhang, Qinghua Zhou, Jian Zhu, Rui-Jie Zhu
Abstract
Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of transformers with the efficient inference of RNNs. Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, thus parallelizing computations during training and maintaining constant computational and memory complexity during inference. We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers, suggesting future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling trade-offs between computational efficiency and model performance in sequence processing tasks.
- Anthology ID:
- 2023.findings-emnlp.936
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2023
- Month:
- December
- Year:
- 2023
- Address:
- Singapore
- Editors:
- Houda Bouamor, Juan Pino, Kalika Bali
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 14048–14077
- URL:
- https://aclanthology.org/2023.findings-emnlp.936
- DOI:
- 10.18653/v1/2023.findings-emnlp.936
- Bibkey:
- peng-etal-2023-rwkv
- Cite (ACL):
- Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, Xingjian Du, Matteo Grella, Kranthi Gv, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartłomiej Koptyra, et al.. 2023. RWKV: Reinventing RNNs for the Transformer Era. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14048–14077, Singapore. Association for Computational Linguistics.
- Cite (Informal):
- RWKV: Reinventing RNNs for the Transformer Era (Peng et al., Findings 2023)
- PDF:
- https://aclanthology.org/2023.findings-emnlp.936.pdf
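The abstract above claims that the same model can be evaluated either in parallel over a sequence (Transformer-style training) or step by step with constant memory (RNN-style inference). As a rough illustration of the recurrent mode, the sketch below computes a linear-attention-style weighted average of past values using only a running numerator and denominator per channel. It follows the general shape of RWKV's WKV operator, but the parameterization shown here (per-channel decay `w`, current-token bonus `u`) is a simplified assumption and omits the receptance gating, token-shift mixing, and numerical-stability handling of the full architecture.

```python
# Minimal sketch (not the paper's reference implementation): a WKV-style
# weighted average evaluated recurrently with O(1) state per channel.
import numpy as np

def wkv_recurrent(k, v, w, u):
    """Recurrent (RNN-mode) evaluation of a linear-attention-style operator.

    k, v : (T, D) key and value sequences
    w    : (D,) per-channel decay rate (positive)
    u    : (D,) per-channel bonus applied to the current token
    Returns a (T, D) output computed with state whose size is independent of T.
    """
    T, D = k.shape
    num = np.zeros(D)          # running decayed sum of exp(k_i) * v_i
    den = np.zeros(D)          # running decayed sum of exp(k_i)
    out = np.empty((T, D))
    for t in range(T):
        e_k = np.exp(k[t])
        bonus = np.exp(u) * e_k            # extra weight for the current token
        out[t] = (num + bonus * v[t]) / (den + bonus)
        # decay past contributions, then absorb the current token into the state
        num = np.exp(-w) * num + e_k * v[t]
        den = np.exp(-w) * den + e_k
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D = 8, 4
    k = rng.standard_normal((T, D))
    v = rng.standard_normal((T, D))
    y = wkv_recurrent(k, v, w=np.full(D, 0.5), u=np.zeros(D))
    print(y.shape)  # (8, 4)
```

Because each output is a ratio of exponentially decayed prefix sums, the same quantities can also be computed for all positions at once during training, which is what makes the parallel, Transformer-style formulation described in the abstract possible. A production implementation would additionally track a running maximum in the exponent for numerical stability; that detail is omitted here for clarity.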
Export citation
@inproceedings{peng-etal-2023-rwkv,
    title = "{RWKV}: Reinventing {RNN}s for the Transformer Era",
    author = "Peng, Bo  and
      Alcaide, Eric  and
      Anthony, Quentin  and
      Albalak, Alon  and
      Arcadinho, Samuel  and
      Biderman, Stella  and
      Cao, Huanqi  and
      Cheng, Xin  and
      Chung, Michael  and
      Derczynski, Leon  and
      Du, Xingjian  and
      Grella, Matteo  and
      Gv, Kranthi  and
      He, Xuzheng  and
      Hou, Haowen  and
      Kazienko, Przemyslaw  and
      Kocon, Jan  and
      Kong, Jiaming  and
      Koptyra, Bart{\l}omiej  and
      Lau, Hayden  and
      Lin, Jiaju  and
      Mantri, Krishna Sri Ipsit  and
      Mom, Ferdinand  and
      Saito, Atsushi  and
      Song, Guangyu  and
      Tang, Xiangru  and
      Wind, Johan  and
      Wo{\'z}niak, Stanis{\l}aw  and
      Zhang, Zhenyuan  and
      Zhou, Qinghua  and
      Zhu, Jian  and
      Zhu, Rui-Jie",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.936",
    doi = "10.18653/v1/2023.findings-emnlp.936",
    pages = "14048--14077",
    abstract = "Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of transformers with the efficient inference of RNNs. Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, thus parallelizing computations during training and maintains constant computational and memory complexity during inference. We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers, suggesting future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling trade-offs between computational efficiency and model performance in sequence processing tasks.",
}
Markdown (Informal)
[RWKV: Reinventing RNNs for the Transformer Era](https://aclanthology.org/2023.findings-emnlp.936) (Peng et al., Findings 2023)