A Multi-Perspective Analysis of Memorization in Large Language Models

Bowen Chen, Namgi Han, Yusuke Miyao


Abstract
Large Language Models (LLMs) can generate the same sequences contained in their pre-training corpora, a phenomenon known as memorization. Previous research studied it at a macro level, leaving micro yet important questions under-explored, e.g., what makes sentences memorized, the dynamics when generating memorized sequences, their connection to unmemorized sequences, and their predictability. We answer the above questions by analyzing the relationship of memorization with outputs from the LLM, namely, embeddings, probability distributions, and generated tokens. A memorization score is calculated as the overlap between generated tokens and actual continuations when the LLM is prompted with a context sequence from the pre-training corpora. Our findings reveal: (1) The inter-correlation between memorized/unmemorized sentences, model size, continuation size, and context size, as well as the transition dynamics between sentences of different memorization scores, (2) A sudden drop and increase in the frequency of input tokens when generating memorized/unmemorized sequences (boundary effect), (3) Clusters of sentences with different memorization scores in the embedding space, (4) An inverse boundary effect in the entropy of probability distributions for generated memorized/unmemorized sequences, (5) The predictability of memorization is related to model size and continuation length. In addition, we show that a Transformer model trained on the hidden states of the LLM can predict unmemorized tokens.
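The two quantities the abstract relies on, the memorization score and the entropy of a probability distribution, can be illustrated with a minimal sketch. The abstract only says the score is "the overlap between generated tokens and actual continuations"; the position-wise match fraction below is an assumption about how that overlap is counted, and the function names `memorization_score` and `entropy` are hypothetical, not taken from the paper's code.

```python
import math
from typing import Sequence


def memorization_score(generated: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of continuation positions where the generated token
    equals the actual token (assumed position-wise definition of overlap)."""
    if not actual:
        raise ValueError("actual continuation must be non-empty")
    # zip stops at the shorter sequence; missing generated tokens count as misses.
    matches = sum(1 for g, a in zip(generated, actual) if g == a)
    return matches / len(actual)


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical token ids: the model reproduces 3 of the 4 continuation tokens.
score = memorization_score([11, 42, 7, 99], [11, 42, 8, 99])
uniform_entropy = entropy([0.25, 0.25, 0.25, 0.25])
```

Under this assumed definition, a score of 1.0 corresponds to a fully memorized continuation and 0.0 to a fully unmemorized one; the entropy is highest when the distribution over next tokens is uniform and zero when all probability mass sits on one token.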
Anthology ID:
2024.emnlp-main.627
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
11190–11209
URL:
https://aclanthology.org/2024.emnlp-main.627/
DOI:
10.18653/v1/2024.emnlp-main.627
Cite (ACL):
Bowen Chen, Namgi Han, and Yusuke Miyao. 2024. A Multi-Perspective Analysis of Memorization in Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11190–11209, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
A Multi-Perspective Analysis of Memorization in Large Language Models (Chen et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.627.pdf