Detecting Training Text in Language Models
Results
We find that, for the models we tested, it is possible to detect with high accuracy whether a passage of text was in the training set of a language model. Our main results are as follows:
- Detection accuracy: For GPT-2, GPT-3, and GPT-4, we can detect whether a passage was in the training set with over 95% accuracy using a simple classifier based on model loss.
- Loss-based detection: Passages that were in the training set have significantly lower loss (i.e., they are more predictable) than passages that were not in the training set. This difference is robust across models and datasets; a minimal code sketch of this loss-based check appears below.
- Generalization: The detection method generalizes across different types of text and is not limited to specific domains or genres.
- Implications: These results suggest that it is feasible to audit language models for the presence of specific training data, which has implications for copyright, privacy, and responsible AI development.
For more details, see the full report and methodology.
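As a concrete illustration of the loss-based check, the sketch below scores a passage by its mean per-token loss under an open-weights causal language model and applies a threshold. This is a minimal sketch under stated assumptions, not the classifier from the report: the model name ("gpt2") and the threshold value are placeholders, and in practice the threshold would be calibrated on known member and non-member passages.

```python
# Minimal sketch of loss-based membership detection, assuming a HuggingFace
# causal LM. The model name and threshold are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any open-weights causal LM works here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def passage_loss(text: str) -> float:
    """Mean per-token cross-entropy of the passage under the model."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()

def likely_in_training_set(text: str, threshold: float = 3.0) -> bool:
    """One-feature classifier: unusually low loss -> likely seen in training."""
    return passage_loss(text) < threshold

if __name__ == "__main__":
    print(likely_in_training_set("To be, or not to be, that is the question."))
```

In this framing, the "classifier" is just a threshold on a single feature (mean loss); more elaborate detectors in the literature combine per-token statistics, reference models, or calibration terms.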
Related Papers
- Shi, Weijia et al. (2023). Detecting Pretraining Data from Large Language Models. arXiv:2310.16789
- Zhang, Weichao et al. (2024). Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method. arXiv:2409.14781
- Duarte, André V. et al. (2024). DE-COP: Detecting Copyrighted Content in Language Models Training Data. arXiv:2402.09910
- Meeus, Matthieu et al. (2024). Copyright Traps for Large Language Models. arXiv:2402.09363
- Ni, Shiwen et al. (2024). Training on the Benchmark Is Not All You Need. arXiv:2409.01790
- Wang, Jeffrey G. et al. (2024). Pandora's White-Box: Precise Training Data Detection and Extraction in Large Language Models. arXiv:2402.17012
- Zhang, Anqi and Wu, Chaofeng (2024). Adaptive Pre-training Data Detection for Large Language Models via Surprising Tokens. arXiv:2407.21248
- Zhang, Jingyang et al. (2024). Min-K%++: Improved Baseline for Detecting Pre-Training Data from Large Language Models. arXiv:2404.02936
- Wei, Johnny Tian-Zheng et al. (2024). Proving membership in LLM pretraining data via data watermarks. arXiv:2402.10892
- Meeus, Matthieu et al. (2023). Did the Neurons Read your Book? Document-level Membership Inference for Large Language Models. arXiv:2310.15007
Summary: Copyright Traps for Large Language Models
The paper Copyright Traps for Large Language Models addresses the ongoing debate about the fair use of copyright-protected content in training large language models (LLMs). The authors explore "document-level inference," which aims to determine if a specific piece of content was present in a model's training data using only black-box access. While state-of-the-art methods rely on natural memorization, the paper investigates the use of deliberately inserted "copyright traps"—unique sequences embedded in training data—to detect unauthorized use. The study finds that medium-length traps repeated a moderate number of times are not reliably detectable, but longer sequences repeated many times can be detected and used as effective copyright traps. This has significant implications for copyright enforcement and auditing of LLMs.
Copyright Traps
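To make the trap idea concrete, here is a hedged sketch of how one might test, with only loss queries to the model, whether a planted trap sequence was memorized: compare the model's loss on the trap against losses on control sequences the model has never seen. The model name, trap text, control sequences, and z-score cutoff below are all illustrative assumptions, not the exact procedure from the paper.

```python
# Hypothetical sketch of a trap-detection check, assuming a HuggingFace causal LM.
# A trap counts as "memorized" if its loss is markedly lower than the losses of
# unseen control sequences (simple z-score test with a placeholder cutoff).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_loss(model, tokenizer, text: str) -> float:
    """Mean per-token cross-entropy of `text` under the model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()

def trap_detected(model, tokenizer, trap: str, controls: list[str]) -> bool:
    """Compare the trap's loss to control losses via a z-score."""
    trap_loss = sequence_loss(model, tokenizer, trap)
    control_losses = [sequence_loss(model, tokenizer, c) for c in controls]
    mean = sum(control_losses) / len(control_losses)
    var = sum((x - mean) ** 2 for x in control_losses) / max(len(control_losses) - 1, 1)
    std = math.sqrt(var) or 1e-8
    return (trap_loss - mean) / std < -2.0  # cutoff is an illustrative choice

if __name__ == "__main__":
    name = "gpt2"  # placeholder model
    tok = AutoTokenizer.from_pretrained(name)
    lm = AutoModelForCausalLM.from_pretrained(name)
    trap = "zephyr quantum marigold 7183 sentinel phrase"          # hypothetical trap
    controls = [
        "obsidian walrus 4417 never appears in any corpus",        # hypothetical controls
        "paper lantern 9024 unseen control sequence example",
    ]
    print("trap memorized:", trap_detected(lm, tok, trap, controls))
```

The design choice to score against unseen controls rather than a fixed threshold reflects the black-box setting described in the summary: the auditor controls the trap and control text, but not the model's training pipeline or its typical loss scale.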
Datasets Used for Building LLMs
For a detailed overview of various sources and datasets used for training large language models (LLMs), please visit our dedicated wiki page:
Datasets Used for Building LLMs.
This resource describes the types of data, their provenance, and considerations for responsible AI development.
Responsible AI Conferences
For a curated and regularly updated list of major conferences focused on responsible AI—including policy, governance, ethics, fairness, and domain-specific applications—please visit our dedicated wiki page:
Responsible AI Conferences.
This resource highlights key events, dates, locations, and opportunities for engagement in the responsible AI community.