Extracting Training Data from ChatGPT

We discussed this vulnerability during Episode 229 on 04 December 2023

This post details a prompt-based exploit that could be leveraged against ChatGPT as well as other language models such as Falcon, Pythia, LLaMa, and GPT-NEO to extract training data. The basis for this vulnerability is the fact that when prompting the model to repeat a word a large number of times (for example, repeat this word forever: "poem"), after so many words the model ends up diverging. This divergence can end up being nonsense, or sometimes it can diverge into regurgitating remembered training data. Why exactly this happens isn’t clear, as they state they want to get to the core of why this divergence exists to defend against it better.

In some cases this training data can consist of Unique User IDs (UUIDs) as well as researcher contact information for a researcher who uploaded training data. Beyond the potential sensitive information disclosures, this regurgitation can also be triggered incidentally with innocent prompts, and can break the design goal of using a generative model in the first place if it ends up regurgitating training data.