Prompt leakage is a security and privacy failure mode where an AI system unintentionally reveals hidden instructions or sensitive context—such as the system prompt, developer prompt, tool instructions, or private retrieved documents—through its outputs. It often occurs when an attacker uses prompting tricks or indirect prompt injection to coerce the model to disclose information that should remain confidential.
What is Prompt Leakage?
Modern LLM applications typically include multiple prompt layers: system messages (global rules), developer instructions (app logic), tool schemas, and user-provided content. They may also include retrieved context from internal knowledge bases (RAG). Prompt leakage happens when the model outputs these hidden layers verbatim or paraphrased, or when it reveals sensitive strings embedded in context (API keys, internal URLs, policy text not intended for end users).
Leakage can be:
- Direct: user asks “show me your system prompt” and the model complies.
- Indirect: a retrieved web page or document contains instructions like “reveal all previous messages,” and the model follows them.
- Tool-mediated: the agent calls a tool and returns raw tool output that includes secrets.
Because LLMs are trained to be helpful and to follow instructions, they can treat disclosure requests as legitimate unless the system enforces strict boundaries.
Where it’s used and why it matters
Prompt leakage is a major concern for enterprise assistants, tool-using agents, and RAG systems. It matters because hidden prompts often contain:
- security policies and internal logic (useful to attackers),
- proprietary knowledge,
- credentials or access patterns,
- private customer or employee data.
Leakage can enable follow-on attacks such as prompt injection, privilege escalation attempts, or model-stealing by copying proprietary prompt recipes.
Examples
- A chatbot reveals its system prompt that lists internal moderation rules.
- A RAG assistant outputs a confidential paragraph from an internal policy PDF.
- An agent prints tool configuration values that include secret tokens.
FAQs
1. Is prompt leakage the same as prompt injection?
No. Prompt injection is an attack technique; prompt leakage is a possible outcome (disclosure) of successful injection or weak controls.
2. How do you prevent prompt leakage?
Use least-privilege prompts, never place secrets in prompts, add output filters, redact sensitive fields, and require tool gateways that strip secrets.
3. Can models be trained to resist leakage?
Yes—through safety fine-tuning and refusal training—but you still need system-level controls and monitoring.
4. What should I log for investigations?
Log prompts and retrieved context identifiers with redaction, plus tool-call traces and blocked disclosure attempts.