Incident Response and Security Forensics - System Prompt Extraction via Hypothetical Framing Follow-Up Questions
Source document: 05-incident-response-forensics.md
Reference scenario: 01-prompt-injection-defense.md -> Scenario 3: System Prompt Extraction via Hypothetical Framing
Scenario lens: Non-literal extraction attempts that use hypothetical, creative, translated, or reframed prompts to expose confidential rules.
Document lens: incident response, containment, and forensic investigation.
Use these prompts to push past the base scenario and explore deeper design, operational, interview, or storytelling tradeoffs.
Answer document: ANSWERS.md
Easy
- Which details of incident response, containment, and forensic investigation would be most damaging if an attacker extracted them through hypothetical framing?
- Why would literal pattern matching miss many of these attempts, and what safe default response should the system return?
Medium
- How would you split confidential logic between prompt text, configuration, and code so partial leakage is less damaging?
- What post-generation scanning or canary strategy would you use to catch partial disclosure before it reaches the user?
Hard
- How would you build a semantic evaluation set of extraction attempts based on creative reframing, translation, summarization, or role-play that are relevant to incident response and forensics?
- What are the tradeoffs between a lightweight classifier, a second-pass model, and static rules for detecting non-literal extraction attempts?
Very Hard
- How would you quantify leakage risk when the model never reveals a secret verbatim but exposes enough structure for an attacker to infer it?
- What layered response policy would you define if you had to preserve helpfulness while proving that confidential rules, thresholds, or architecture details cannot be reconstructed over repeated probes?