Incident Response and Security Forensics - System Prompt Extraction via Hypothetical Framing Follow-Up Answers

Question document: README.md Source document: 05-incident-response-forensics.md Reference scenario: 01-prompt-injection-defense.md -> Scenario 3: System Prompt Extraction via Hypothetical Framing

Scenario lens: Non-literal extraction attempts that use hypothetical, creative, translated, or reframed prompts to expose confidential rules. Document lens: Incident Response and Security Forensics.

Use this file as the answer key for the follow-up questions in README.md.

Easy

Q1

Question: What information in incident response, containment, and forensic investigation would be most sensitive if an attacker used hypothetical framing to extract it? Answer: The sensitive target is any detail that teaches an attacker how the system constrains access or exposes privileged logic. Even partial leakage matters because structure and thresholds are enough to plan evasion.

Q2

Question: Why would literal pattern matching miss many of these attempts, and what safe default response should the system return? Answer: Literal matching misses paraphrase, translation, role-play, and hypothetical framing. The safe default is a short refusal plus redirect, not a custom explanation that reveals even more internal detail.

Medium

Q1

Question: How would you split confidential logic between prompt text, configuration, and code so partial leakage is less damaging? Answer: Split generic behavior into prompt text, but keep sensitive thresholds, secrets, or internal rules in code or config. That way partial prompt leakage is inconvenient instead of catastrophic.

Q2

Question: What post-generation scanning or canary strategy would you use to catch partial disclosure before it reaches the user? Answer: Use post-generation scanning for canary phrases, protected tokens, or config fragments before the answer leaves the system. It is a last line of defense, but it catches failures the input layer missed.

Hard

Q1

Question: How would you build a semantic evaluation set for creative reframing, translation, summarization, or role-play based extraction attempts relevant to this document? Answer: The eval set should mix direct asks, hypotheticals, poems, summaries, translations, and cross-lingual reframing. Score both exact leakage and partial reconstruction of the hidden logic.

Q2

Question: What are the tradeoffs between a lightweight classifier, a second-pass model, and static rules for detecting non-literal extraction attempts? Answer: Static rules are cheap for known phrases, lightweight classifiers add semantic coverage, and second-pass models improve recall at higher cost. In practice you combine them because each catches a different failure mode.

Very Hard

Q1

Question: How would you quantify leakage risk when the model never reveals a secret verbatim but exposes enough structure for an attacker to infer it? Answer: Measure leakage as reconstructability, not just exact string match. If an attacker can infer boundaries, thresholds, or workflow shape from fragments, that is still meaningful disclosure.

Q2

Question: What layered response policy would you define if you had to preserve helpfulness while proving confidential rules, thresholds, or architecture details cannot be reconstructed over repeated probes? Answer: A strong response policy narrows what the system can say about internal rules, shifts to boring canned replies for sensitive topics, and logs repeated probing for follow-up review. Helpfulness stays high for normal tasks while internal detail stays intentionally vague.