ML-Specific Threats and Adversarial AI - System Prompt Extraction via Hypothetical Framing Follow-Up Questions

Source document: 06-ml-specific-threats.md Reference scenario: 01-prompt-injection-defense.md -> Scenario 3: System Prompt Extraction via Hypothetical Framing

Scenario lens: Non-literal extraction attempts that use hypothetical, creative, translated, or reframed prompts to expose confidential rules. Document lens: ML-specific threats such as extraction, poisoning, inversion, and adversarial evasion.

Use these prompts to push past the base scenario and explore deeper design, operational, interview, or storytelling tradeoffs.

Answer document: ANSWERS.md

Easy

What information in ML-specific threats such as extraction, poisoning, inversion, and adversarial evasion would be most sensitive if an attacker used hypothetical framing to extract it?
Why would literal pattern matching miss many of these attempts, and what safe default response should the system return?

Medium

How would you split confidential logic between prompt text, configuration, and code so partial leakage is less damaging?
What post-generation scanning or canary strategy would you use to catch partial disclosure before it reaches the user?

Hard

How would you build a semantic evaluation set for creative reframing, translation, summarization, or role-play based extraction attempts relevant to this document?
What are the tradeoffs between a lightweight classifier, a second-pass model, and static rules for detecting non-literal extraction attempts?

Very Hard

How would you quantify leakage risk when the model never reveals a secret verbatim but exposes enough structure for an attacker to infer it?
What layered response policy would you define if you had to preserve helpfulness while proving confidential rules, thresholds, or architecture details cannot be reconstructed over repeated probes?