Prompt Injection Defense - Ignore All Previous Instructions Follow-Up Questions
Source document: 01-prompt-injection-defense.md Reference scenario: 01-prompt-injection-defense.md -> Scenario 1: Ignore All Previous Instructions
Scenario lens: Direct instruction override after several benign turns, with prompt dilution and instruction hierarchy as the main risks. Document lens: prompt injection defense architecture.
Use these prompts to push past the base scenario and explore deeper design, operational, interview, or storytelling tradeoffs.
Answer document: ANSWERS.md
Easy
- How would a direct instruction-override attempt surface inside prompt injection defense architecture, and which control should engage first: input scanning, instruction reinforcement, source isolation, and output validation?
- What is the first metric or log signal you would inspect to confirm the attack changed behavior rather than just wording: injection compliance rate, prompt leak rate, and false positives?
Medium
- Where is the trust boundary in this design, and how do you keep attacker text from being interpreted as a higher-priority instruction?
- If a legitimate shopper says something like
ignore the spoiler warningsorforget the earlier recommendation, how would you tune the detector to reduce false positives without weakening the defense?
Hard
- How would you redesign the prompt assembly or control flow so instruction dilution in long sessions and creative bypass phrasing cannot silently weaken direct-injection resistance over long conversations?
- What negative tests would you add to CI so a future config, prompt, or dependency change cannot regress the protections described in 01-prompt-injection-defense.md?
Very Hard
- Assume the attacker adaptively probes the system for detector blind spots. How would you use conversation logs, prompt assembly traces, and adversarial regression results to distinguish real resilience gains from attackers merely changing surface wording?
- What failure mode would still remain even after implementing input scanning, instruction reinforcement, source isolation, and output validation, and what second-order mitigation would you add without blowing up latency or blocking legitimate shopping queries?