Prompt Injection Defense - Ignore All Previous Instructions Follow-Up Questions

Source document: 01-prompt-injection-defense.md Reference scenario: 01-prompt-injection-defense.md -> Scenario 1: Ignore All Previous Instructions

Scenario lens: Direct instruction override after several benign turns, with prompt dilution and instruction hierarchy as the main risks. Document lens: prompt injection defense architecture.

Use these prompts to push past the base scenario and explore deeper design, operational, interview, or storytelling tradeoffs.

Answer document: ANSWERS.md

Easy

How would a direct instruction-override attempt surface inside prompt injection defense architecture, and which control should engage first: input scanning, instruction reinforcement, source isolation, and output validation?
What is the first metric or log signal you would inspect to confirm the attack changed behavior rather than just wording: injection compliance rate, prompt leak rate, and false positives?

Medium

Where is the trust boundary in this design, and how do you keep attacker text from being interpreted as a higher-priority instruction?
If a legitimate shopper says something like ignore the spoiler warnings or forget the earlier recommendation, how would you tune the detector to reduce false positives without weakening the defense?

Hard

How would you redesign the prompt assembly or control flow so instruction dilution in long sessions and creative bypass phrasing cannot silently weaken direct-injection resistance over long conversations?
What negative tests would you add to CI so a future config, prompt, or dependency change cannot regress the protections described in 01-prompt-injection-defense.md?

Very Hard

Assume the attacker adaptively probes the system for detector blind spots. How would you use conversation logs, prompt assembly traces, and adversarial regression results to distinguish real resilience gains from attackers merely changing surface wording?
What failure mode would still remain even after implementing input scanning, instruction reinforcement, source isolation, and output validation, and what second-order mitigation would you add without blowing up latency or blocking legitimate shopping queries?