PII Protection and Data Privacy - Ignore All Previous Instructions Follow-Up Questions
Source document: 02-pii-protection-data-privacy.md Reference scenario: 01-prompt-injection-defense.md -> Scenario 1: Ignore All Previous Instructions
Scenario lens: Direct instruction override after several benign turns, with prompt dilution and instruction hierarchy as the main risks. Document lens: PII protection, redaction, deletion workflows, and privacy boundaries.
Use these prompts to push past the base scenario and explore deeper design, operational, interview, or storytelling tradeoffs.
Answer document: ANSWERS.md
Easy
- How would a direct instruction-override attempt surface inside PII protection, redaction, deletion workflows, and privacy boundaries, and which control should engage first: regex plus NER detection, custom detectors, confidence routing, and data minimization?
- What is the first metric or log signal you would inspect to confirm the attack changed behavior rather than just wording: PII precision and recall, false positives on manga terms, deletion SLA, and guest-session leakage?
Medium
- Where is the trust boundary in this design, and how do you keep attacker text from being interpreted as a higher-priority instruction?
- If a legitimate shopper says something like
ignore the spoiler warningsorforget the earlier recommendation, how would you tune the detector to reduce false positives without weakening the defense?
Hard
- How would you redesign the prompt assembly or control flow so distinguishing legitimate manga entities from real personal data across locales cannot silently weaken direct-injection resistance over long conversations?
- What negative tests would you add to CI so a future config, prompt, or dependency change cannot regress the protections described in 02-pii-protection-data-privacy.md?
Very Hard
- Assume the attacker adaptively probes the system for detector blind spots. How would you use redaction audit logs, deletion workflow traces, and multi-locale test sets to distinguish real resilience gains from attackers merely changing surface wording?
- What failure mode would still remain even after implementing regex plus NER detection, custom detectors, confidence routing, and data minimization, and what second-order mitigation would you add without blowing up latency or blocking legitimate shopping queries?