LOCAL PREVIEW View on GitHub

PII Protection and Data Privacy - Ignore All Previous Instructions Follow-Up Questions

Source document: 02-pii-protection-data-privacy.md Reference scenario: 01-prompt-injection-defense.md -> Scenario 1: Ignore All Previous Instructions

Scenario lens: Direct instruction override after several benign turns, with prompt dilution and instruction hierarchy as the main risks. Document lens: PII protection, redaction, deletion workflows, and privacy boundaries.

Use these prompts to push past the base scenario and explore deeper design, operational, interview, or storytelling tradeoffs.

Answer document: ANSWERS.md

Easy

  1. How would a direct instruction-override attempt surface inside PII protection, redaction, deletion workflows, and privacy boundaries, and which control should engage first: regex plus NER detection, custom detectors, confidence routing, and data minimization?
  2. What is the first metric or log signal you would inspect to confirm the attack changed behavior rather than just wording: PII precision and recall, false positives on manga terms, deletion SLA, and guest-session leakage?

Medium

  1. Where is the trust boundary in this design, and how do you keep attacker text from being interpreted as a higher-priority instruction?
  2. If a legitimate shopper says something like ignore the spoiler warnings or forget the earlier recommendation, how would you tune the detector to reduce false positives without weakening the defense?

Hard

  1. How would you redesign the prompt assembly or control flow so distinguishing legitimate manga entities from real personal data across locales cannot silently weaken direct-injection resistance over long conversations?
  2. What negative tests would you add to CI so a future config, prompt, or dependency change cannot regress the protections described in 02-pii-protection-data-privacy.md?

Very Hard

  1. Assume the attacker adaptively probes the system for detector blind spots. How would you use redaction audit logs, deletion workflow traces, and multi-locale test sets to distinguish real resilience gains from attackers merely changing surface wording?
  2. What failure mode would still remain even after implementing regex plus NER detection, custom detectors, confidence routing, and data minimization, and what second-order mitigation would you add without blowing up latency or blocking legitimate shopping queries?