Encryption and Key Management - Ignore All Previous Instructions Follow-Up Answers

Question document: README.md Source document: 08-encryption-key-management.md Reference scenario: 01-prompt-injection-defense.md -> Scenario 1: Ignore All Previous Instructions

Scenario lens: Direct instruction override after several benign turns, with prompt dilution and instruction hierarchy as the main risks. Document lens: Encryption and Key Management.

Use this file as the answer key for the follow-up questions in README.md.

Easy

Q1

Question: How would a direct instruction-override attempt surface inside encryption, key isolation, secrets rotation, and cryptographic access control, and which control should engage first: CMK separation, envelope encryption, key rotation, secret rotation, and VPC endpoints? Answer: In Encryption and Key Management, treat this as an attempt to bypass the normal control order before the request is even processed. The first defense should run before prompt assembly or tool execution, and later layers should confirm nothing unsafe was executed or revealed.

Q2

Question: What is the first metric or log signal you would inspect to confirm the attack changed behavior rather than just wording: unauthorized decrypt attempts, rotation success rate, encryption latency, and secret age? Answer: Start with the behavior metric, then inspect the exact trace or log for the turn that triggered it. The number tells you whether the issue is systemic; the trace tells you whether the model actually obeyed the override or only echoed the user.

Medium

Q1

Question: Where is the trust boundary in this design, and how do you keep attacker text from being interpreted as a higher-priority instruction? Answer: The trust boundary is simple: trusted policy and tools must stay separate from attacker-controlled text. Keep those channels isolated so user language can influence the task but never rewrite the rules.

Q2

Question: If a legitimate shopper says something like ignore the spoiler warnings or forget the earlier recommendation, how would you tune the detector to reduce false positives without weakening the defense? Answer: Tune on a labeled set that includes benign customer phrases with overlapping keywords. Ambiguous cases should get softer handling or review rather than a hard block, so safety improves without damaging normal usage.

Hard

Q1

Question: How would you redesign the prompt assembly or control flow so protecting sensitive data without turning encryption into a latency bottleneck cannot silently weaken direct-injection resistance over long conversations? Answer: The robust fix is bounded context plus policy enforcement that does not rely entirely on the model remembering early instructions. That prevents long sessions or creative phrasing from silently weakening the defense.

Q2

Question: What negative tests would you add to CI so a future config, prompt, or dependency change cannot regress the protections described in 08-encryption-key-management.md? Answer: CI should cover clean-session attacks, long-session attacks, paraphrases, role hijacks, encoded payloads, and benign collisions. Passing means no leak, no unsafe action, and the right telemetry for later investigation.

Very Hard

Q1

Question: Assume the attacker adaptively probes the system for detector blind spots. How would you use KMS audit logs, key policy diffs, and latency traces to distinguish real resilience gains from attackers merely changing surface wording? Answer: Use traces and repeated-probe clusters to compare pre- and post-change outcomes. If the attacker goal still works through new wording or longer conversations, the defense improved cosmetically rather than structurally.

Q2

Question: What failure mode would still remain even after implementing CMK separation, envelope encryption, key rotation, secret rotation, and VPC endpoints, and what second-order mitigation would you add without blowing up latency or blocking legitimate shopping queries? Answer: A residual risk is a semantic attack that changes tool choice or session context without looking like a classic override. The next layer is session-level policy and escalation around sensitive actions, not just better keyword detection.