LOCAL PREVIEW View on GitHub

Skill 3.1.5: Advanced Adversarial Threat Detection

Task: Task 3.1
Goal: Detect prompt injection, jailbreaks, obfuscation, and emerging LLM-specific threats before they compromise the chatbot.

User Story

As an AI Security Engineer, I want MangaAssist to detect adversarial inputs and continuously test its own defenses so that prompt injection, jailbreak attempts, and security bypass patterns are found early and handled consistently.

Grounded Scenarios

Scenario Why It Matters
A user inserts zero-width characters into "ignore previous instructions" to evade filters Attackers rarely send clean benchmark strings
A malicious product review asks the model to reveal internal prompts when retrieved Indirect prompt injection can arrive from trusted-looking data sources
A user pastes base64 or ROT13 text with a jailbreak instruction Obfuscated attacks require canonicalization and anomaly detection

Deep-Dive Design

1. Canonicalization Pipeline

Before classification:

  • normalize whitespace and Unicode
  • detect unusual character distributions
  • identify encoded payload indicators
  • compute basic lexical anomaly features such as excessive delimiters or roleplay phrases

This converts many "novel" attacks into recognizable patterns.

2. Hybrid Detection Stack

Use three complementary detectors:

  1. Signature rules for known attack phrases and prompt-leak templates
  2. Safety classifiers for jailbreak probability, instruction conflict, and exfiltration intent
  3. Anomaly scores for unfamiliar payload shapes, long-context flooding, or suspicious encoding

No single detector is reliable enough on its own.

3. Threat Scoring and Policy Actions

Combine detector outputs into a threat score that drives action:

  • low: log only
  • medium: allow but harden prompt and disable sensitive tools
  • high: block or safe-complete
  • critical: quarantine session, throttle, and alert security analysts

Sometimes we can still answer safely if we disable risky capabilities.

4. Retrieval and Tool Hardening

Threat detection should also inspect:

  • retrieved KB chunks
  • imported review content
  • tool outputs from order or support systems

If suspicious context is detected, tag it as untrusted and either drop it or wrap it in stronger instructions that forbid obedience to embedded directives.

5. Automated Adversarial Testing

Create a Step Functions workflow that runs nightly or per release:

  • replay known jailbreak corpora
  • mutate prompts with obfuscation techniques
  • inject malicious snippets into synthetic retrieval chunks
  • compare current detection performance against the previous release

This turns threat detection into a continuously tested system instead of a static ruleset.

Acceptance Criteria

  • Obfuscated prompt injection attempts are detected above an agreed recall target.
  • Suspicious retrieved context can be quarantined before prompt assembly.
  • High-risk sessions trigger throttling or restricted-mode execution.
  • Red-team regression suites run automatically on prompt, model, or policy changes.
  • Security analysts can inspect attack telemetry without accessing unnecessary customer data.

Signals and Metrics

  • jailbreak detection recall and precision
  • indirect prompt injection catch rate
  • number of attacks downgraded to safe mode instead of full block
  • median time to add a new attack signature
  • adversarial regression pass rate by release

Failure Modes and Tradeoffs

  • High false positives can punish power users with odd formatting. Mitigation: separate "risky style" from "harmful intent."
  • Attack evolution quickly outpaces static regexes. Mitigation: keep classifier and red-team loops active.
  • Detection without containment is incomplete. Mitigation: connect threat scores to tool disabling, safe mode, and analyst alerts.

Interview Takeaway

Advanced threat detection is part security engineering, part evaluation engineering. The mature design combines canonicalization, hybrid detectors, containment actions, and automated adversarial testing.