Skill 3.1.5: Advanced Adversarial Threat Detection

Task: Task 3.1
Goal: Detect prompt injection, jailbreaks, obfuscation, and emerging LLM-specific threats before they compromise the chatbot.

User Story

As an AI Security Engineer, I want MangaAssist to detect adversarial inputs and continuously test its own defenses so that prompt injection, jailbreak attempts, and security bypass patterns are found early and handled consistently.

Grounded Scenarios

Scenario	Why It Matters
A user inserts zero-width characters into "ignore previous instructions" to evade filters	Attackers rarely send clean benchmark strings
A malicious product review asks the model to reveal internal prompts when retrieved	Indirect prompt injection can arrive from trusted-looking data sources
A user pastes base64 or ROT13 text with a jailbreak instruction	Obfuscated attacks require canonicalization and anomaly detection

Deep-Dive Design

1. Canonicalization Pipeline

Before classification:

normalize whitespace and Unicode
detect unusual character distributions
identify encoded payload indicators
compute basic lexical anomaly features such as excessive delimiters or roleplay phrases

This converts many "novel" attacks into recognizable patterns.

2. Hybrid Detection Stack

Use three complementary detectors:

Signature rules for known attack phrases and prompt-leak templates
Safety classifiers for jailbreak probability, instruction conflict, and exfiltration intent
Anomaly scores for unfamiliar payload shapes, long-context flooding, or suspicious encoding

No single detector is reliable enough on its own.

3. Threat Scoring and Policy Actions

Combine detector outputs into a threat score that drives action:

low: log only
medium: allow but harden prompt and disable sensitive tools
high: block or safe-complete
critical: quarantine session, throttle, and alert security analysts

Sometimes we can still answer safely if we disable risky capabilities.

4. Retrieval and Tool Hardening

Threat detection should also inspect:

retrieved KB chunks
imported review content
tool outputs from order or support systems

If suspicious context is detected, tag it as untrusted and either drop it or wrap it in stronger instructions that forbid obedience to embedded directives.

5. Automated Adversarial Testing

Create a Step Functions workflow that runs nightly or per release:

replay known jailbreak corpora
mutate prompts with obfuscation techniques
inject malicious snippets into synthetic retrieval chunks
compare current detection performance against the previous release

This turns threat detection into a continuously tested system instead of a static ruleset.

Acceptance Criteria

Obfuscated prompt injection attempts are detected above an agreed recall target.
Suspicious retrieved context can be quarantined before prompt assembly.
High-risk sessions trigger throttling or restricted-mode execution.
Red-team regression suites run automatically on prompt, model, or policy changes.
Security analysts can inspect attack telemetry without accessing unnecessary customer data.

Signals and Metrics

jailbreak detection recall and precision
indirect prompt injection catch rate
number of attacks downgraded to safe mode instead of full block
median time to add a new attack signature
adversarial regression pass rate by release

Failure Modes and Tradeoffs

High false positives can punish power users with odd formatting. Mitigation: separate "risky style" from "harmful intent."
Attack evolution quickly outpaces static regexes. Mitigation: keep classifier and red-team loops active.
Detection without containment is incomplete. Mitigation: connect threat scores to tool disabling, safe mode, and analyst alerts.

Interview Takeaway

Advanced threat detection is part security engineering, part evaluation engineering. The mature design combines canonicalization, hybrid detectors, containment actions, and automated adversarial testing.