Skill 3.1.5: Advanced Adversarial Threat Detection
Task: Task 3.1
Goal: Detect prompt injection, jailbreaks, obfuscation, and emerging LLM-specific threats before they compromise the chatbot.
User Story
As an AI Security Engineer, I want MangaAssist to detect adversarial inputs and continuously test its own defenses so that prompt injection, jailbreak attempts, and security bypass patterns are found early and handled consistently.
Grounded Scenarios
| Scenario | Why It Matters |
|---|---|
| A user inserts zero-width characters into "ignore previous instructions" to evade filters | Attackers rarely send clean benchmark strings |
| A malicious product review asks the model to reveal internal prompts when retrieved | Indirect prompt injection can arrive from trusted-looking data sources |
| A user pastes base64 or ROT13 text with a jailbreak instruction | Obfuscated attacks require canonicalization and anomaly detection |
Deep-Dive Design
1. Canonicalization Pipeline
Before classification:
- normalize whitespace and Unicode
- detect unusual character distributions
- identify encoded payload indicators
- compute basic lexical anomaly features such as excessive delimiters or roleplay phrases
This converts many "novel" attacks into recognizable patterns.
2. Hybrid Detection Stack
Use three complementary detectors:
- Signature rules for known attack phrases and prompt-leak templates
- Safety classifiers for jailbreak probability, instruction conflict, and exfiltration intent
- Anomaly scores for unfamiliar payload shapes, long-context flooding, or suspicious encoding
No single detector is reliable enough on its own.
3. Threat Scoring and Policy Actions
Combine detector outputs into a threat score that drives action:
- low: log only
- medium: allow but harden prompt and disable sensitive tools
- high: block or safe-complete
- critical: quarantine session, throttle, and alert security analysts
Sometimes we can still answer safely if we disable risky capabilities.
4. Retrieval and Tool Hardening
Threat detection should also inspect:
- retrieved KB chunks
- imported review content
- tool outputs from order or support systems
If suspicious context is detected, tag it as untrusted and either drop it or wrap it in stronger instructions that forbid obedience to embedded directives.
5. Automated Adversarial Testing
Create a Step Functions workflow that runs nightly or per release:
- replay known jailbreak corpora
- mutate prompts with obfuscation techniques
- inject malicious snippets into synthetic retrieval chunks
- compare current detection performance against the previous release
This turns threat detection into a continuously tested system instead of a static ruleset.
Acceptance Criteria
- Obfuscated prompt injection attempts are detected above an agreed recall target.
- Suspicious retrieved context can be quarantined before prompt assembly.
- High-risk sessions trigger throttling or restricted-mode execution.
- Red-team regression suites run automatically on prompt, model, or policy changes.
- Security analysts can inspect attack telemetry without accessing unnecessary customer data.
Signals and Metrics
- jailbreak detection recall and precision
- indirect prompt injection catch rate
- number of attacks downgraded to safe mode instead of full block
- median time to add a new attack signature
- adversarial regression pass rate by release
Failure Modes and Tradeoffs
- High false positives can punish power users with odd formatting. Mitigation: separate "risky style" from "harmful intent."
- Attack evolution quickly outpaces static regexes. Mitigation: keep classifier and red-team loops active.
- Detection without containment is incomplete. Mitigation: connect threat scores to tool disabling, safe mode, and analyst alerts.
Interview Takeaway
Advanced threat detection is part security engineering, part evaluation engineering. The mature design combines canonicalization, hybrid detectors, containment actions, and automated adversarial testing.