8. Encryption and Key Management
Encryption in MangaAssist is not a checkbox feature. It is the control set that limits blast radius when the application is wrong, when a role is over-permissioned, when logs are copied, when backups are restored, or when a downstream service gets compromised. The hard part is rarely "did we turn on AES-256?" The hard part is deciding where encryption belongs in the dataflow, which key should protect which asset, who is allowed to decrypt, how rotation happens without downtime, and how misuse gets detected before it becomes a breach.
This document goes deeper than the baseline architecture and answers seven engineering questions:
- What exactly is encrypted in MangaAssist, and which key protects each data class?
- How does the runtime use KMS without putting a network round trip on every field operation?
- What does the high-level design look like, including trust boundaries and control-plane ownership?
- What does the low-level implementation look like for envelope encryption, secrets retrieval, storage, and transit protection?
- How do planned rotation, emergency rotation, break-glass access, and disaster recovery actually work?
- What failure modes matter most in practice, and how do we monitor for them?
- What follow-up questions should you expect in a design review or interview, and what are strong deep-dive answers?
Why This Matters for MangaAssist
MangaAssist handles several different kinds of data, and they should not all be protected the same way:
| Data Class | Examples | Risk if Exposed | Encryption Pattern |
|---|---|---|---|
| Public product data | ASIN, title, author, format | Low direct privacy risk, moderate business integrity risk | Transport encryption only; no field-level encryption |
| Internal operational data | Session IDs, rate-limit counters, feature flags | Service abuse, internal recon | TLS + table/object encryption |
| Sensitive customer context | Name, email, shipping address, phone, order references | Direct privacy breach, fraud, legal exposure | Field-level envelope encryption + store-level encryption |
| Audit evidence | Prompt/response metadata, guardrail decisions, incident evidence | Tampering risk, forensic loss | Separate audit CMK + object lock + immutable retention |
| Secrets | API credentials, service auth material, webhook secrets | Credential theft, lateral movement | Secrets Manager or Parameter Store with dedicated CMK |
The architecture therefore has to answer three different questions at once:
- How do we keep data encrypted at rest and in transit?
- How do we stop broad decryption rights from spreading across the system?
- How do we preserve availability when rotation or incident response happens under load?
Design Goals
- Use different keys for different trust zones so one policy mistake does not expose every store.
- Keep the synchronous chat path fast enough for a responsive experience.
- Make decryption rare, explicit, logged, and attributable to a narrow runtime role.
- Separate key administration from data access so security admins do not automatically become data readers.
- Support routine rotation with no downtime and emergency containment with a clear blast-radius model.
- Ensure restore, replication, analytics, and deletion workflows remain possible after encryption is added.
Core Cryptography Concepts
| Term | Meaning | Why It Matters Here |
|---|---|---|
| CMK / KEK | Customer-managed KMS key that protects other keys or service-side encryption | Defines access policy, audit trail, and blast radius |
| DEK | Data encryption key generated for actual payload encryption | Used locally for fast AES-GCM operations |
| Envelope encryption | KMS protects the DEK, application uses the DEK for data | Avoids a KMS call for every payload block |
| Encryption context | Non-secret metadata bound to a KMS operation | Prevents a ciphertext from being reused in the wrong workflow |
| Alias | Stable name that points to a KMS key | Lets applications stay constant while key backing changes |
| Grant | Narrow delegated permission for a service to use a KMS key | Useful for managed services and controlled automation |
| Bucket key | S3 optimization that reduces repeated KMS calls | Important for high-volume audit logging |
Two design rules drive the rest of the document:
- KMS is the root of trust, not the bulk encryption engine.
- The application should only hold plaintext DEKs in memory, for a short time, in a narrow runtime.
Threat Model and Trust Boundaries
Main Failure Modes
| Failure Mode | Example | Impact | First Control |
|---|---|---|---|
| Over-broad decryption rights | Analytics role gets `kms:Decrypt` on PII key | Silent privacy breach | Separate PII CMK + role isolation |
| KMS on hot path everywhere | One KMS call per PII field | Latency spike, throttling, cost growth | Envelope encryption + cache limits |
| Logs store sensitive data before redaction | Application logs raw address or email | Wide operational exposure | Pre-log redaction + log group KMS |
| Backup restore misses key permissions | Data restores, app cannot decrypt | Operational outage during DR | Restore playbook includes KMS grants validation |
| Key compromise response is too slow | Suspicious decrypts continue for hours | Larger blast radius | CloudTrail alerting + role containment runbook |
| Cached DEK leaked from runtime memory | Compromised warm container | Limited plaintext exposure | Per-instance cache, TTL, usage caps, rapid drain |
| Key admin can also read data | Same team manages keys and app roles | Separation-of-duties failure | Distinct admin and decrypt roles |
Trust Boundary View
flowchart TB
subgraph Untrusted["Untrusted / User-Controlled"]
User[User message]
Browser[Web or mobile client]
Hist[Conversation text]
end
subgraph App["Application Decision Layer"]
Gateway[API Gateway]
Orch[Chat orchestrator]
Guard[Guardrails and privacy policy]
PII[PII encryption handler]
SecretCache[Secret cache]
end
subgraph Crypto["Cryptographic Control Plane"]
KMS[AWS KMS CMKs]
SM[Secrets Manager]
IAM[IAM roles and key policies]
CT[CloudTrail]
end
subgraph Stores["Persistent Stores"]
DDB[DynamoDB session store]
S3[S3 audit archive]
OS[OpenSearch index]
CW[CloudWatch logs]
end
subgraph SecOps["Security Operations"]
Alerts[EventBridge and alerts]
SIEM[Security analytics]
end
Browser --> Gateway
Hist --> Orch
User --> Gateway
Gateway --> Orch
Orch --> Guard
Guard --> PII
Orch --> SecretCache
SecretCache --> SM
PII --> KMS
SM --> KMS
DDB --> KMS
S3 --> KMS
OS --> KMS
CW --> KMS
IAM --> KMS
CT --> Alerts
Alerts --> SIEM
Key boundary rule: raw sensitive fields may exist ephemerally in memory when needed to serve the request, but persistence boundaries should receive either redacted values or encrypted fields plus enough metadata to decrypt them later in a controlled path.
Control Matrix
| Asset | Store | At-Rest Control | Field-Level Control | In-Transit Control | Decrypting Runtime | Primary Key |
|---|---|---|---|---|---|---|
| Session metadata | DynamoDB | SSE-KMS | None | TLS over VPC endpoint | Chat orchestrator | alias/mangaassist-app |
| Sensitive chat fields | DynamoDB | SSE-KMS | Envelope encryption with AES-256-GCM | TLS over VPC endpoint | PII handler only | alias/mangaassist-pii |
| Audit evidence | S3 + CloudWatch | SSE-KMS + object lock | Optional field tokenization only | TLS + VPC endpoint | Security investigator role | alias/mangaassist-audit |
| Search index fragments | OpenSearch | SSE-KMS + node-to-node encryption | Do not index raw PII | HTTPS enforced | Search service | alias/mangaassist-app |
| Secrets | Secrets Manager / SSM | SSE-KMS | N/A | TLS + SigV4 | Service runtime through secrets API | alias/mangaassist-secrets |
| Cache entries | ElastiCache | At-rest encryption | No raw PII cache | In-transit encryption | Private app subnets only | alias/mangaassist-app |
Two important consequences fall out of this table:
- Some data is protected twice: once by the store and once at the field level.
- Search and analytics are intentionally designed to avoid decrypting raw PII whenever possible.
High-Level Design (HLD)
System Overview
graph TB
subgraph Client["Client and Edge"]
U[User]
FE[Web or mobile client]
GW[API Gateway]
end
subgraph Runtime["MangaAssist Runtime"]
AUTH[Auth and session]
ORCH[Chat orchestrator]
GRD[Guardrails and privacy policy]
PII[PII encryption service]
AUD[Async audit writer]
end
subgraph AI["Model and Service Layer"]
LLM[Bedrock model]
CAT[Catalog and order services]
REC[Recommendation services]
end
subgraph Data["Persistent Stores"]
DDB[DynamoDB sessions]
S3[S3 audit archive]
OS[OpenSearch]
CW[CloudWatch logs]
SEC[Secrets Manager]
end
subgraph Control["Crypto and Security Control Plane"]
KMS[KMS key ring]
IAM[IAM and key policies]
CT[CloudTrail]
EVT[EventBridge alerts]
end
subgraph DR["Recovery"]
DRS3[Replica audit bucket]
DRKMS[DR region keys]
end
U --> FE --> GW
GW --> AUTH
GW --> ORCH
AUTH --> ORCH
ORCH --> GRD
GRD --> LLM
ORCH --> CAT
ORCH --> REC
ORCH --> PII
ORCH --> DDB
ORCH --> OS
ORCH --> AUD
AUD --> S3
ORCH --> CW
ORCH --> SEC
PII --> KMS
DDB --> KMS
S3 --> KMS
OS --> KMS
CW --> KMS
SEC --> KMS
IAM --> KMS
CT --> EVT
S3 --> DRS3
DRS3 --> DRKMS
HLD Principles
- The data plane and key control plane are separate. The app can request cryptographic operations, but policy authority remains in KMS and IAM.
- PII decryption is concentrated in one narrow runtime path instead of being spread across the orchestrator, analytics, indexing, and logging pipelines.
- Managed store encryption is always enabled, but it is not treated as a substitute for field-level protection of high-sensitivity data.
- Aliases give stable integration points for applications, while underlying keys can rotate or be replaced.
- Audit evidence uses a different key from the chat session store because forensic access patterns and retention rules are different.
- Disaster recovery planning includes keys, grants, aliases, and restore scripts, not just data copies.
Key Hierarchy
| Key | Alias | Protected Assets | Rotation Style | Who Can Administer | Who Can Decrypt |
|---|---|---|---|---|---|
| App key | `alias/mangaassist-app` | DynamoDB table SSE, OpenSearch, cache, low-sensitivity app data | KMS automatic annual rotation | Security platform | Managed services and orchestrator runtime where needed |
| PII key | `alias/mangaassist-pii` | Field-level encrypted names, emails, addresses, phone numbers, order references | KMS automatic annual rotation; emergency alias swap if incident | Security platform only | PII handler runtime, break-glass investigator role |
| Audit key | `alias/mangaassist-audit` | S3 audit archive, immutable evidence, security log groups | KMS automatic annual rotation; cross-region replica pairing | Security platform | Audit reader or incident responder only |
| Secrets key | `alias/mangaassist-secrets` | Secrets Manager, secure parameters, rotation metadata | KMS automatic annual rotation | Security platform | Secrets Manager service path and approved runtimes |
The major design choice is that alias/mangaassist-pii is not reused for logs, search, or secrets. That keeps the highest-risk decrypt path isolated.
Scenario 1: Normal Request Dataflow
The standard request path matters because encryption design is only useful if it fits the latency budget.
sequenceDiagram
participant User
participant Gateway
participant Orch as Orchestrator
participant Guard as Guardrails
participant Secrets as Secrets Manager
participant KMS
participant PII as PII Encryptor
participant DDB
participant Audit as Audit Writer
participant S3
User->>Gateway: Chat message
Gateway->>Orch: Authenticated request
Orch->>Secrets: Get downstream credentials if cache miss
Secrets->>KMS: Decrypt secret version
KMS-->>Secrets: Plaintext secret in service memory
Secrets-->>Orch: Secret value
Orch->>Guard: Build prompt with approved fields only
Guard-->>Orch: Safe model request
Orch->>PII: Encrypt fields that may persist
PII->>KMS: GenerateDataKey(classification=pii, purpose=chat_storage)
KMS-->>PII: Plaintext DEK + encrypted DEK
PII-->>Orch: Ciphertext payload + metadata
Orch->>DDB: Write session item with field ciphertext
Orch-->>Gateway: User response
Gateway-->>User: Final response
Orch->>Audit: Async audit event
Audit->>S3: Store immutable audit object with SSE-KMS
Synchronous vs Asynchronous Split
| Path Segment | Why It Stays on the Hot Path | Why It Might Move Off the Hot Path |
|---|---|---|
| Secret retrieval | Needed to call downstream APIs | Use in-memory secret cache to reduce repeated reads |
| Field encryption before persistence | Required before writing sensitive data | Keep local AES work in process; avoid repeated KMS calls |
| Audit archival | Not needed to answer the user | Push to async writer and S3 |
| Re-encryption migrations | Correctness, not immediate response | Always background |
The big practical lesson is that encryption belongs close to persistence boundaries, not in the middle of every intermediate transformation stage.
Low-Level Design (LLD)
1. Envelope Encryption for PII Fields
Why We Use Envelope Encryption
Directly calling KMS to encrypt every field is a poor fit for chat workloads:
- KMS adds network latency.
- KMS requests are metered and can throttle.
- KMS has request-size limits.
- The app often needs to encrypt many small fields in a single request.
Envelope encryption fixes that by generating a DEK once and using local AES-GCM for the actual payload.
LLD Flow
sequenceDiagram
participant App
participant Cache as DEK Cache
participant KMS
participant DDB
App->>Cache: Request DEK for pii/session_turn
alt Cache hit and usage budget remains
Cache-->>App: Plaintext DEK + encrypted DEK
else Cache miss or expired
App->>KMS: GenerateDataKey(AES_256, encryption context)
KMS-->>App: Plaintext DEK + encrypted DEK
App->>Cache: Store DEK in memory only
end
App->>App: AES-GCM encrypt field with nonce + AAD
App->>DDB: Store ciphertext + encrypted DEK + metadata
Encryption Context Contract
We bind every KMS call to an encryption context so ciphertext cannot be trivially reused in a different workflow:
| Context Key | Example | Purpose |
|---|---|---|
| `app` | `mangaassist` | Prevent cross-application confusion |
| `classification` | `pii` | Distinguish sensitive data from generic app state |
| `record_type` | `session_turn` | Limit reuse across entity types |
| `purpose` | `chat_storage` | Separate storage from incident investigation or export |
| `tenant_scope` | `jp-store` | Useful if the system expands across stores or locales |
Encrypted Field Shape
{
"alg": "AES-256-GCM",
"key_alias": "alias/mangaassist-pii",
"encrypted_data_key": "base64-kms-ciphertext",
"nonce": "base64-12-byte-nonce",
"ciphertext": "base64-aes-gcm-output",
"aad_sha256": "hex-digest-of-associated-data",
"encryption_context": {
"app": "mangaassist",
"classification": "pii",
"record_type": "session_turn",
"purpose": "chat_storage",
"tenant_scope": "jp-store"
},
"created_at": "2026-03-24T20:15:00Z"
}
Example Implementation
import base64
import hashlib
import json
import os
import time
from dataclasses import dataclass
import boto3
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
@dataclass
class CachedDataKey:
plaintext_key: bytes
encrypted_key: bytes
expires_at: float
uses_left: int
class PIIFieldEncryptor:
def __init__(self, kms_key_id: str, ttl_seconds: int = 900, max_uses: int = 1000):
self.kms = boto3.client("kms")
self.kms_key_id = kms_key_id
self.ttl_seconds = ttl_seconds
self.max_uses = max_uses
self._cache: dict[str, CachedDataKey] = {}
def _context(self, tenant_scope: str, record_type: str, purpose: str) -> dict[str, str]:
return {
"app": "mangaassist",
"classification": "pii",
"tenant_scope": tenant_scope,
"record_type": record_type,
"purpose": purpose,
}
def _cache_key(self, tenant_scope: str, record_type: str, purpose: str) -> str:
return f"{tenant_scope}:{record_type}:{purpose}"
def _get_data_key(self, tenant_scope: str, record_type: str, purpose: str) -> CachedDataKey:
now = time.time()
key = self._cache_key(tenant_scope, record_type, purpose)
entry = self._cache.get(key)
if entry and entry.expires_at > now and entry.uses_left > 0:
entry.uses_left -= 1
return entry
response = self.kms.generate_data_key(
KeyId=self.kms_key_id,
KeySpec="AES_256",
EncryptionContext=self._context(tenant_scope, record_type, purpose),
)
entry = CachedDataKey(
plaintext_key=response["Plaintext"],
encrypted_key=response["CiphertextBlob"],
expires_at=now + self.ttl_seconds,
uses_left=self.max_uses - 1,
)
self._cache[key] = entry
return entry
def encrypt_field(self, plaintext: str, tenant_scope: str, record_id: str) -> dict:
record_type = "session_turn"
purpose = "chat_storage"
context = self._context(tenant_scope, record_type, purpose)
entry = self._get_data_key(tenant_scope, record_type, purpose)
aad = json.dumps(
{
"record_id": record_id,
"tenant_scope": tenant_scope,
"record_type": record_type,
},
separators=(",", ":"),
sort_keys=True,
).encode()
nonce = os.urandom(12)
ciphertext = AESGCM(entry.plaintext_key).encrypt(nonce, plaintext.encode(), aad)
return {
"alg": "AES-256-GCM",
"key_alias": self.kms_key_id,
"encrypted_data_key": base64.b64encode(entry.encrypted_key).decode(),
"nonce": base64.b64encode(nonce).decode(),
"ciphertext": base64.b64encode(ciphertext).decode(),
"aad_sha256": hashlib.sha256(aad).hexdigest(),
"encryption_context": context,
"record_id": record_id,
"created_at": int(time.time()),
}
def decrypt_field(self, encrypted_blob: dict, tenant_scope: str, record_id: str) -> str:
context = encrypted_blob["encryption_context"]
aad = json.dumps(
{
"record_id": record_id,
"tenant_scope": tenant_scope,
"record_type": "session_turn",
},
separators=(",", ":"),
sort_keys=True,
).encode()
response = self.kms.decrypt(
CiphertextBlob=base64.b64decode(encrypted_blob["encrypted_data_key"]),
EncryptionContext=context,
)
plaintext_key = response["Plaintext"]
plaintext = AESGCM(plaintext_key).decrypt(
base64.b64decode(encrypted_blob["nonce"]),
base64.b64decode(encrypted_blob["ciphertext"]),
aad,
)
return plaintext.decode()
Implementation notes:
- AES-GCM gives confidentiality plus integrity.
- AAD binds the ciphertext to a record identifier so payload swapping becomes detectable.
- The cache is in memory only. It is never shared across Lambda instances and never written to Redis or disk.
- The app limits both time-based reuse and count-based reuse of a cached DEK.
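The AAD binding depends on byte-for-byte determinism: encrypt and decrypt must serialize the associated data identically, or GCM verification fails. A stdlib-only sketch of the canonicalization used above (the helper name `canonical_aad` is illustrative):

```python
import hashlib
import json

def canonical_aad(record_id: str, tenant_scope: str, record_type: str) -> bytes:
    # Sorted keys and compact separators make the byte sequence deterministic,
    # so encrypt-time and decrypt-time AAD match exactly.
    return json.dumps(
        {"record_id": record_id, "tenant_scope": tenant_scope, "record_type": record_type},
        separators=(",", ":"),
        sort_keys=True,
    ).encode()

a = canonical_aad("turn-123", "jp-store", "session_turn")
b = canonical_aad("turn-123", "jp-store", "session_turn")
assert a == b  # identical bytes regardless of dict insertion order
digest = hashlib.sha256(a).hexdigest()  # what gets stored in aad_sha256
```

If any caller serialized with default separators or unsorted keys, decryption of otherwise valid ciphertext would fail, so the serialization rules belong in one shared helper.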
DEK Cache Policy
| Parameter | Value | Why |
|---|---|---|
| Cache scope | Per warm runtime instance | Limits horizontal blast radius |
| TTL | 15 minutes | Low enough for containment, high enough to avoid KMS on every request |
| Max uses | 1,000 encryptions | Caps exposure if memory is compromised |
| Storage medium | Process memory only | No persistent plaintext key material |
| Cold start behavior | Fresh DEK request | Avoids stale or cross-instance reuse |
| Emergency mode | TTL forced to 0 | Lets security disable reuse during incident containment |
The exact values can move, but the policy needs both a time bound and a usage bound. Time alone is not enough during bursts, and usage alone is not enough during long-lived warm runtimes.
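The dual bound can be expressed as a small predicate. This is a stdlib-only sketch; the names `CachePolicy` and `is_usable` are illustrative, not part of the earlier class:

```python
from dataclasses import dataclass

@dataclass
class CachePolicy:
    ttl_seconds: float = 900.0   # time bound: 15 minutes
    max_uses: int = 1000         # usage bound: caps exposure per DEK

    def is_usable(self, created_at: float, uses: int, now: float) -> bool:
        # A cached DEK is reusable only while BOTH bounds hold.
        # Emergency mode sets ttl_seconds to 0, which fails every reuse check.
        return (now - created_at) < self.ttl_seconds and uses < self.max_uses

policy = CachePolicy()
assert policy.is_usable(created_at=0.0, uses=10, now=60.0)        # fresh and under budget
assert not policy.is_usable(created_at=0.0, uses=10, now=1000.0)  # expired by TTL
assert not policy.is_usable(created_at=0.0, uses=1000, now=60.0)  # usage budget exhausted
assert not CachePolicy(ttl_seconds=0).is_usable(0.0, 0, now=0.0)  # emergency mode: no reuse
```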
2. Store-Level Encryption by Service
DynamoDB Session Store
DynamoDB table encryption is mandatory, but it is not the whole story. Sensitive fields are already field-encrypted before the item is written.
Resources:
SessionsTable:
Type: AWS::DynamoDB::Table
Properties:
TableName: mangaassist-sessions
BillingMode: PAY_PER_REQUEST
SSESpecification:
SSEEnabled: true
SSEType: KMS
KMSMasterKeyId: alias/mangaassist-app
Recommended item pattern:
- top-level non-sensitive attributes stay queryable
- sensitive fields are moved into an `encrypted_fields` map
- the app stores classification metadata so retention and deletion workflows know what exists without decrypting it
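A hypothetical item following that pattern (attribute names are illustrative, not the production schema):

```python
# Hypothetical session item: queryable metadata at the top level,
# sensitive values confined to an encrypted_fields map.
session_item = {
    "session_id": "sess-8f2a",            # partition key, non-sensitive
    "turn_number": 3,
    "created_at": "2026-03-24T20:15:00Z",
    "intent": "order_status",             # safe, queryable metadata
    "encrypted_fields": {
        "shipping_address": {
            "alg": "AES-256-GCM",
            "key_alias": "alias/mangaassist-pii",
            "encrypted_data_key": "<base64>",
            "nonce": "<base64>",
            "ciphertext": "<base64>",
        },
    },
    # Classification metadata lets retention and deletion run without decrypting.
    "field_classifications": {"shipping_address": "pii"},
}

top_level = set(session_item) - {"encrypted_fields", "field_classifications"}
assert "shipping_address" not in top_level  # no raw PII outside the encrypted map
```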
S3 Audit Archive
Audit evidence has different requirements from chat history:
- it must be hard to tamper with
- it must be readable only by a narrow investigation path
- it often lives longer than session memory
Resources:
AuditBucket:
Type: AWS::S3::Bucket
Properties:
BucketName: mangaassist-audit-logs
ObjectLockEnabled: true
BucketEncryption:
ServerSideEncryptionConfiguration:
- ServerSideEncryptionByDefault:
SSEAlgorithm: aws:kms
KMSMasterKeyID: alias/mangaassist-audit
BucketKeyEnabled: true
Key S3 controls:
- deny `PutObject` unless `aws:kms` is used
- deny writes without the expected audit key
- enable object lock or retention policy for evidence classes
- replicate to DR with destination-region key mapping
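One way to express the first two controls is a bucket policy that rejects writes not using SSE-KMS with the audit key. This is a sketch: the account ID and key ARN are placeholders, and the conditions use the standard `s3:x-amz-server-side-encryption` request-header keys. Note that it also denies uploads that omit the header and rely on default bucket encryption, which matches the "deny unless aws:kms is used" intent here but may need relaxing elsewhere.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyNonKMSWrites",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::mangaassist-audit-logs/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "aws:kms"
        }
      }
    },
    {
      "Sid": "DenyWrongKeyWrites",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::mangaassist-audit-logs/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption-aws-kms-key-id": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID"
        }
      }
    }
  ]
}
```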
OpenSearch
OpenSearch should not become a side door around encryption:
- domain-level SSE-KMS enabled
- node-to-node encryption enabled
- fine-grained access control on index patterns
- raw PII not indexed
If search on a sensitive identifier is unavoidable, store a blind index instead of plaintext. For example:
- store `email_lookup_hmac = HMAC_SHA256(separate_lookup_key, normalized_email)`
- search by exact HMAC match
- decrypt only the matching records later through the PII handler
This keeps exact-match lookups possible without turning the index into a plaintext leak.
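A stdlib-only sketch of the blind-index computation. The normalization rules and key handling here are illustrative; the lookup key must be distinct from the PII CMK and fetched from Secrets Manager, not hardcoded:

```python
import hashlib
import hmac

def email_blind_index(lookup_key: bytes, email: str) -> str:
    # Normalize first so "User@Example.com " and "user@example.com" collide on purpose.
    normalized = email.strip().lower()
    # HMAC (not a bare hash) so an attacker holding the index cannot brute-force
    # candidate emails without also holding the lookup key.
    return hmac.new(lookup_key, normalized.encode(), hashlib.sha256).hexdigest()

key = b"example-lookup-key"  # placeholder; load from Secrets Manager in practice
a = email_blind_index(key, "User@Example.com ")
b = email_blind_index(key, "user@example.com")
assert a == b        # exact-match lookups survive formatting differences
assert len(a) == 64  # hex SHA-256 digest; safe to store in the search index
```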
CloudWatch Logs
Application logs should be redacted before emission, but the log groups still use KMS because logs often contain operational metadata, error payloads, and incident breadcrumbs.
Resources:
AppLogGroup:
Type: AWS::Logs::LogGroup
Properties:
LogGroupName: /aws/lambda/mangaassist-orchestrator
KmsKeyId: alias/mangaassist-audit
RetentionInDays: 30
Secrets Manager and Secure Parameters
Secrets are not configuration by another name. They have different rotation and access semantics.
from aws_secretsmanager_caching import SecretCache, SecretCacheConfig
import boto3
secrets_client = boto3.client("secretsmanager")
cache = SecretCache(
config=SecretCacheConfig(max_cache_size=128, default_ttl=300),
client=secrets_client,
)
def get_downstream_secret(secret_id: str) -> str:
return cache.get_secret_string(secret_id)
The critical policy point is that application roles read secrets through the Secrets Manager API. They do not need broad kms:Decrypt on the secrets key outside that service path.
3. Transit Protection
Network Path View
flowchart LR
subgraph VPC["Private VPC"]
Lambda[Orchestrator / PII handler]
VPCEKMS[KMS VPC endpoint]
VPCES3[S3 VPC endpoint]
VPCEDDB[DynamoDB VPC endpoint]
VPCEBR[Bedrock VPC endpoint]
end
Lambda --> VPCEKMS --> KMS[KMS]
Lambda --> VPCES3 --> S3[S3]
Lambda --> VPCEDDB --> DDB[DynamoDB]
Lambda --> VPCEBR --> BR[Bedrock]
Transit Controls by Connection
| Connection | Primary Controls | Notes |
|---|---|---|
| User to API Gateway | TLS 1.2+ | Browser and mobile clients terminate at the edge |
| API Gateway to Lambda | AWS-managed internal transport | Not user-visible; still within AWS boundary |
| Lambda to KMS | TLS + private VPC endpoint + SigV4 | No public internet path |
| Lambda to DynamoDB/S3 | TLS + gateway endpoint | Lower latency and smaller exposure surface |
| Lambda to internal platform API | TLS + mTLS if platform-owned service | Needed only where we control both endpoints |
| OpenSearch client to domain | HTTPS enforced, TLS policy pinned | VPC-only endpoint preferred |
Transit encryption by itself is not enough, but it matters for two reasons:
- it prevents easy interception of payloads in motion
- private endpoints reduce the reachable network surface even when TLS is already present
4. IAM and Key Policy Pattern
The most important access-control design is not "allow app role to use KMS." It is "allow only the exact runtime that needs this key, under the exact context that proves the intent."
Runtime Roles
| Role | Allowed Crypto Actions | Explicitly Not Allowed |
|---|---|---|
| `mangaassist-orchestrator-role` | Read app secrets through Secrets Manager, use store-side encryption indirectly | Direct decrypt of PII ciphertext |
| `mangaassist-pii-handler-role` | `GenerateDataKey`, `Decrypt` on PII key with context restrictions | Schedule deletion, disable key, read audit key |
| `mangaassist-audit-writer-role` | Put audit objects, use audit key for write path | Read audit objects, decrypt PII |
| `mangaassist-security-investigator-role` | Break-glass decrypt on PII or audit key with MFA and incident tag | Routine application use |
| `mangaassist-kms-admin-role` | Manage policies, rotation, aliases | Decrypt data payloads |
Restrictive PII Key Policy Pattern
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowPIIHandlerEncryptDecrypt",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:role/mangaassist-pii-handler-role"
},
"Action": [
"kms:GenerateDataKey",
"kms:Decrypt",
"kms:DescribeKey"
],
"Resource": "*",
"Condition": {
"StringEquals": {
"kms:EncryptionContext:app": "mangaassist",
"kms:EncryptionContext:classification": "pii",
"kms:EncryptionContext:purpose": "chat_storage"
}
}
},
{
"Sid": "AllowBreakGlassDecryptWithMFAAndIncidentTag",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:role/mangaassist-security-investigator-role"
},
"Action": [
"kms:Decrypt",
"kms:DescribeKey"
],
"Resource": "*",
"Condition": {
"Bool": {
"aws:MultiFactorAuthPresent": "true"
},
"StringEquals": {
"aws:PrincipalTag/incident_approved": "true"
}
}
},
{
"Sid": "AllowKMSAdminsToManageButNotDecrypt",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:role/mangaassist-kms-admin-role"
},
"Action": [
"kms:CreateAlias",
"kms:UpdateAlias",
"kms:EnableKeyRotation",
"kms:PutKeyPolicy",
"kms:TagResource",
"kms:UntagResource",
"kms:DescribeKey"
],
"Resource": "*"
},
{
"Sid": "DenyDisableOrDeleteWithoutDedicatedWorkflow",
"Effect": "Deny",
"Principal": "*",
"Action": [
"kms:DisableKey",
"kms:ScheduleKeyDeletion"
],
"Resource": "*"
}
]
}
Why this matters:
- the app runtime gets only the minimum data-plane actions
- break-glass access becomes explicit and reviewable
- key administration does not automatically imply data visibility
Performance and Hot-Path Tuning
Encryption is only a good design if it keeps the product usable. The hot path changed after field-level PII encryption was introduced.
flowchart LR
subgraph Before["Before optimization"]
B1[Input] --> B2[PII detect]
B2 --> B3[Encrypt PII inline]
B3 --> B4[Guardrails]
B4 --> B5[Persist]
B5 --> B6[Respond]
end
subgraph After["After optimization"]
A1[Input] --> A2[PII detect]
A2 --> A3[Guardrails]
A3 --> A4[Respond]
A4 --> A5[Async encrypt and persist]
end
What Changed
| Stage | Before | After | Why It Improved |
|---|---|---|---|
| PII field encryption | Direct inline work for each sensitive write | Local AES-GCM with cached DEK | Fewer KMS calls |
| Persistence timing | Synchronous for all writes | Async for audit-heavy writes | Response no longer waits on long-tail persistence |
| KMS dependency | High fan-out | Controlled and cached | Lower throttle sensitivity |
The implementation rule is simple: encrypt before persistence, but not necessarily before the user gets a response if the persistence path itself can be made asynchronous and reliable.
Rotation, Revocation, and Lifecycle
Lifecycle View
stateDiagram-v2
[*] --> Created
Created --> Reviewed: policy and ownership validated
Reviewed --> Enabled
Enabled --> Rotating: automatic rotation or alias migration
Rotating --> Enabled
Enabled --> Restricted: incident containment
Restricted --> Enabled: risk cleared
Enabled --> Retired: no new encrypts, old decrypt only
Retired --> PendingDeletion: all ciphertext migrated or expired
PendingDeletion --> [*]
Three Different "Rotation" Operations
| Operation | What Changes | Key ID Changes | Requires Re-encryption | Typical Use |
|---|---|---|---|---|
| Automatic KMS rotation | Backing key material | No | No | Routine annual rotation |
| Alias swap to new CMK | Underlying CMK behind alias | Yes | Eventually, for old ciphertext if key retirement is required | Policy issue, ownership change, containment |
| Full re-encryption | Ciphertext rewritten with new DEKs and key | Usually yes | Yes | Confirmed compromise or migration |
This distinction is one of the most common interview follow-ups. A lot of weak answers treat all three as the same thing. They are not.
Scenario 2: Planned Annual Rotation Without Downtime
Routine rotation is intentionally boring.
sequenceDiagram
participant Sec as Security Platform
participant KMS
participant App as Application
participant Data as Existing Ciphertext
Sec->>KMS: Enable automatic rotation on CMK
KMS-->>Sec: Rotation schedule active
App->>KMS: Encrypt new data with same key ID
KMS-->>App: New backing material used
App->>KMS: Decrypt older ciphertext
KMS->>Data: Match older backing material internally
KMS-->>App: Plaintext returned
Operationally:
- Rotate in staging first and validate reads of pre-rotation ciphertext.
- Verify new writes still succeed with no code change.
- Confirm no custom code incorrectly pins a key version or key ARN that bypasses the alias.
- Enable automatic rotation in production.
Because the key ID does not change, the application and stores keep working. That is why routine rotation is low risk.
Scenario 3: Emergency Alias Swap After Suspected Key Misuse
Emergency rotation is not the same as annual rotation.
flowchart TD
Alert[Suspicious decrypt signal] --> Triage{False positive?}
Triage -->|Yes| Close[Close incident]
Triage -->|No| NewKey[Create new CMK]
NewKey --> Alias[Move alias to new CMK for future encrypts]
Alias --> Contain[Restrict old key to decrypt-only path]
Contain --> Migrate[Background re-encrypt high-risk records]
Migrate --> Validate[Validate no active writers use old key]
Validate --> Retire[Retire old key after evidence window]
Important nuance:
- new writes start using the new CMK immediately after the alias move
- old ciphertext still needs the old key to decrypt until migration finishes
- deleting or disabling the old key too early turns the incident into a self-inflicted outage
Scenario 4: Cached DEK Exposure in a Warm Runtime
This is a more realistic threat than "KMS master key stolen." The likely problem is a compromised process or role, not KMS itself.
flowchart LR
Detect[GuardDuty / EDR / abnormal behavior] --> Freeze[Set reserved concurrency to 0 or drain workers]
Freeze --> Revoke[Revoke role path or isolate function version]
Revoke --> Flush[Flush in-memory DEK caches by replacing runtimes]
Flush --> Harden[Set DEK cache TTL to 0 temporarily]
Harden --> Assess[Estimate records encrypted during exposure window]
Assess --> Reencrypt[Re-encrypt affected records if needed]
Reencrypt --> Restore[Restore normal traffic]
Key insight: automatic CMK rotation does not help if the leaked asset is a plaintext DEK already in memory. The right response is runtime containment plus selective re-encryption of data written under the exposed DEK window.
Disaster Recovery and Restore Design
Encryption adds a second restore dependency: you need the data and the ability to decrypt it in the recovery environment.
DR Dataflow
sequenceDiagram
participant Primary as Primary Region
participant PKMS as Primary Keys
participant Replica as DR Region Bucket
participant DKMS as DR Keys
participant Restore as Restore Workflow
Primary->>Replica: Replicate audit object or backup
PKMS-->>Primary: Source encryption path
Replica->>DKMS: Encrypt replica with destination-region key
Restore->>DKMS: Validate decrypt grants
Restore->>Replica: Read backup
Replica-->>Restore: Encrypted object
Restore->>DKMS: Decrypt in DR environment
DKMS-->>Restore: Plaintext for controlled restore flow
Recommended DR rules:
- Do not assume copied data is useful until decryption and alias mapping are tested.
- Keep restore automation in a separate account or pipeline from the normal app path.
- Test a real restore at least quarterly, including key grants and break-glass approvals.
- Record which ciphertext classes stay decryptable in DR and which are intentionally region-bound.
If the system later requires active-active multi-region chat traffic, this design can evolve toward multi-Region keys for specific replicated datasets. Until then, active-passive is simpler and easier to reason about.
Searching and Analytics Without Broad Decryption
A common mistake is adding strong encryption in storage and then undoing it in analytics by decrypting entire tables into a data lake.
Safer Patterns
| Need | Pattern | What We Avoid |
|---|---|---|
| Count PII detections | Use guardrail metadata counters | Decrypting raw conversations for reporting |
| Find a record by exact email | Store HMAC blind index using separate lookup key | Indexing plaintext email in OpenSearch |
| Investigate a single customer issue | Break-glass decrypt of just that record path | Exporting large decrypted datasets |
| Train operational metrics | Use redacted or tokenized fields | Copying full sensitive payloads into analytics |
Principle: analytics should operate on metadata, aggregates, tokens, or blind indexes. Decryption is for support or investigation workflows, not for routine dashboards.
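The blind-index row above can be sketched in a few lines. This is an illustrative stand-in, not the production implementation: the key material, normalization rules, and function names are assumptions, and the real lookup key would come from Secrets Manager or KMS and be distinct from any data-encryption key.

```python
import hashlib
import hmac

# Hypothetical lookup key; never hard-code this in a real system, and keep it
# separate from the keys that encrypt the field values themselves.
LOOKUP_KEY = b"example-lookup-key-from-secrets-manager"

def blind_index(value: str) -> str:
    """Deterministic HMAC-SHA256 blind index for exact-match lookup.

    Normalization (trimming, lowercasing) must be identical at write time
    and query time, or equal emails will produce different index values.
    """
    normalized = value.strip().lower()
    return hmac.new(
        LOOKUP_KEY, normalized.encode("utf-8"), hashlib.sha256
    ).hexdigest()

# Same email, different formatting -> same index value
assert blind_index("Alice@Example.com ") == blind_index("alice@example.com")
# Different emails -> different index values
assert blind_index("alice@example.com") != blind_index("bob@example.com")
```

The index value is stored alongside the ciphertext and queried for equality; it supports exact match only, which is precisely the tradeoff recorded in the table above.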
Monitoring, Audit, and Detection
Encryption without observability creates a false sense of safety. We monitor both success-path health and misuse-path signals.
| Signal | Source | Why It Matters | Trigger Example |
|---|---|---|---|
| `kms:Decrypt` AccessDenied on PII key | CloudTrail | Someone tried to read what they should not read | Any principal other than PII handler or break-glass role |
| Sudden spike in `GenerateDataKey` calls | CloudTrail / CloudWatch | Cache disabled, rollout bug, or abuse | 5x baseline for 10 minutes |
| KMS throttling | CloudWatch metrics | Hot path latency and request failures | P95 KMS latency above threshold |
| `DisableKey` or alias update event | CloudTrail | High-risk control-plane change | Any change outside approved pipeline window |
| Unencrypted or wrong-key S3 writes | S3 + CloudTrail | Audit evidence control drift | `PutObject` without expected SSE-KMS headers |
| Secret rotation failure | Secrets Manager | Aging secrets or broken clients | Rotation step fails twice |
| DEK cache miss ratio spike | App metrics | Performance regression or forced incident mode | Cache miss ratio doubles unexpectedly |
Example Detection Pipeline
```mermaid
flowchart LR
    CT[CloudTrail KMS events] --> EB[EventBridge rules]
    EB --> Lambda[Security detection Lambda]
    Lambda --> Pager[Pager / ticket]
    Lambda --> SIEM[SIEM timeline]
    AppMetrics[Application metrics] --> Alarm[CloudWatch alarm]
    Alarm --> Pager
```
Two rules matter a lot in practice:
- denied decrypt attempts are alerts, not "all good" signals
- alias changes are treated as security events even when they are legitimate, because they change the meaning of future encrypt operations
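The detection Lambda's core rule logic can be sketched as a pure classification function over CloudTrail records, which keeps the two rules above testable without AWS. The role ARNs, allow-list, and event grouping below are illustrative assumptions; the field names (`eventName`, `errorCode`, `userIdentity.arn`) follow the CloudTrail record format.

```python
# Hypothetical allow-list; the real one would come from configuration.
ALLOWED_DECRYPT_PRINCIPALS = {
    "arn:aws:iam::111122223333:role/pii-handler",
    "arn:aws:iam::111122223333:role/break-glass-investigator",
}

# Control-plane changes alert even when legitimate, because they change the
# meaning of future encrypt operations.
HIGH_RISK_EVENTS = {"DisableKey", "ScheduleKeyDeletion", "UpdateAlias", "PutKeyPolicy"}

def classify_kms_event(event: dict) -> str:
    """Map one CloudTrail KMS record to an alert category."""
    name = event.get("eventName", "")
    principal = event.get("userIdentity", {}).get("arn", "")
    if name in HIGH_RISK_EVENTS:
        return "control-plane-change"
    if name == "Decrypt" and event.get("errorCode") == "AccessDenied":
        return "denied-decrypt"  # an alert, never an "all good" signal
    if name == "Decrypt" and principal not in ALLOWED_DECRYPT_PRINCIPALS:
        return "unexpected-decrypt"
    return "ok"

# A denied decrypt from an analytics role is an alert, not noise.
assert classify_kms_event(
    {"eventName": "Decrypt", "errorCode": "AccessDenied",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:role/analytics-batch"}}
) == "denied-decrypt"
```

Keeping classification separate from delivery (pager, SIEM) makes the rules unit-testable and lets the same logic serve both real-time alerting and incident drills.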
Scenario 5: Investigating Unauthorized Decrypt Attempts
This is the operational scenario interviewers often ask for after you explain the architecture.
Incident Flow
```mermaid
flowchart TD
    Event["CloudTrail event: denied kms:Decrypt on PII key"] --> Alert[Security alert created]
    Alert --> Identify[Identify principal and deployment version]
    Identify --> Inspect[Inspect code path and recent change]
    Inspect --> Decision{Benign bug or malicious behavior?}
    Decision -->|Bug| Fix[Remove decrypt call and redesign data access]
    Decision -->|Malicious or unclear| Contain[Disable path, isolate role, preserve evidence]
    Fix --> Guardrail[Add detection rule and review guardrails]
    Contain --> Guardrail
    Guardrail --> Close[Close with post-incident actions]
```
Example Investigation Narrative
- CloudTrail shows a denied `kms:Decrypt` attempt against `alias/mangaassist-pii`.
- The caller is an analytics batch role, which should never decrypt raw PII.
- Recent code added a metric job that tried to inspect encrypted session payloads directly.
- IAM blocked it, but the attempt still reveals a broken design assumption.
- The correct fix is to move the metric to guardrail metadata or blind indexes rather than widening decrypt access.
The design lesson is that IAM denial is the last line of defense, not the primary detection strategy. If the system is trying to decrypt where it should not, the architecture or the code path needs correction.
Operational Runbooks
Break-Glass Decrypt
Use this path only for narrow support or security investigations:
- Investigator receives an approved incident or support case ID.
- Temporary role session is tagged with `incident_approved=true`.
- MFA is required.
- Decrypt is limited to the minimum records needed.
- Every decrypt event is correlated with ticket ID, principal, and record ID.
- Session expires automatically.
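Most of these controls map onto standard IAM condition keys in the KMS key policy. The statement below is a sketch, not a tested policy: the account ID, role name, and the `incident_approved` tag name are placeholders, and the one-hour `aws:MultiFactorAuthAge` cap is an illustrative choice.

```python
import json

# Sketch of a KMS key-policy statement for the break-glass decrypt path.
break_glass_statement = {
    "Sid": "AllowBreakGlassDecrypt",
    "Effect": "Allow",
    "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/break-glass-investigator"
    },
    "Action": "kms:Decrypt",
    "Resource": "*",
    "Condition": {
        # Session tag set only by the approval workflow (requires sts:TagSession)
        "StringEquals": {"aws:PrincipalTag/incident_approved": "true"},
        # MFA must be present on the session
        "Bool": {"aws:MultiFactorAuthPresent": "true"},
        # Bound session age so access expires automatically (1 hour here)
        "NumericLessThan": {"aws:MultiFactorAuthAge": "3600"},
    },
}

print(json.dumps(break_glass_statement, indent=2))
```

Note what IAM cannot express: "minimum records needed" and ticket correlation still live in the application and audit pipeline, which is why the decrypt-event correlation step above is mandatory.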
"KMS Unavailable" Behavior
If KMS is degraded:
- new PII writes may fail closed if a fresh DEK is required and no cache is available
- existing cached DEKs allow a short grace window for encrypt operations already in flight
- decrypt-heavy support workflows should degrade before the customer chat path does
- audit logging should queue rather than write plaintext fallback files
The system should prefer temporary reduced functionality over silently persisting plaintext.
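A minimal sketch of these degradation rules, assuming a hypothetical injected `kms_generate_data_key` callable and a placeholder in place of real AES-GCM. The per-call DEK refresh is deliberately simplified; a production path would combine this with the DEK cache policy discussed elsewhere in this document.

```python
import time

class KmsUnavailable(Exception):
    """Raised by the injected KMS call when KMS is degraded."""

class EnvelopeEncryptor:
    GRACE_SECONDS = 300  # illustrative grace window for in-flight encrypts

    def __init__(self, kms_generate_data_key):
        self._generate = kms_generate_data_key  # stand-in for GenerateDataKey
        self._cached_dek = None
        self._cached_at = 0.0

    def encrypt(self, plaintext: bytes) -> bytes:
        try:
            # Prefer a fresh DEK whenever KMS is healthy.
            self._cached_dek = self._generate()
            self._cached_at = time.monotonic()
        except KmsUnavailable:
            age = time.monotonic() - self._cached_at
            if self._cached_dek is None or age > self.GRACE_SECONDS:
                # Fail closed: never persist plaintext as a fallback.
                raise
        return self._local_encrypt(self._cached_dek, plaintext)

    @staticmethod
    def _local_encrypt(dek: bytes, plaintext: bytes) -> bytes:
        # Placeholder for local AES-GCM under the cached DEK.
        return b"enc:" + plaintext
```

With a recently cached DEK, encrypts ride out a brief KMS outage; with no cache, new PII writes fail closed rather than silently persisting plaintext.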
Deletion and Retention
Deletion logic needs metadata to know what must be erased:
- primary session record deleted or TTL-expired
- related blind indexes removed
- audit evidence retained only if policy requires it, otherwise deleted on schedule
- secrets rotated and old versions retired
- backups handled through retention windows rather than piecemeal mutation
Encryption is not a substitute for deletion, but it can reduce exposure during retention windows.
Testing Strategy
- Unit tests verify that decrypt fails when encryption context or AAD does not match.
- Integration tests verify DynamoDB, S3, and log groups reject writes with the wrong key policy.
- Rotation tests verify old ciphertext remains readable after routine KMS rotation.
- Incident drills simulate unauthorized decrypt attempts and validate alerting, containment, and audit evidence.
- DR tests restore encrypted backups into a recovery environment and validate actual decrypt permissions.
- Performance tests compare P95 latency with cache enabled, cache disabled, and forced cache TTL zero.
Good encryption design is operational only when these tests are routine, not theoretical.
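The first bullet, decrypt failing on a context or AAD mismatch, can be unit-tested without AWS. The sketch below is stdlib-only, so AES-GCM's AAD binding is stood in for by an HMAC tag over the context plus payload; a real test suite would exercise an actual AES-GCM implementation, and the key and function names here are illustrative.

```python
import hashlib
import hmac
import json

KEY = b"unit-test-only-key"  # illustrative; real tests would use a test DEK

def seal(plaintext: bytes, context: dict) -> dict:
    """Bind a payload to its encryption context (stand-in for AES-GCM AAD)."""
    aad = json.dumps(context, sort_keys=True).encode("utf-8")
    tag = hmac.new(KEY, aad + plaintext, hashlib.sha256).hexdigest()
    return {"data": plaintext, "tag": tag}

def open_sealed(blob: dict, context: dict) -> bytes:
    """Fail when the supplied context does not match the sealing context."""
    aad = json.dumps(context, sort_keys=True).encode("utf-8")
    expected = hmac.new(KEY, aad + blob["data"], hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, blob["tag"]):
        raise ValueError("encryption context mismatch")
    return blob["data"]

blob = seal(b"payload", {"tenant": "a", "field": "email"})
assert open_sealed(blob, {"tenant": "a", "field": "email"}) == b"payload"
try:
    open_sealed(blob, {"tenant": "b", "field": "email"})
    raise AssertionError("decrypt must fail on wrong context")
except ValueError:
    pass
```

The point of the test is the negative case: a ciphertext copied to the wrong tenant or field must refuse to decrypt, which is exactly what encryption context buys you in the real KMS path.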
Architecture Decisions and Tradeoffs
| Decision | What We Chose | Alternative | Upside | Downside |
|---|---|---|---|---|
| Key separation | Separate app, PII, audit, and secrets keys | One shared CMK | Smaller blast radius, better least privilege | More policy and lifecycle management |
| PII encryption style | Envelope encryption with AES-GCM and cached DEKs | Direct KMS per field | Lower latency and cost | In-memory DEKs require careful containment |
| Store encryption | SSE-KMS everywhere | Service-owned default encryption | Better auditability and key control | More operational overhead |
| Audit isolation | Separate audit key and S3 object lock | Reuse app key and mutable logs | Stronger forensic integrity | Harder access workflow |
| Search on sensitive IDs | Blind indexes only | Plaintext indexing | Reduces search-side leakage | Supports exact match only, not full-text |
| DR strategy | Active-passive with tested key mapping | Active-active from day one | Simpler operations | Slower failover and more manual preparation |
Follow-Up Questions and Deep-Dive Answers
Question 1: Why not use one CMK for everything? It is simpler.
Deep-dive answer: Simpler key inventory creates a more dangerous trust model. If the same CMK protects sessions, PII, audit logs, and secrets, then any policy error, overly broad grant, or investigative decrypt path becomes a cross-system exposure event. Separate keys let us align access with business purpose: app runtimes need app data, PII handlers need sensitive fields, investigators need audit evidence, and secrets access should flow through Secrets Manager. The extra key count is operational overhead, but it buys much tighter blast-radius control.
Question 2: Why not call KMS directly for every encrypt and decrypt?
Deep-dive answer: Because KMS is a trust anchor, not a per-field data plane. Direct KMS encryption increases latency, cost, and throttle sensitivity, especially in a chat product with many short fields. Envelope encryption keeps KMS in the key-distribution role while local AES-GCM handles bulk work efficiently. The control point stays strong because the DEK is still protected by KMS and bound to an encryption context. We accept limited in-memory DEK exposure in exchange for a practical hot path.
Question 3: What is the real risk of caching data keys in memory?
Deep-dive answer: The risk is not hypothetical. If a runtime is compromised, the attacker may extract a plaintext DEK from memory and decrypt data written under that DEK. That is why cache scope, TTL, and use count matter. We keep the cache per runtime instance, never share it, never persist it, and cap both age and reuse. During an incident we can drain workers and set TTL to zero. The design accepts a bounded local risk to avoid an unbounded availability and latency problem.
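The cache policy just described can be sketched as a small per-runtime object, with `fetch_dek` standing in for the KMS `GenerateDataKey` call; the TTL and use-count defaults are illustrative assumptions, not recommended values.

```python
import time

class BoundedDekCache:
    """Per-runtime DEK cache: capped age, capped reuse, never shared,
    never persisted. A TTL of 0 forces every operation back to KMS
    (incident mode)."""

    def __init__(self, fetch_dek, ttl_seconds: float = 300, max_uses: int = 1000):
        self._fetch = fetch_dek          # stand-in for KMS GenerateDataKey
        self.ttl_seconds = ttl_seconds
        self.max_uses = max_uses
        self._dek = None
        self._born = 0.0
        self._uses = 0

    def get(self) -> bytes:
        expired = (
            self._dek is None
            or self._uses >= self.max_uses
            or time.monotonic() - self._born > self.ttl_seconds
        )
        if expired:
            self._dek = self._fetch()
            self._born = time.monotonic()
            self._uses = 0
        self._uses += 1
        return self._dek
```

During containment, draining workers destroys these in-memory caches, and setting `ttl_seconds = 0` makes any surviving runtime refuse to reuse a possibly exposed DEK.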
Question 4: How do you rotate keys without rewriting all old data?
Deep-dive answer: Routine KMS rotation does not change the key ID, so KMS retains older backing material and continues to decrypt old ciphertext transparently. No rewrite is needed. A rewrite becomes necessary only when we intentionally move to a brand-new CMK and want to retire the old one. That usually happens for policy changes, account moves, or incident containment, not for routine annual rotation.
Question 5: What if an analytics team wants to report on encrypted PII fields?
Deep-dive answer: The correct first answer is usually "they should not need raw PII." Analytics should consume metadata, counters, tokenized values, or blind indexes. If the ask is legitimate support lookup, we use a narrow decrypt workflow over a minimal dataset. If the ask is a dashboard, we redesign the data product. Pulling decrypted PII into an analytics environment defeats most of the point of encrypting it in the first place.
Question 6: How do you search encrypted data if a customer needs account help?
Deep-dive answer: We avoid full-text search on raw sensitive fields. For exact-match lookup cases like email or order reference, we store a blind index such as HMAC-SHA256 using a separate lookup key. That gives deterministic equality search without exposing plaintext in the index. Once a match is found, the support path can request controlled decrypt of the specific record if policy allows it. We trade search flexibility for a much tighter privacy posture.
Question 7: How do you stop key administrators from becoming data readers?
Deep-dive answer: By separating key administration from decrypt rights in both IAM and KMS policy. The KMS admin role can manage aliasing, tagging, and rotation settings, but it does not get kms:Decrypt. Break-glass investigation is a different role with MFA, ticket tagging, and short sessions. This is one of the clearest places to show mature separation of duties in a design review.
Question 8: What is your response if you see unauthorized decrypt attempts on the PII key?
Deep-dive answer: First confirm whether the attempts were denied or successful, but treat both as incidents. If denied, investigate the caller, deployment, and code path because something tried to cross a boundary it should not cross. If successful and unexpected, contain immediately: revoke or isolate the role, stop new traffic if needed, preserve logs, assess what records were accessible, and decide whether alias swap or re-encryption is required. The biggest mistake is dismissing denied attempts as harmless noise.
Question 9: What breaks first if KMS is slow or throttled?
Deep-dive answer: The first symptom is usually latency on workflows that need fresh DEKs or secret retrieval. With a healthy cache, the chat path can often tolerate a brief KMS degradation. Without cache, the system amplifies the problem into a customer-facing incident. That is why we measure KMS latency, DEK cache miss ratio, and secret cache miss ratio together. It is also why fail-closed logic needs queueing or retry behavior instead of plaintext fallback writes.
Question 10: How do backups and DR interact with encrypted data?
Deep-dive answer: Backups are only recoverable if the recovery environment can decrypt them. The DR design therefore includes destination-region keys, alias mapping, restore-role grants, and periodic restore tests. A common operational bug is to replicate data but forget to validate the decrypt path until a disaster. The DR story is incomplete unless you have demonstrated end-to-end restore of encrypted data in a separate environment.
Question 11: What would you change at 10x scale?
Deep-dive answer: I would not start by weakening encryption. I would first reduce avoidable decrypts, push more analytics to metadata, tune DEK cache policy carefully, add request batching where it preserves isolation, and consider broader use of async persistence. If the system becomes multi-region active-active, I would revisit which datasets need multi-Region keys and which should remain region-scoped. Scale pressure should push architectural clarity, not cryptographic shortcuts.
Question 12: What is the single most important implementation detail people forget?
Deep-dive answer: They forget that encryption context, alias policy, restore grants, and operational telemetry are part of the design, not afterthoughts. Many diagrams stop at "store uses KMS." Real systems fail on the edges: wrong role gets decrypt, old key disabled too soon, DR restore lacks grants, or a cache bug floods KMS. The implementation details around lifecycle and monitoring are where mature answers separate from surface-level ones.
Key Lessons
- Encryption architecture should mirror data sensitivity and access patterns, not just compliance labels.
- KMS-backed store encryption is necessary, but it does not replace field-level protection for truly sensitive data.
- Routine rotation, emergency alias swap, and full re-encryption are different operations with different risks.
- The practical risk surface is usually role misuse, logging mistakes, and restore failure, not broken AES.
- If the design cannot explain analytics, DR, and break-glass access, it is not a complete encryption design.
Cross-References
- PII data handling: 02-pii-protection-data-privacy.md
- Guardrail pipeline latency and async stages: 03-guardrails-pipeline-deep-dive.md
- Incident response and forensics: 05-incident-response-forensics.md
- System HLD: 04-architecture-hld.md
- System LLD: 04b-architecture-lld.md
- Security overview: 12-security-privacy.md