
8. Encryption and Key Management

Encryption in MangaAssist is not a checkbox feature. It is the control set that limits blast radius when the application is wrong, when a role is over-permissioned, when logs are copied, when backups are restored, or when a downstream service gets compromised. The hard part is rarely "did we turn on AES-256?" The hard part is deciding where encryption belongs in the dataflow, which key should protect which asset, who is allowed to decrypt, how rotation happens without downtime, and how misuse gets detected before it becomes a breach.

This document goes deeper than the baseline architecture and answers seven engineering questions:

  1. What exactly is encrypted in MangaAssist, and which key protects each data class?
  2. How does the runtime use KMS without putting a network round trip on every field operation?
  3. What does the high-level design look like, including trust boundaries and control-plane ownership?
  4. What does the low-level implementation look like for envelope encryption, secrets retrieval, storage, and transit protection?
  5. How do planned rotation, emergency rotation, break-glass access, and disaster recovery actually work?
  6. What failure modes matter most in practice, and how do we monitor for them?
  7. What follow-up questions should you expect in a design review or interview, and what are strong deep-dive answers?

Why This Matters for MangaAssist

MangaAssist handles several different kinds of data, and they should not all be protected the same way:

| Data Class | Examples | Risk if Exposed | Encryption Pattern |
|---|---|---|---|
| Public product data | ASIN, title, author, format | Low direct privacy risk, moderate business integrity risk | Transport encryption only; no field-level encryption |
| Internal operational data | Session IDs, rate-limit counters, feature flags | Service abuse, internal recon | TLS + table/object encryption |
| Sensitive customer context | Name, email, shipping address, phone, order references | Direct privacy breach, fraud, legal exposure | Field-level envelope encryption + store-level encryption |
| Audit evidence | Prompt/response metadata, guardrail decisions, incident evidence | Tampering risk, forensic loss | Separate audit CMK + object lock + immutable retention |
| Secrets | API credentials, service auth material, webhook secrets | Credential theft, lateral movement | Secrets Manager or Parameter Store with dedicated CMK |

The architecture therefore has to answer three different questions at once:

  • How do we keep data encrypted at rest and in transit?
  • How do we stop broad decryption rights from spreading across the system?
  • How do we preserve availability when rotation or incident response happens under load?

Design Goals

  1. Use different keys for different trust zones so one policy mistake does not expose every store.
  2. Keep the synchronous chat path fast enough for a responsive experience.
  3. Make decryption rare, explicit, logged, and attributable to a narrow runtime role.
  4. Separate key administration from data access so security admins do not automatically become data readers.
  5. Support routine rotation with no downtime and emergency containment with a clear blast-radius model.
  6. Ensure restore, replication, analytics, and deletion workflows remain possible after encryption is added.

Core Cryptography Concepts

| Term | Meaning | Why It Matters Here |
|---|---|---|
| CMK / KEK | Customer-managed KMS key that protects other keys or service-side encryption | Defines access policy, audit trail, and blast radius |
| DEK | Data encryption key generated for actual payload encryption | Used locally for fast AES-GCM operations |
| Envelope encryption | KMS protects the DEK; the application uses the DEK for data | Avoids a KMS call for every payload block |
| Encryption context | Non-secret metadata bound to a KMS operation | Prevents a ciphertext from being reused in the wrong workflow |
| Alias | Stable name that points to a KMS key | Lets applications stay constant while key backing changes |
| Grant | Narrow delegated permission for a service to use a KMS key | Useful for managed services and controlled automation |
| Bucket key | S3 optimization that reduces repeated KMS calls | Important for high-volume audit logging |

Two design rules drive the rest of the document:

  • KMS is the root of trust, not the bulk encryption engine.
  • The application should only hold plaintext DEKs in memory, for a short time, in a narrow runtime.

Threat Model and Trust Boundaries

Main Failure Modes

| Failure Mode | Example | Impact | First Control |
|---|---|---|---|
| Over-broad decryption rights | Analytics role gets kms:Decrypt on PII key | Silent privacy breach | Separate PII CMK + role isolation |
| KMS on hot path everywhere | One KMS call per PII field | Latency spikes, throttling, cost growth | Envelope encryption + cache limits |
| Logs store sensitive data before redaction | Application logs raw address or email | Wide operational exposure | Pre-log redaction + log group KMS |
| Backup restore misses key permissions | Data restores but app cannot decrypt | Operational outage during DR | Restore playbook includes KMS grants validation |
| Key compromise response is too slow | Suspicious decrypts continue for hours | Larger blast radius | CloudTrail alerting + role containment runbook |
| Cached DEK leaked from runtime memory | Compromised warm container | Limited plaintext exposure | Per-instance cache, TTL, usage caps, rapid drain |
| Key admin can also read data | Same team manages keys and app roles | Separation-of-duties failure | Distinct admin and decrypt roles |

Trust Boundary View

flowchart TB
    subgraph Untrusted["Untrusted / User-Controlled"]
        User[User message]
        Browser[Web or mobile client]
        Hist[Conversation text]
    end

    subgraph App["Application Decision Layer"]
        Gateway[API Gateway]
        Orch[Chat orchestrator]
        Guard[Guardrails and privacy policy]
        PII[PII encryption handler]
        SecretCache[Secret cache]
    end

    subgraph Crypto["Cryptographic Control Plane"]
        KMS[AWS KMS CMKs]
        SM[Secrets Manager]
        IAM[IAM roles and key policies]
        CT[CloudTrail]
    end

    subgraph Stores["Persistent Stores"]
        DDB[DynamoDB session store]
        S3[S3 audit archive]
        OS[OpenSearch index]
        CW[CloudWatch logs]
    end

    subgraph SecOps["Security Operations"]
        Alerts[EventBridge and alerts]
        SIEM[Security analytics]
    end

    Browser --> Gateway
    Hist --> Orch
    User --> Gateway
    Gateway --> Orch
    Orch --> Guard
    Guard --> PII
    Orch --> SecretCache
    SecretCache --> SM
    PII --> KMS
    SM --> KMS
    DDB --> KMS
    S3 --> KMS
    OS --> KMS
    CW --> KMS
    IAM --> KMS
    CT --> Alerts
    Alerts --> SIEM

Key boundary rule: raw sensitive fields may exist ephemerally in memory when needed to serve the request, but persistence boundaries should receive either redacted values or encrypted fields plus enough metadata to decrypt them later in a controlled path.


Control Matrix

| Asset | Store | At-Rest Control | Field-Level Control | In-Transit Control | Decrypting Runtime | Primary Key |
|---|---|---|---|---|---|---|
| Session metadata | DynamoDB | SSE-KMS | None | TLS over VPC endpoint | Chat orchestrator | alias/mangaassist-app |
| Sensitive chat fields | DynamoDB | SSE-KMS | Envelope encryption with AES-256-GCM | TLS over VPC endpoint | PII handler only | alias/mangaassist-pii |
| Audit evidence | S3 + CloudWatch | SSE-KMS + object lock | Optional field tokenization only | TLS + VPC endpoint | Security investigator role | alias/mangaassist-audit |
| Search index fragments | OpenSearch | SSE-KMS + node-to-node encryption | Do not index raw PII | HTTPS enforced | Search service | alias/mangaassist-app |
| Secrets | Secrets Manager / SSM | SSE-KMS | N/A | TLS + SigV4 | Service runtime through secrets API | alias/mangaassist-secrets |
| Cache entries | ElastiCache | At-rest encryption | No raw PII cache | In-transit encryption | Private app subnets only | alias/mangaassist-app |

Two important consequences fall out of this table:

  • Some data is protected twice: once by the store and once at the field level.
  • Search and analytics are intentionally designed to avoid decrypting raw PII whenever possible.

High-Level Design (HLD)

System Overview

graph TB
    subgraph Client["Client and Edge"]
        U[User]
        FE[Web or mobile client]
        GW[API Gateway]
    end

    subgraph Runtime["MangaAssist Runtime"]
        AUTH[Auth and session]
        ORCH[Chat orchestrator]
        GRD[Guardrails and privacy policy]
        PII[PII encryption service]
        AUD[Async audit writer]
    end

    subgraph AI["Model and Service Layer"]
        LLM[Bedrock model]
        CAT[Catalog and order services]
        REC[Recommendation services]
    end

    subgraph Data["Persistent Stores"]
        DDB[DynamoDB sessions]
        S3[S3 audit archive]
        OS[OpenSearch]
        CW[CloudWatch logs]
        SEC[Secrets Manager]
    end

    subgraph Control["Crypto and Security Control Plane"]
        KMS[KMS key ring]
        IAM[IAM and key policies]
        CT[CloudTrail]
        EVT[EventBridge alerts]
    end

    subgraph DR["Recovery"]
        DRS3[Replica audit bucket]
        DRKMS[DR region keys]
    end

    U --> FE --> GW
    GW --> AUTH
    GW --> ORCH
    AUTH --> ORCH
    ORCH --> GRD
    GRD --> LLM
    ORCH --> CAT
    ORCH --> REC
    ORCH --> PII
    ORCH --> DDB
    ORCH --> OS
    ORCH --> AUD
    AUD --> S3
    ORCH --> CW
    ORCH --> SEC
    PII --> KMS
    DDB --> KMS
    S3 --> KMS
    OS --> KMS
    CW --> KMS
    SEC --> KMS
    IAM --> KMS
    CT --> EVT
    S3 --> DRS3
    DRS3 --> DRKMS

HLD Principles

  1. The data plane and key control plane are separate. The app can request cryptographic operations, but policy authority remains in KMS and IAM.
  2. PII decryption is concentrated in one narrow runtime path instead of being spread across the orchestrator, analytics, indexing, and logging pipelines.
  3. Managed store encryption is always enabled, but it is not treated as a substitute for field-level protection of high-sensitivity data.
  4. Aliases give stable integration points for applications, while underlying keys can rotate or be replaced.
  5. Audit evidence uses a different key from the chat session store because forensic access patterns and retention rules are different.
  6. Disaster recovery planning includes keys, grants, aliases, and restore scripts, not just data copies.

Key Hierarchy

| Key | Alias | Protected Assets | Rotation Style | Who Can Administer | Who Can Decrypt |
|---|---|---|---|---|---|
| App key | alias/mangaassist-app | DynamoDB table SSE, OpenSearch, cache, low-sensitivity app data | KMS automatic annual rotation | Security platform | Managed services and orchestrator runtime where needed |
| PII key | alias/mangaassist-pii | Field-level encrypted names, emails, addresses, phone numbers, order references | KMS automatic annual rotation; emergency alias swap if incident | Security platform only | PII handler runtime, break-glass investigator role |
| Audit key | alias/mangaassist-audit | S3 audit archive, immutable evidence, security log groups | KMS automatic annual rotation; cross-region replica pairing | Security platform | Audit reader or incident responder only |
| Secrets key | alias/mangaassist-secrets | Secrets Manager, secure parameters, rotation metadata | KMS automatic annual rotation | Security platform | Secrets Manager service path and approved runtimes |

The major design choice is that alias/mangaassist-pii is not reused for logs, search, or secrets. That keeps the highest-risk decrypt path isolated.


Scenario 1: Normal Request Dataflow

The standard request path matters because encryption design is only useful if it fits the latency budget.

sequenceDiagram
    participant User
    participant Gateway
    participant Orch as Orchestrator
    participant Guard as Guardrails
    participant Secrets as Secrets Manager
    participant KMS
    participant PII as PII Encryptor
    participant DDB
    participant Audit as Audit Writer
    participant S3

    User->>Gateway: Chat message
    Gateway->>Orch: Authenticated request
    Orch->>Secrets: Get downstream credentials if cache miss
    Secrets->>KMS: Decrypt secret version
    KMS-->>Secrets: Plaintext secret in service memory
    Secrets-->>Orch: Secret value
    Orch->>Guard: Build prompt with approved fields only
    Guard-->>Orch: Safe model request
    Orch->>PII: Encrypt fields that may persist
    PII->>KMS: GenerateDataKey(classification=pii, purpose=chat_storage)
    KMS-->>PII: Plaintext DEK + encrypted DEK
    PII-->>Orch: Ciphertext payload + metadata
    Orch->>DDB: Write session item with field ciphertext
    Orch-->>Gateway: User response
    Gateway-->>User: Final response
    Orch->>Audit: Async audit event
    Audit->>S3: Store immutable audit object with SSE-KMS

Synchronous vs Asynchronous Split

| Path Segment | Relationship to the Response | Optimization |
|---|---|---|
| Secret retrieval | Needed before calling downstream APIs | In-memory secret cache to reduce repeated reads |
| Field encryption before persistence | Required before writing sensitive data | Keep local AES work in process; avoid repeated KMS calls |
| Audit archival | Not needed to answer the user | Push to async writer and S3 |
| Re-encryption migrations | Correctness work, not part of the response | Always background |
The big practical lesson is that encryption belongs close to persistence boundaries, not in the middle of every intermediate transformation stage.


Low-Level Design (LLD)

1. Envelope Encryption for PII Fields

Why We Use Envelope Encryption

Directly calling KMS to encrypt every field is a poor fit for chat workloads:

  • KMS adds network latency.
  • KMS requests are metered and can throttle.
  • KMS has request-size limits.
  • The app often needs to encrypt many small fields in a single request.

Envelope encryption fixes that by generating a DEK once and using local AES-GCM for the actual payload.
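
A rough back-of-envelope comparison makes the motivation concrete. The traffic numbers below are illustrative assumptions, not measured MangaAssist figures:

```python
# Illustrative load assumptions (not measured values).
FIELDS_PER_REQUEST = 20      # sensitive fields encrypted per chat turn
REQUESTS_PER_SECOND = 100
DEK_TTL_SECONDS = 900        # reuse window from the DEK cache policy
WARM_INSTANCES = 10          # each warm runtime holds its own cached DEK

# Direct KMS: one network round trip per field.
direct_kms_calls_per_second = FIELDS_PER_REQUEST * REQUESTS_PER_SECOND  # 2000

# Envelope encryption: roughly one GenerateDataKey per instance per TTL
# window, regardless of request volume.
cached_kms_calls_per_second = WARM_INSTANCES / DEK_TTL_SECONDS  # ~0.011
```

Even under conservative assumptions, caching shifts KMS from a per-field dependency to a background refresh, which is why the throttling and latency concerns above mostly disappear.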

LLD Flow

sequenceDiagram
    participant App
    participant Cache as DEK Cache
    participant KMS
    participant DDB

    App->>Cache: Request DEK for pii/session_turn
    alt Cache hit and usage budget remains
        Cache-->>App: Plaintext DEK + encrypted DEK
    else Cache miss or expired
        App->>KMS: GenerateDataKey(AES_256, encryption context)
        KMS-->>App: Plaintext DEK + encrypted DEK
        App->>Cache: Store DEK in memory only
    end
    App->>App: AES-GCM encrypt field with nonce + AAD
    App->>DDB: Store ciphertext + encrypted DEK + metadata

Encryption Context Contract

We bind every KMS call to an encryption context so ciphertext cannot be trivially reused in a different workflow:

| Context Key | Example | Purpose |
|---|---|---|
| app | mangaassist | Prevent cross-application confusion |
| classification | pii | Distinguish sensitive data from generic app state |
| record_type | session_turn | Limit reuse across entity types |
| purpose | chat_storage | Separate storage from incident investigation or export |
| tenant_scope | jp-store | Useful if the system expands across stores or locales |
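
A small guard at the call site can enforce this contract before any KMS request is made. This is a sketch; `validate_encryption_context` is a hypothetical helper, not part of any AWS SDK:

```python
# Contract keys from the table above; every KMS call must carry all of them.
REQUIRED_CONTEXT_KEYS = {"app", "classification", "record_type", "purpose", "tenant_scope"}

def validate_encryption_context(context: dict) -> None:
    # Reject calls that omit contract keys or belong to another application.
    missing = REQUIRED_CONTEXT_KEYS - context.keys()
    if missing:
        raise ValueError(f"encryption context missing keys: {sorted(missing)}")
    if context.get("app") != "mangaassist":
        raise ValueError("ciphertext does not belong to this application")

validate_encryption_context({
    "app": "mangaassist",
    "classification": "pii",
    "record_type": "session_turn",
    "purpose": "chat_storage",
    "tenant_scope": "jp-store",
})  # passes silently
```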

Encrypted Field Shape

{
  "alg": "AES-256-GCM",
  "key_alias": "alias/mangaassist-pii",
  "encrypted_data_key": "base64-kms-ciphertext",
  "nonce": "base64-12-byte-nonce",
  "ciphertext": "base64-aes-gcm-output",
  "aad_sha256": "hex-digest-of-associated-data",
  "encryption_context": {
    "app": "mangaassist",
    "classification": "pii",
    "record_type": "session_turn",
    "purpose": "chat_storage",
    "tenant_scope": "jp-store"
  },
  "created_at": "2026-03-24T20:15:00Z"
}

Example Implementation

import base64
import hashlib
import json
import os
import time
from dataclasses import dataclass

import boto3
from cryptography.hazmat.primitives.ciphers.aead import AESGCM


@dataclass
class CachedDataKey:
    plaintext_key: bytes
    encrypted_key: bytes
    expires_at: float
    uses_left: int


class PIIFieldEncryptor:
    def __init__(self, kms_key_id: str, ttl_seconds: int = 900, max_uses: int = 1000):
        self.kms = boto3.client("kms")
        self.kms_key_id = kms_key_id
        self.ttl_seconds = ttl_seconds
        self.max_uses = max_uses
        self._cache: dict[str, CachedDataKey] = {}

    def _context(self, tenant_scope: str, record_type: str, purpose: str) -> dict[str, str]:
        return {
            "app": "mangaassist",
            "classification": "pii",
            "tenant_scope": tenant_scope,
            "record_type": record_type,
            "purpose": purpose,
        }

    def _cache_key(self, tenant_scope: str, record_type: str, purpose: str) -> str:
        return f"{tenant_scope}:{record_type}:{purpose}"

    def _get_data_key(self, tenant_scope: str, record_type: str, purpose: str) -> CachedDataKey:
        now = time.time()
        key = self._cache_key(tenant_scope, record_type, purpose)
        entry = self._cache.get(key)

        if entry and entry.expires_at > now and entry.uses_left > 0:
            entry.uses_left -= 1
            return entry

        response = self.kms.generate_data_key(
            KeyId=self.kms_key_id,
            KeySpec="AES_256",
            EncryptionContext=self._context(tenant_scope, record_type, purpose),
        )

        entry = CachedDataKey(
            plaintext_key=response["Plaintext"],
            encrypted_key=response["CiphertextBlob"],
            expires_at=now + self.ttl_seconds,
            uses_left=self.max_uses - 1,
        )
        self._cache[key] = entry
        return entry

    def encrypt_field(self, plaintext: str, tenant_scope: str, record_id: str) -> dict:
        record_type = "session_turn"
        purpose = "chat_storage"
        context = self._context(tenant_scope, record_type, purpose)
        entry = self._get_data_key(tenant_scope, record_type, purpose)

        aad = json.dumps(
            {
                "record_id": record_id,
                "tenant_scope": tenant_scope,
                "record_type": record_type,
            },
            separators=(",", ":"),
            sort_keys=True,
        ).encode()

        nonce = os.urandom(12)
        ciphertext = AESGCM(entry.plaintext_key).encrypt(nonce, plaintext.encode(), aad)

        return {
            "alg": "AES-256-GCM",
            "key_alias": self.kms_key_id,
            "encrypted_data_key": base64.b64encode(entry.encrypted_key).decode(),
            "nonce": base64.b64encode(nonce).decode(),
            "ciphertext": base64.b64encode(ciphertext).decode(),
            "aad_sha256": hashlib.sha256(aad).hexdigest(),
            "encryption_context": context,
            "record_id": record_id,
            "created_at": int(time.time()),
        }

    def decrypt_field(self, encrypted_blob: dict, tenant_scope: str, record_id: str) -> str:
        context = encrypted_blob["encryption_context"]
        aad = json.dumps(
            {
                "record_id": record_id,
                "tenant_scope": tenant_scope,
                "record_type": "session_turn",
            },
            separators=(",", ":"),
            sort_keys=True,
        ).encode()

        response = self.kms.decrypt(
            CiphertextBlob=base64.b64decode(encrypted_blob["encrypted_data_key"]),
            EncryptionContext=context,
        )

        plaintext_key = response["Plaintext"]
        plaintext = AESGCM(plaintext_key).decrypt(
            base64.b64decode(encrypted_blob["nonce"]),
            base64.b64decode(encrypted_blob["ciphertext"]),
            aad,
        )
        return plaintext.decode()

Implementation notes:

  • AES-GCM gives confidentiality plus integrity.
  • AAD binds the ciphertext to a record identifier so payload swapping becomes detectable.
  • The cache is in memory only. It is never shared across Lambda instances and never written to Redis or disk.
  • The app limits both time-based reuse and count-based reuse of a cached DEK.

DEK Cache Policy

| Parameter | Value | Why |
|---|---|---|
| Cache scope | Per warm runtime instance | Limits horizontal blast radius |
| TTL | 15 minutes | Low enough for containment, high enough to avoid KMS on every request |
| Max uses | 1,000 encryptions | Caps exposure if memory is compromised |
| Storage medium | Process memory only | No persistent plaintext key material |
| Cold start behavior | Fresh DEK request | Avoids stale or cross-instance reuse |
| Emergency mode | TTL forced to 0 | Lets security disable reuse during incident containment |

The exact values can move, but the policy needs both a time bound and a usage bound. Time alone is not enough during bursts, and usage alone is not enough during long-lived warm runtimes.
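
The dual-bound policy is small enough to state directly. A minimal sketch, with `DekCachePolicy` as an illustrative helper and emergency mode modeled as a plain flag:

```python
from dataclasses import dataclass

@dataclass
class DekCachePolicy:
    # Both bounds are required: time alone fails during bursts,
    # usage alone fails in long-lived warm runtimes.
    ttl_seconds: float = 900.0
    max_uses: int = 1000
    emergency_mode: bool = False   # security flips this to disable reuse

    def reusable(self, issued_at: float, uses: int, now: float) -> bool:
        if self.emergency_mode:                  # TTL effectively forced to 0
            return False
        if now - issued_at >= self.ttl_seconds:  # time bound
            return False
        return uses < self.max_uses              # usage bound

policy = DekCachePolicy()
assert policy.reusable(issued_at=0.0, uses=10, now=60.0)
assert not policy.reusable(issued_at=0.0, uses=1000, now=60.0)  # usage cap hit
assert not policy.reusable(issued_at=0.0, uses=10, now=901.0)   # TTL expired
policy.emergency_mode = True
assert not policy.reusable(issued_at=0.0, uses=0, now=1.0)      # containment
```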

2. Store-Level Encryption by Service

DynamoDB Session Store

DynamoDB table encryption is mandatory, but it is not the whole story. Sensitive fields are already field-encrypted before the item is written.

Resources:
  SessionsTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: mangaassist-sessions
      BillingMode: PAY_PER_REQUEST
      SSESpecification:
        SSEEnabled: true
        SSEType: KMS
        KMSMasterKeyId: alias/mangaassist-app

Recommended item pattern:

  • top-level non-sensitive attributes stay queryable
  • sensitive fields are moved into an encrypted_fields map
  • the app stores classification metadata so retention and deletion workflows know what exists without decrypting it
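
A hypothetical item builder following that pattern might look like this; the attribute names `encrypted_fields`, `field_classification`, and `retention_class` are illustrative, not an existing schema:

```python
def build_session_item(session_id: str, turn: int, encrypted_email: dict) -> dict:
    # Queryable metadata stays top-level; ciphertext lives only inside
    # encrypted_fields; classification tags let retention and deletion
    # jobs reason about the item without decrypting anything.
    return {
        "pk": f"SESSION#{session_id}",
        "sk": f"TURN#{turn:06d}",
        "status": "active",                # non-sensitive, queryable
        "encrypted_fields": {
            "email": encrypted_email,      # envelope-encrypted blob
        },
        "field_classification": {"email": "pii"},
        "retention_class": "chat_90d",
    }

item = build_session_item("abc123", 7, {"alg": "AES-256-GCM", "ciphertext": "..."})
assert item["sk"] == "TURN#000007"
assert "email" not in item                 # no plaintext PII at the top level
```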

S3 Audit Archive

Audit evidence has different requirements from chat history:

  • it must be hard to tamper with
  • it must be readable only by a narrow investigation path
  • it often lives longer than session memory

Resources:
  AuditBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: mangaassist-audit-logs
      ObjectLockEnabled: true
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: aws:kms
              KMSMasterKeyID: alias/mangaassist-audit
            BucketKeyEnabled: true

Key S3 controls:

  • deny PutObject unless aws:kms is used
  • deny writes without the expected audit key
  • enable object lock or retention policy for evidence classes
  • replicate to DR with destination-region key mapping
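
The first two controls can be expressed as a bucket policy deny statement. This is a sketch of the pattern, not the deployed policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAuditWritesWithoutKMS",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::mangaassist-audit-logs/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "aws:kms"
        }
      }
    }
  ]
}
```

A second statement using the s3:x-amz-server-side-encryption-aws-kms-key-id condition key can additionally pin writes to the expected audit key.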

OpenSearch

OpenSearch should not become a side door around encryption:

  • domain-level SSE-KMS enabled
  • node-to-node encryption enabled
  • fine-grained access control on index patterns
  • raw PII not indexed

If search on a sensitive identifier is unavoidable, store a blind index instead of plaintext. For example:

  • email_lookup_hmac = HMAC_SHA256(separate_lookup_key, normalized_email)
  • search by exact HMAC match
  • decrypt only the matching records later through the PII handler

This keeps exact-match lookups possible without turning the index into a plaintext leak.
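
A minimal blind-index sketch using only the standard library; `email_lookup_hmac` and the normalization choices are illustrative assumptions:

```python
import hashlib
import hmac
import unicodedata

def email_lookup_hmac(lookup_key: bytes, email: str) -> str:
    # Normalization must be deterministic, or lookups will silently miss.
    normalized = unicodedata.normalize("NFC", email).strip().lower()
    return hmac.new(lookup_key, normalized.encode(), hashlib.sha256).hexdigest()

# The lookup key must be a separate secret, never the PII DEK or CMK.
key = b"example-separate-lookup-key"
a = email_lookup_hmac(key, "  Reader@Example.com ")
b = email_lookup_hmac(key, "reader@example.com")
assert a == b          # same identity maps to the same blind index
assert len(a) == 64    # SHA-256 hex digest; no plaintext in the index
```

Because HMAC is keyed, an attacker who dumps the index cannot brute-force common emails without also stealing the separate lookup key.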

CloudWatch Logs

Application logs should be redacted before emission, but the log groups still use KMS because logs often contain operational metadata, error payloads, and incident breadcrumbs.

Resources:
  AppLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: /aws/lambda/mangaassist-orchestrator
      KmsKeyId: alias/mangaassist-audit
      RetentionInDays: 30
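
The redact-before-emission rule can be sketched as a pattern-based filter. A real pipeline needs broader coverage (phone numbers, addresses) and is better served by allow-list structured logging; this only illustrates the shape:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(message: str) -> str:
    # Replace anything email-shaped before the line reaches CloudWatch.
    return EMAIL_RE.sub("[REDACTED_EMAIL]", message)

assert redact("lookup failed for reader@example.com") == "lookup failed for [REDACTED_EMAIL]"
```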

Secrets Manager and Secure Parameters

Secrets are not configuration by another name. They have different rotation and access semantics.

from aws_secretsmanager_caching import SecretCache, SecretCacheConfig
import boto3

secrets_client = boto3.client("secretsmanager")
cache = SecretCache(
    config=SecretCacheConfig(max_cache_size=128, default_ttl=300),
    client=secrets_client,
)

def get_downstream_secret(secret_id: str) -> str:
    return cache.get_secret_string(secret_id)

The critical policy point is that application roles read secrets through the Secrets Manager API. They do not need broad kms:Decrypt on the secrets key outside that service path.

3. Transit Protection

Network Path View

flowchart LR
    subgraph VPC["Private VPC"]
        Lambda[Orchestrator / PII handler]
        VPCEKMS[KMS VPC endpoint]
        VPCES3[S3 VPC endpoint]
        VPCEDDB[DynamoDB VPC endpoint]
        VPCEBR[Bedrock VPC endpoint]
    end

    Lambda --> VPCEKMS --> KMS[KMS]
    Lambda --> VPCES3 --> S3[S3]
    Lambda --> VPCEDDB --> DDB[DynamoDB]
    Lambda --> VPCEBR --> BR[Bedrock]

Transit Controls by Connection

| Connection | Primary Controls | Notes |
|---|---|---|
| User to API Gateway | TLS 1.2+ | Browser and mobile clients terminate at the edge |
| API Gateway to Lambda | AWS-managed internal transport | Not user-visible; still within AWS boundary |
| Lambda to KMS | TLS + private VPC endpoint + SigV4 | No public internet path |
| Lambda to DynamoDB/S3 | TLS + gateway endpoint | Lower latency and smaller exposure surface |
| Lambda to internal platform API | TLS + mTLS if platform-owned service | Needed only where we control both endpoints |
| OpenSearch client to domain | HTTPS enforced, TLS policy pinned | VPC-only endpoint preferred |

Transit encryption by itself is not enough, but it matters for two reasons:

  • it prevents easy interception of payloads in motion
  • private endpoints reduce the reachable network surface even when TLS is already present

4. IAM and Key Policy Pattern

The most important access-control design is not "allow app role to use KMS." It is "allow only the exact runtime that needs this key, under the exact context that proves the intent."

Runtime Roles

| Role | Allowed Crypto Actions | Explicitly Not Allowed |
|---|---|---|
| mangaassist-orchestrator-role | Read app secrets through Secrets Manager, use store-side encryption indirectly | Direct decrypt of PII ciphertext |
| mangaassist-pii-handler-role | GenerateDataKey, Decrypt on PII key with context restrictions | Schedule deletion, disable key, read audit key |
| mangaassist-audit-writer-role | Put audit objects, use audit key for write path | Read audit objects, decrypt PII |
| mangaassist-security-investigator-role | Break-glass decrypt on PII or audit key with MFA and incident tag | Routine application use |
| mangaassist-kms-admin-role | Manage policies, rotation, aliases | Decrypt data payloads |

Restrictive PII Key Policy Pattern

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowPIIHandlerEncryptDecrypt",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/mangaassist-pii-handler-role"
      },
      "Action": [
        "kms:GenerateDataKey",
        "kms:Decrypt",
        "kms:DescribeKey"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "kms:EncryptionContext:app": "mangaassist",
          "kms:EncryptionContext:classification": "pii",
          "kms:EncryptionContext:purpose": "chat_storage"
        }
      }
    },
    {
      "Sid": "AllowBreakGlassDecryptWithMFAAndIncidentTag",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/mangaassist-security-investigator-role"
      },
      "Action": [
        "kms:Decrypt",
        "kms:DescribeKey"
      ],
      "Resource": "*",
      "Condition": {
        "Bool": {
          "aws:MultiFactorAuthPresent": "true"
        },
        "StringEquals": {
          "aws:PrincipalTag/incident_approved": "true"
        }
      }
    },
    {
      "Sid": "AllowKMSAdminsToManageButNotDecrypt",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/mangaassist-kms-admin-role"
      },
      "Action": [
        "kms:CreateAlias",
        "kms:UpdateAlias",
        "kms:EnableKeyRotation",
        "kms:PutKeyPolicy",
        "kms:TagResource",
        "kms:UntagResource",
        "kms:DescribeKey"
      ],
      "Resource": "*"
    },
    {
      "Sid": "DenyDisableOrDeleteWithoutDedicatedWorkflow",
      "Effect": "Deny",
      "Principal": "*",
      "Action": [
        "kms:DisableKey",
        "kms:ScheduleKeyDeletion"
      ],
      "Resource": "*"
    }
  ]
}

Why this matters:

  • the app runtime gets only the minimum data-plane actions
  • break-glass access becomes explicit and reviewable
  • key administration does not automatically imply data visibility

Performance and Hot-Path Tuning

Encryption is only a good design if it keeps the product usable. The hot path changed after field-level PII encryption was introduced.

flowchart LR
    subgraph Before["Before optimization"]
        B1[Input] --> B2[PII detect]
        B2 --> B3[Encrypt PII inline]
        B3 --> B4[Guardrails]
        B4 --> B5[Persist]
        B5 --> B6[Respond]
    end

    subgraph After["After optimization"]
        A1[Input] --> A2[PII detect]
        A2 --> A3[Guardrails]
        A3 --> A4[Respond]
        A4 --> A5[Async encrypt and persist]
    end

What Changed

| Stage | Before | After | Why It Improved |
|---|---|---|---|
| PII field encryption | Direct inline work for each sensitive write | Local AES-GCM with cached DEK | Fewer KMS calls |
| Persistence timing | Synchronous for all writes | Async for audit-heavy writes | Response no longer waits on long-tail persistence |
| KMS dependency | High fan-out | Controlled and cached | Lower throttle sensitivity |

The implementation rule is simple: encrypt before persistence, but not necessarily before the user gets a response if the persistence path itself can be made asynchronous and reliable.
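
The respond-first, persist-asynchronously split can be sketched with an in-process queue. `AsyncPersister` is a hypothetical stand-in for the real encrypt-and-write path, which in production would also need retry and dead-letter handling:

```python
import queue
import threading

class AsyncPersister:
    """Accepts records immediately; a worker encrypts and persists later."""

    def __init__(self, persist_fn):
        self._q: queue.Queue = queue.Queue()
        self._persist = persist_fn          # stands in for encrypt + DynamoDB write
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            record = self._q.get()
            self._persist(record)
            self._q.task_done()

    def submit(self, record: dict) -> None:
        self._q.put(record)                 # returns immediately; response not blocked

    def drain(self) -> None:
        self._q.join()                      # used in tests and shutdown paths

written = []
p = AsyncPersister(written.append)
p.submit({"session": "abc", "payload": "field ciphertext"})
p.drain()
assert written == [{"session": "abc", "payload": "field ciphertext"}]
```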


Rotation, Revocation, and Lifecycle

Lifecycle View

stateDiagram-v2
    [*] --> Created
    Created --> Reviewed: policy and ownership validated
    Reviewed --> Enabled
    Enabled --> Rotating: automatic rotation or alias migration
    Rotating --> Enabled
    Enabled --> Restricted: incident containment
    Restricted --> Enabled: risk cleared
    Enabled --> Retired: no new encrypts, old decrypt only
    Retired --> PendingDeletion: all ciphertext migrated or expired
    PendingDeletion --> [*]

Three Different "Rotation" Operations

| Operation | What Changes | Key ID Changes | Requires Re-encryption | Typical Use |
|---|---|---|---|---|
| Automatic KMS rotation | Backing key material | No | No | Routine annual rotation |
| Alias swap to new CMK | Underlying CMK behind alias | Yes | Eventually, for old ciphertext if key retirement is required | Policy issue, ownership change, containment |
| Full re-encryption | Ciphertext rewritten with new DEKs and key | Usually yes | Yes | Confirmed compromise or migration |

This distinction is one of the most common interview follow-ups. A lot of weak answers treat all three as the same thing. They are not.

Scenario 2: Planned Annual Rotation Without Downtime

Routine rotation is intentionally boring.

sequenceDiagram
    participant Sec as Security Platform
    participant KMS
    participant App as Application
    participant Data as Existing Ciphertext

    Sec->>KMS: Enable automatic rotation on CMK
    KMS-->>Sec: Rotation schedule active
    App->>KMS: Encrypt new data with same key ID
    KMS-->>App: New backing material used
    App->>KMS: Decrypt older ciphertext
    KMS->>Data: Match older backing material internally
    KMS-->>App: Plaintext returned

Operationally:

  1. Rotate in staging first and validate reads of pre-rotation ciphertext.
  2. Verify new writes still succeed with no code change.
  3. Confirm no custom code incorrectly pins a key version or key ARN that bypasses the alias.
  4. Enable automatic rotation in production.

Because the key ID does not change, the application and stores keep working. That is why routine rotation is low risk.
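
Step 3 can be partially automated with a configuration scan. This is a heuristic sketch; `pins_raw_key` is a hypothetical helper and the patterns assume the standard KMS key ID and key ARN shapes:

```python
import re

# Raw key references bypass alias-based rotation; alias references do not.
KEY_ARN_RE = re.compile(r"arn:aws:kms:[a-z0-9-]+:\d{12}:key/[0-9a-f-]{36}")
KEY_ID_RE = re.compile(r"[0-9a-f-]{36}")

def pins_raw_key(config_value: str) -> bool:
    if config_value.startswith("alias/") or ":alias/" in config_value:
        return False
    return bool(KEY_ARN_RE.search(config_value)) or bool(KEY_ID_RE.fullmatch(config_value))

assert not pins_raw_key("alias/mangaassist-pii")
assert pins_raw_key("arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab")
assert pins_raw_key("1234abcd-12ab-34cd-56ef-1234567890ab")
```

Running a check like this against deployed configuration before enabling rotation catches the one integration mistake that makes "boring" rotation exciting.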

Scenario 3: Emergency Alias Swap After Suspected Key Misuse

Emergency rotation is not the same as annual rotation.

flowchart TD
    Alert[Suspicious decrypt signal] --> Triage{False positive?}
    Triage -->|Yes| Close[Close incident]
    Triage -->|No| NewKey[Create new CMK]
    NewKey --> Alias[Move alias to new CMK for future encrypts]
    Alias --> Contain[Restrict old key to decrypt-only path]
    Contain --> Migrate[Background re-encrypt high-risk records]
    Migrate --> Validate[Validate no active writers use old key]
    Validate --> Retire[Retire old key after evidence window]

Important nuance:

  • new writes start using the new CMK immediately after the alias move
  • old ciphertext still needs the old key to decrypt until migration finishes
  • deleting or disabling the old key too early turns the incident into a self-inflicted outage
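
The alias move and its consequence for old ciphertext can be modeled with a toy key ring. This simulation only illustrates the bookkeeping, not real KMS behavior:

```python
class KeyRing:
    """Toy model: encrypt resolves through the alias, while decrypting
    existing ciphertext still requires the exact key that produced it."""

    def __init__(self):
        self.aliases = {"alias/mangaassist-pii": "key-old"}

    def encrypt(self, alias: str, payload: str) -> dict:
        return {"key_id": self.aliases[alias], "ciphertext": f"<{payload}>"}

    def swap_alias(self, alias: str, new_key_id: str) -> None:
        self.aliases[alias] = new_key_id   # future encrypts move immediately

ring = KeyRing()
old_blob = ring.encrypt("alias/mangaassist-pii", "address")
ring.swap_alias("alias/mangaassist-pii", "key-new")
new_blob = ring.encrypt("alias/mangaassist-pii", "address")

assert old_blob["key_id"] == "key-old"  # still needs the old key until migrated
assert new_blob["key_id"] == "key-new"  # new writes contained under the new CMK
```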

Scenario 4: Cached DEK Exposure in a Warm Runtime

This is a more realistic threat than "KMS master key stolen." The likely problem is a compromised process or role, not KMS itself.

```mermaid
flowchart LR
    Detect[GuardDuty / EDR / abnormal behavior] --> Freeze[Set reserved concurrency to 0 or drain workers]
    Freeze --> Revoke[Revoke role path or isolate function version]
    Revoke --> Flush[Flush in-memory DEK caches by replacing runtimes]
    Flush --> Harden[Set DEK cache TTL to 0 temporarily]
    Harden --> Assess[Estimate records encrypted during exposure window]
    Assess --> Reencrypt[Re-encrypt affected records if needed]
    Reencrypt --> Restore[Restore normal traffic]
```

Key insight: automatic CMK rotation does not help if the leaked asset is a plaintext DEK already in memory. The right response is runtime containment plus selective re-encryption of data written under the exposed DEK window.
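A minimal sketch of the cache policy this scenario relies on, under the assumption of a per-runtime cache that caps both age and reuse (this is an illustration of the policy shape, not the AWS Encryption SDK's caching API). TTL zero is the incident-mode "always fetch fresh" setting referenced in the flow above.

```python
import secrets
import time

class DekCache:
    """Per-runtime DEK cache sketch: never shared, never persisted,
    bounded by both TTL and use count (names and defaults are assumptions)."""

    def __init__(self, ttl_seconds=300, max_uses=100, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.max_uses = max_uses
        self.clock = clock
        self._dek = None
        self._born = 0.0
        self._uses = 0

    def _fresh(self):
        # Stands in for a KMS GenerateDataKey round trip.
        self._dek = secrets.token_bytes(32)
        self._born = self.clock()
        self._uses = 0

    def get(self):
        expired = (
            self._dek is None
            or self.ttl == 0                          # incident mode: no caching
            or self.clock() - self._born >= self.ttl  # age cap
            or self._uses >= self.max_uses            # reuse cap
        )
        if expired:
            self._fresh()
        self._uses += 1
        return self._dek
```

The exposure-window estimate in the assess step falls directly out of these bounds: a leaked DEK can only have protected records written within one TTL and one use budget.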


Disaster Recovery and Restore Design

Encryption adds a second restore dependency: you need the data and the ability to decrypt it in the recovery environment.

DR Dataflow

```mermaid
sequenceDiagram
    participant Primary as Primary Region
    participant PKMS as Primary Keys
    participant Replica as DR Region Bucket
    participant DKMS as DR Keys
    participant Restore as Restore Workflow

    Primary->>Replica: Replicate audit object or backup
    PKMS-->>Primary: Source encryption path
    Replica->>DKMS: Encrypt replica with destination-region key
    Restore->>DKMS: Validate decrypt grants
    Restore->>Replica: Read backup
    Replica-->>Restore: Encrypted object
    Restore->>DKMS: Decrypt in DR environment
    DKMS-->>Restore: Plaintext for controlled restore flow
```

Recommended DR rules:

  1. Do not assume copied data is useful until decryption and alias mapping are tested.
  2. Keep restore automation in a separate account or pipeline from the normal app path.
  3. Test a real restore at least quarterly, including key grants and break-glass approvals.
  4. Record which ciphertext classes stay decryptable in DR and which are intentionally region-bound.
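Rule 1 above can be reduced to a single predicate: a replicated object only counts as a backup once sample objects actually decrypt in the DR environment. The callables here are caller-supplied stand-ins for the replica read path and the DR-region decrypt path (assumptions, not a real restore API).

```python
def backup_is_recoverable(replica_read, dr_decrypt, sample_keys):
    """DR rule 1 as code: copied data is not a backup until it decrypts.
    replica_read(key) -> ciphertext; dr_decrypt(ciphertext) -> plaintext."""
    for key in sample_keys:
        try:
            ciphertext = replica_read(key)
            dr_decrypt(ciphertext)
        except Exception:
            return False   # replicated but not decryptable: restore would fail
    return True
```

Running this check in the quarterly restore drill catches the classic bug where replication works but the restore role was never granted decrypt on the DR key.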

If the system later requires active-active multi-region chat traffic, this design can evolve toward multi-Region keys for specific replicated datasets. Until then, active-passive is simpler and easier to reason about.


Searching and Analytics Without Broad Decryption

A common mistake is adding strong encryption in storage and then undoing it in analytics by decrypting entire tables into a data lake.

Safer Patterns

| Need | Pattern | What We Avoid |
| --- | --- | --- |
| Count PII detections | Use guardrail metadata counters | Decrypting raw conversations for reporting |
| Find a record by exact email | Store HMAC blind index using separate lookup key | Indexing plaintext email in OpenSearch |
| Investigate a single customer issue | Break-glass decrypt of just that record path | Exporting large decrypted datasets |
| Train operational metrics | Use redacted or tokenized fields | Copying full sensitive payloads into analytics |

Principle: analytics should operate on metadata, aggregates, tokens, or blind indexes. Decryption is for support or investigation workflows, not for routine dashboards.
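The blind-index pattern in the table is small enough to show in full. This sketch assumes the lookup key is a separate secret from any encryption key, and the lowercase/strip normalization is an assumption about the schema, not a stated requirement.

```python
import hashlib
import hmac

def blind_index(value: str, lookup_key: bytes) -> str:
    """HMAC-SHA256 blind index over a normalized value.
    Stored alongside the encrypted field; supports exact-match lookup only."""
    normalized = value.strip().lower()
    return hmac.new(lookup_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
```

To find a record by email, the support path recomputes `blind_index(email, lookup_key)` and queries on the digest; the index never contains plaintext, and without the lookup key the digests are not linkable to inputs.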


Monitoring, Audit, and Detection

Encryption without observability creates a false sense of safety. We monitor both success-path health and misuse-path signals.

| Signal | Source | Why It Matters | Trigger Example |
| --- | --- | --- | --- |
| kms:Decrypt AccessDenied on PII key | CloudTrail | Someone tried to read what they should not read | Any principal other than PII handler or break-glass role |
| Sudden spike in GenerateDataKey calls | CloudTrail / CloudWatch | Cache disabled, rollout bug, or abuse | 5x baseline for 10 minutes |
| KMS throttling | CloudWatch metrics | Hot path latency and request failures | P95 KMS latency above threshold |
| DisableKey or alias update event | CloudTrail | High-risk control-plane change | Any change outside approved pipeline window |
| Unencrypted or wrong-key S3 writes | S3 + CloudTrail | Audit evidence control drift | PutObject without expected SSE-KMS headers |
| Secret rotation failure | Secrets Manager | Aging secrets or broken clients | Rotation step fails twice |
| DEK cache miss ratio spike | App metrics | Performance regression or forced incident mode | Cache miss ratio doubles unexpectedly |

Example Detection Pipeline

```mermaid
flowchart LR
    CT[CloudTrail KMS events] --> EB[EventBridge rules]
    EB --> Lambda[Security detection Lambda]
    Lambda --> Pager[Pager / ticket]
    Lambda --> SIEM[SIEM timeline]
    AppMetrics[Application metrics] --> Alarm[CloudWatch alarm]
    Alarm --> Pager
```

Two rules matter a lot in practice:

  • denied decrypt attempts are alerts, not "all good" signals
  • alias changes are treated as security events even when they are legitimate, because they change the meaning of future encrypt operations
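Those two rules are simple enough to express as the core of the detection Lambda. The event names follow CloudTrail's KMS record format; the error-code string and the severity labels are assumptions for illustration.

```python
def classify_kms_event(event: dict) -> str:
    """Sketch of the two rules above over CloudTrail-shaped dicts.
    Returns 'alert', 'security-event', or 'ok'."""
    name = event.get("eventName", "")
    # Rule 1: denied decrypts are alerts, not "all good" signals.
    if name == "Decrypt" and "AccessDenied" in event.get("errorCode", ""):
        return "alert"
    # Rule 2: alias and control-plane changes are security events even
    # when legitimate, because they change the meaning of future encrypts.
    if name in ("UpdateAlias", "DisableKey", "ScheduleKeyDeletion"):
        return "security-event"
    return "ok"
```

A real pipeline would enrich these with principal, deployment version, and pipeline-window context before paging, but the classification boundary is the part reviewers probe.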

Scenario 5: Investigating Unauthorized Decrypt Attempts

This is the operational scenario interviewers often ask for after you explain the architecture.

Incident Flow

```mermaid
flowchart TD
    Event[CloudTrail event: denied kms:Decrypt on PII key] --> Alert[Security alert created]
    Alert --> Identify[Identify principal and deployment version]
    Identify --> Inspect[Inspect code path and recent change]
    Inspect --> Decision{Benign bug or malicious behavior?}
    Decision -->|Bug| Fix[Remove decrypt call and redesign data access]
    Decision -->|Malicious or unclear| Contain[Disable path, isolate role, preserve evidence]
    Fix --> Guardrail[Add detection rule and review guardrails]
    Contain --> Guardrail
    Guardrail --> Close[Close with post-incident actions]
```

Example Investigation Narrative

  1. CloudTrail shows a denied kms:Decrypt attempt against alias/mangaassist-pii.
  2. The caller is an analytics batch role, which should never decrypt raw PII.
  3. Recent code added a metric job that tried to inspect encrypted session payloads directly.
  4. IAM blocked it, but the attempt still reveals a broken design assumption.
  5. The correct fix is to move the metric to guardrail metadata or blind indexes rather than widening decrypt access.

The design lesson is that IAM denial is the last line of defense, not the primary detection strategy. If the system is trying to decrypt where it should not, the architecture or the code path needs correction.


Operational Runbooks

Break-Glass Decrypt

Use this path only for narrow support or security investigations:

  1. Investigator receives an approved incident or support case ID.
  2. Temporary role session is tagged with incident_approved=true.
  3. MFA is required.
  4. Decrypt is limited to the minimum records needed.
  5. Every decrypt event is correlated with ticket ID, principal, and record ID.
  6. Session expires automatically.
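The runbook's gate can be collapsed into one predicate that the decrypt path evaluates before touching KMS. The field names (`incident_approved`, `ticket_id`, `mfa`, `expires_at`) mirror the steps above, but the exact session schema is an assumption.

```python
import time

def break_glass_allowed(session: dict, now=None) -> bool:
    """All break-glass conditions must hold: approved incident tag,
    ticket reference, MFA, and an unexpired session."""
    now = time.time() if now is None else now
    tags = session.get("tags", {})
    return (
        tags.get("incident_approved") == "true"   # step 2: tagged session
        and bool(session.get("ticket_id"))        # steps 1 and 5: correlation
        and session.get("mfa") is True            # step 3
        and session.get("expires_at", 0) > now    # step 6: auto-expiry
    )
```

Step 4, minimum-record scope, lives in the query layer rather than this predicate, and step 5's audit correlation happens on every decrypt regardless of the gate's outcome.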

"KMS Unavailable" Behavior

If KMS is degraded:

  • new PII writes may fail closed if a fresh DEK is required and no cache is available
  • existing cached DEKs allow a short grace window for encrypt operations already in flight
  • decrypt-heavy support workflows should degrade before the customer chat path does
  • audit logging should queue rather than write plaintext fallback files

The system should prefer temporary reduced functionality over silently persisting plaintext.
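That fail-closed preference can be sketched as a single write path. The callables are caller-supplied stand-ins (assumptions), and the key property is structural: there is no branch in which the record is persisted unencrypted.

```python
class KmsUnavailable(Exception):
    """Raised when no DEK can be obtained, fresh or cached."""

def write_pii(record: bytes, get_dek, encrypt, queue_for_retry):
    """Fail-closed PII write sketch: if key material is unavailable,
    queue the write for retry rather than fall back to plaintext."""
    try:
        dek = get_dek()                 # cache hit, or KMS GenerateDataKey
    except KmsUnavailable:
        queue_for_retry(record)         # reduced functionality, not plaintext
        return None
    return encrypt(dek, record)         # the only path that persists anything
```

The same shape applies to audit logging during a KMS outage: the queue absorbs the degradation so no code path is tempted to write a plaintext fallback file.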

Deletion and Retention

Deletion logic needs metadata to know what must be erased:

  • primary session record deleted or TTL-expired
  • related blind indexes removed
  • audit evidence retained only if policy requires it, otherwise deleted on schedule
  • secrets rotated and old versions retired
  • backups handled through retention windows rather than piecemeal mutation

Encryption is not a substitute for deletion, but it can reduce exposure during retention windows.


Testing Strategy

  1. Unit tests verify that decrypt fails when encryption context or AAD does not match.
  2. Integration tests verify DynamoDB, S3, and log groups reject writes with the wrong key policy.
  3. Rotation tests verify old ciphertext remains readable after routine KMS rotation.
  4. Incident drills simulate unauthorized decrypt attempts and validate alerting, containment, and audit evidence.
  5. DR tests restore encrypted backups into a recovery environment and validate actual decrypt permissions.
  6. Performance tests compare P95 latency with cache enabled, cache disabled, and forced cache TTL zero.

Good encryption design is operational only when these tests are routine, not theoretical.
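Test 1 above, decrypt must fail on a context mismatch, is worth showing concretely. This is a stdlib-only encrypt-then-MAC toy standing in for AES-GCM with AAD; it is deliberately not production cryptography, but it demonstrates the property under test: the tag binds ciphertext and context together, so changing the context refuses the decrypt.

```python
import hashlib
import hmac
import secrets

def seal(key: bytes, plaintext: bytes, aad: bytes) -> dict:
    """Toy AEAD stand-in (illustration only): XOR keystream + HMAC tag
    computed over nonce, ciphertext, AND the associated data."""
    nonce = secrets.token_bytes(12)
    ks = hashlib.sha256(key + nonce).digest() * (len(plaintext) // 32 + 1)
    ct = bytes(a ^ b for a, b in zip(plaintext, ks))
    tag = hmac.new(key, nonce + ct + aad, hashlib.sha256).digest()
    return {"nonce": nonce, "ct": ct, "tag": tag}

def open_sealed(key: bytes, blob: dict, aad: bytes) -> bytes:
    """Refuses to decrypt unless the caller supplies the same AAD."""
    expected = hmac.new(key, blob["nonce"] + blob["ct"] + aad, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, blob["tag"]):
        raise ValueError("context mismatch: decrypt refused")
    ks = hashlib.sha256(key + blob["nonce"]).digest() * (len(blob["ct"]) // 32 + 1)
    return bytes(a ^ b for a, b in zip(blob["ct"], ks))
```

The unit test then asserts that sealing under `customer_id=123` and opening under any other context raises, which is the same assertion shape used against the real AES-GCM path.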


Architecture Decisions and Tradeoffs

| Decision | What We Chose | Alternative | Upside | Downside |
| --- | --- | --- | --- | --- |
| Key separation | Separate app, PII, audit, and secrets keys | One shared CMK | Smaller blast radius, better least privilege | More policy and lifecycle management |
| PII encryption style | Envelope encryption with AES-GCM and cached DEKs | Direct KMS per field | Lower latency and cost | In-memory DEKs require careful containment |
| Store encryption | SSE-KMS everywhere | Service-owned default encryption | Better auditability and key control | More operational overhead |
| Audit isolation | Separate audit key and S3 object lock | Reuse app key and mutable logs | Stronger forensic integrity | Harder access workflow |
| Search on sensitive IDs | Blind indexes only | Plaintext indexing | Reduces search-side leakage | Supports exact match only, not full-text |
| DR strategy | Active-passive with tested key mapping | Active-active from day one | Simpler operations | Slower failover and more manual preparation |

Follow-Up Questions and Deep-Dive Answers

Question 1: Why not use one CMK for everything? It is simpler.

Deep-dive answer: Simpler key inventory creates a more dangerous trust model. If the same CMK protects sessions, PII, audit logs, and secrets, then any policy error, overly broad grant, or investigative decrypt path becomes a cross-system exposure event. Separate keys let us align access with business purpose: app runtimes need app data, PII handlers need sensitive fields, investigators need audit evidence, and secrets access should flow through Secrets Manager. The extra key count is operational overhead, but it buys much tighter blast-radius control.

Question 2: Why not call KMS directly for every encrypt and decrypt?

Deep-dive answer: Because KMS is a trust anchor, not a per-field data plane. Direct KMS encryption increases latency, cost, and throttle sensitivity, especially in a chat product with many short fields. Envelope encryption keeps KMS in the key-distribution role while local AES-GCM handles bulk work efficiently. The control point stays strong because the DEK is still protected by KMS and bound to an encryption context. We accept limited in-memory DEK exposure in exchange for a practical hot path.

Question 3: What is the real risk of caching data keys in memory?

Deep-dive answer: The risk is not hypothetical. If a runtime is compromised, the attacker may extract a plaintext DEK from memory and decrypt data written under that DEK. That is why cache scope, TTL, and use count matter. We keep the cache per runtime instance, never share it, never persist it, and cap both age and reuse. During an incident we can drain workers and set TTL to zero. The design accepts a bounded local risk to avoid an unbounded availability and latency problem.

Question 4: How do you rotate keys without rewriting all old data?

Deep-dive answer: Routine KMS rotation does not change the key ID, so KMS retains older backing material and continues to decrypt old ciphertext transparently. No rewrite is needed. A rewrite becomes necessary only when we intentionally move to a brand-new CMK and want to retire the old one. That usually happens for policy changes, account moves, or incident containment, not for routine annual rotation.

Question 5: What if an analytics team wants to report on encrypted PII fields?

Deep-dive answer: The correct first answer is usually "they should not need raw PII." Analytics should consume metadata, counters, tokenized values, or blind indexes. If the ask is legitimate support lookup, we use a narrow decrypt workflow over a minimal dataset. If the ask is a dashboard, we redesign the data product. Pulling decrypted PII into an analytics environment defeats most of the point of encrypting it in the first place.

Question 6: How do you search encrypted data if a customer needs account help?

Deep-dive answer: We avoid full-text search on raw sensitive fields. For exact-match lookup cases like email or order reference, we store a blind index such as HMAC-SHA256 using a separate lookup key. That gives deterministic equality search without exposing plaintext in the index. Once a match is found, the support path can request controlled decrypt of the specific record if policy allows it. We trade search flexibility for a much tighter privacy posture.

Question 7: How do you stop key administrators from becoming data readers?

Deep-dive answer: By separating key administration from decrypt rights in both IAM and KMS policy. The KMS admin role can manage aliasing, tagging, and rotation settings, but it does not get kms:Decrypt. Break-glass investigation is a different role with MFA, ticket tagging, and short sessions. This is one of the clearest places to show mature separation of duties in a design review.

Question 8: What is your response if you see unauthorized decrypt attempts on the PII key?

Deep-dive answer: First confirm whether the attempts were denied or successful, but treat both as incidents. If denied, investigate the caller, deployment, and code path because something tried to cross a boundary it should not cross. If successful and unexpected, contain immediately: revoke or isolate the role, stop new traffic if needed, preserve logs, assess what records were accessible, and decide whether alias swap or re-encryption is required. The biggest mistake is dismissing denied attempts as harmless noise.

Question 9: What breaks first if KMS is slow or throttled?

Deep-dive answer: The first symptom is usually latency on workflows that need fresh DEKs or secret retrieval. With a healthy cache, the chat path can often tolerate a brief KMS degradation. Without cache, the system amplifies the problem into a customer-facing incident. That is why we measure KMS latency, DEK cache miss ratio, and secret cache miss ratio together. It is also why fail-closed logic needs queueing or retry behavior instead of plaintext fallback writes.

Question 10: How do backups and DR interact with encrypted data?

Deep-dive answer: Backups are only recoverable if the recovery environment can decrypt them. The DR design therefore includes destination-region keys, alias mapping, restore-role grants, and periodic restore tests. A common operational bug is to replicate data but forget to validate the decrypt path until a disaster. The DR story is incomplete unless you have demonstrated end-to-end restore of encrypted data in a separate environment.

Question 11: What would you change at 10x scale?

Deep-dive answer: I would not start by weakening encryption. I would first reduce avoidable decrypts, push more analytics to metadata, tune DEK cache policy carefully, add request batching where it preserves isolation, and consider broader use of async persistence. If the system becomes multi-region active-active, I would revisit which datasets need multi-Region keys and which should remain region-scoped. Scale pressure should push architectural clarity, not cryptographic shortcuts.

Question 12: What is the single most important implementation detail people forget?

Deep-dive answer: They forget that encryption context, alias policy, restore grants, and operational telemetry are part of the design, not afterthoughts. Many diagrams stop at "store uses KMS." Real systems fail on the edges: wrong role gets decrypt, old key disabled too soon, DR restore lacks grants, or a cache bug floods KMS. The implementation details around lifecycle and monitoring are where mature answers separate from surface-level ones.


Key Lessons

  1. Encryption architecture should mirror data sensitivity and access patterns, not just compliance labels.
  2. KMS-backed store encryption is necessary, but it does not replace field-level protection for truly sensitive data.
  3. Routine rotation, emergency alias swap, and full re-encryption are different operations with different risks.
  4. The practical risk surface is usually role misuse, logging mistakes, and restore failure, not broken AES.
  5. If the design cannot explain analytics, DR, and break-glass access, it is not a complete encryption design.

Cross-References