
8. Encryption and Key Management

Encryption in MangaAssist is not a checkbox feature. It is the control set that limits blast radius when the application is wrong, when a role is over-permissioned, when logs are copied, when backups are restored, or when a downstream service gets compromised. The hard part is rarely "did we turn on AES-256?" The hard part is deciding where encryption belongs in the dataflow, which key should protect which asset, who is allowed to decrypt, how rotation happens without downtime, and how misuse gets detected before it becomes a breach.

This document goes deeper than the baseline architecture and answers seven engineering questions:

  1. What exactly is encrypted in MangaAssist, and which key protects each data class?
  2. How does the runtime use KMS without putting a network round trip on every field operation?
  3. What does the high-level design look like, including trust boundaries and control-plane ownership?
  4. What does the low-level implementation look like for envelope encryption, secrets retrieval, storage, and transit protection?
  5. How do planned rotation, emergency rotation, break-glass access, and disaster recovery actually work?
  6. What failure modes matter most in practice, and how do we monitor for them?
  7. What follow-up questions should you expect in a design review or interview, and what are strong deep-dive answers?

Why This Matters for MangaAssist

MangaAssist handles several different kinds of data, and they should not all be protected the same way:

| Data Class | Examples | Risk if Exposed | Encryption Pattern |
|---|---|---|---|
| Public product data | ASIN, title, author, format | Low direct privacy risk, moderate business integrity risk | Transport encryption only; no field-level encryption |
| Internal operational data | Session IDs, rate-limit counters, feature flags | Service abuse, internal recon | TLS + table/object encryption |
| Sensitive customer context | Name, email, shipping address, phone, order references | Direct privacy breach, fraud, legal exposure | Field-level envelope encryption + store-level encryption |
| Audit evidence | Prompt/response metadata, guardrail decisions, incident evidence | Tampering risk, forensic loss | Separate audit CMK + object lock + immutable retention |
| Secrets | API credentials, service auth material, webhook secrets | Credential theft, lateral movement | Secrets Manager or Parameter Store with dedicated CMK |

The architecture therefore has to answer three different questions at once:

  • How do we keep data encrypted at rest and in transit?
  • How do we stop broad decryption rights from spreading across the system?
  • How do we preserve availability when rotation or incident response happens under load?

Design Goals

  1. Use different keys for different trust zones so one policy mistake does not expose every store.
  2. Keep the synchronous chat path fast enough for a responsive experience.
  3. Make decryption rare, explicit, logged, and attributable to a narrow runtime role.
  4. Separate key administration from data access so security admins do not automatically become data readers.
  5. Support routine rotation with no downtime and emergency containment with a clear blast-radius model.
  6. Ensure restore, replication, analytics, and deletion workflows remain possible after encryption is added.

Core Cryptography Concepts

| Term | Meaning | Why It Matters Here |
|---|---|---|
| CMK / KEK | Customer-managed KMS key that protects other keys or service-side encryption | Defines access policy, audit trail, and blast radius |
| DEK | Data encryption key generated for actual payload encryption | Used locally for fast AES-GCM operations |
| Envelope encryption | KMS protects the DEK; the application uses the DEK for data | Avoids a KMS call for every payload block |
| Encryption context | Non-secret metadata bound to a KMS operation | Prevents a ciphertext from being reused in the wrong workflow |
| Alias | Stable name that points to a KMS key | Lets applications stay constant while key backing changes |
| Grant | Narrow delegated permission for a service to use a KMS key | Useful for managed services and controlled automation |
| Bucket key | S3 optimization that reduces repeated KMS calls | Important for high-volume audit logging |

Two design rules drive the rest of the document:

  • KMS is the root of trust, not the bulk encryption engine.
  • The application should only hold plaintext DEKs in memory, for a short time, in a narrow runtime.

Threat Model and Trust Boundaries

Main Failure Modes

| Failure Mode | Example | Impact | First Control |
|---|---|---|---|
| Over-broad decryption rights | Analytics role gets kms:Decrypt on PII key | Silent privacy breach | Separate PII CMK + role isolation |
| KMS on hot path everywhere | One KMS call per PII field | Latency spikes, throttling, cost growth | Envelope encryption + cache limits |
| Logs store sensitive data before redaction | Application logs raw address or email | Wide operational exposure | Pre-log redaction + log group KMS |
| Backup restore misses key permissions | Data restores but app cannot decrypt | Operational outage during DR | Restore playbook includes KMS grants validation |
| Key compromise response is too slow | Suspicious decrypts continue for hours | Larger blast radius | CloudTrail alerting + role containment runbook |
| Cached DEK leaked from runtime memory | Compromised warm container | Limited plaintext exposure | Per-instance cache, TTL, usage caps, rapid drain |
| Key admin can also read data | Same team manages keys and app roles | Separation-of-duties failure | Distinct admin and decrypt roles |

Trust Boundary View

flowchart TB
    subgraph Untrusted["Untrusted / User-Controlled"]
        User[User message]
        Browser[Web or mobile client]
        Hist[Conversation text]
    end

    subgraph App["Application Decision Layer"]
        Gateway[API Gateway]
        Orch[Chat orchestrator]
        Guard[Guardrails and privacy policy]
        PII[PII encryption handler]
        SecretCache[Secret cache]
    end

    subgraph Crypto["Cryptographic Control Plane"]
        KMS[AWS KMS CMKs]
        SM[Secrets Manager]
        IAM[IAM roles and key policies]
        CT[CloudTrail]
    end

    subgraph Stores["Persistent Stores"]
        DDB[DynamoDB session store]
        S3[S3 audit archive]
        OS[OpenSearch index]
        CW[CloudWatch logs]
    end

    subgraph SecOps["Security Operations"]
        Alerts[EventBridge and alerts]
        SIEM[Security analytics]
    end

    Browser --> Gateway
    Hist --> Orch
    User --> Gateway
    Gateway --> Orch
    Orch --> Guard
    Guard --> PII
    Orch --> SecretCache
    SecretCache --> SM
    PII --> KMS
    SM --> KMS
    DDB --> KMS
    S3 --> KMS
    OS --> KMS
    CW --> KMS
    IAM --> KMS
    CT --> Alerts
    Alerts --> SIEM

Key boundary rule: raw sensitive fields may exist ephemerally in memory when needed to serve the request, but persistence boundaries should receive either redacted values or encrypted fields plus enough metadata to decrypt them later in a controlled path.


Control Matrix

| Asset | Store | At-Rest Control | Field-Level Control | In-Transit Control | Decrypting Runtime | Primary Key |
|---|---|---|---|---|---|---|
| Session metadata | DynamoDB | SSE-KMS | None | TLS over VPC endpoint | Chat orchestrator | alias/mangaassist-app |
| Sensitive chat fields | DynamoDB | SSE-KMS | Envelope encryption with AES-256-GCM | TLS over VPC endpoint | PII handler only | alias/mangaassist-pii |
| Audit evidence | S3 + CloudWatch | SSE-KMS + object lock | Optional field tokenization only | TLS + VPC endpoint | Security investigator role | alias/mangaassist-audit |
| Search index fragments | OpenSearch | SSE-KMS + node-to-node encryption | Do not index raw PII | HTTPS enforced | Search service | alias/mangaassist-app |
| Secrets | Secrets Manager / SSM | SSE-KMS | N/A | TLS + SigV4 | Service runtime through secrets API | alias/mangaassist-secrets |
| Cache entries | ElastiCache | At-rest encryption | No raw PII cache | In-transit encryption | Private app subnets only | alias/mangaassist-app |

Two important consequences fall out of this table:

  • Some data is protected twice: once by the store and once at the field level.
  • Search and analytics are intentionally designed to avoid decrypting raw PII whenever possible.

High-Level Design (HLD)

System Overview

graph TB
    subgraph Client["Client and Edge"]
        U[User]
        FE[Web or mobile client]
        GW[API Gateway]
    end

    subgraph Runtime["MangaAssist Runtime"]
        AUTH[Auth and session]
        ORCH[Chat orchestrator]
        GRD[Guardrails and privacy policy]
        PII[PII encryption service]
        AUD[Async audit writer]
    end

    subgraph AI["Model and Service Layer"]
        LLM[Bedrock model]
        CAT[Catalog and order services]
        REC[Recommendation services]
    end

    subgraph Data["Persistent Stores"]
        DDB[DynamoDB sessions]
        S3[S3 audit archive]
        OS[OpenSearch]
        CW[CloudWatch logs]
        SEC[Secrets Manager]
    end

    subgraph Control["Crypto and Security Control Plane"]
        KMS[KMS key ring]
        IAM[IAM and key policies]
        CT[CloudTrail]
        EVT[EventBridge alerts]
    end

    subgraph DR["Recovery"]
        DRS3[Replica audit bucket]
        DRKMS[DR region keys]
    end

    U --> FE --> GW
    GW --> AUTH
    GW --> ORCH
    AUTH --> ORCH
    ORCH --> GRD
    GRD --> LLM
    ORCH --> CAT
    ORCH --> REC
    ORCH --> PII
    ORCH --> DDB
    ORCH --> OS
    ORCH --> AUD
    AUD --> S3
    ORCH --> CW
    ORCH --> SEC
    PII --> KMS
    DDB --> KMS
    S3 --> KMS
    OS --> KMS
    CW --> KMS
    SEC --> KMS
    IAM --> KMS
    CT --> EVT
    S3 --> DRS3
    DRS3 --> DRKMS

HLD Principles

  1. The data plane and key control plane are separate. The app can request cryptographic operations, but policy authority remains in KMS and IAM.
  2. PII decryption is concentrated in one narrow runtime path instead of being spread across the orchestrator, analytics, indexing, and logging pipelines.
  3. Managed store encryption is always enabled, but it is not treated as a substitute for field-level protection of high-sensitivity data.
  4. Aliases give stable integration points for applications, while underlying keys can rotate or be replaced.
  5. Audit evidence uses a different key from the chat session store because forensic access patterns and retention rules are different.
  6. Disaster recovery planning includes keys, grants, aliases, and restore scripts, not just data copies.

Key Hierarchy

| Key | Alias | Protected Assets | Rotation Style | Who Can Administer | Who Can Decrypt |
|---|---|---|---|---|---|
| App key | alias/mangaassist-app | DynamoDB table SSE, OpenSearch, cache, low-sensitivity app data | KMS automatic annual rotation | Security platform | Managed services and orchestrator runtime where needed |
| PII key | alias/mangaassist-pii | Field-level encrypted names, emails, addresses, phone numbers, order references | KMS automatic annual rotation; emergency alias swap if incident | Security platform only | PII handler runtime, break-glass investigator role |
| Audit key | alias/mangaassist-audit | S3 audit archive, immutable evidence, security log groups | KMS automatic annual rotation; cross-region replica pairing | Security platform | Audit reader or incident responder only |
| Secrets key | alias/mangaassist-secrets | Secrets Manager, secure parameters, rotation metadata | KMS automatic annual rotation | Security platform | Secrets Manager service path and approved runtimes |

The major design choice is that alias/mangaassist-pii is not reused for logs, search, or secrets. That keeps the highest-risk decrypt path isolated.


Scenario 1: Normal Request Dataflow

The standard request path matters because encryption design is only useful if it fits the latency budget.

sequenceDiagram
    participant User
    participant Gateway
    participant Orch as Orchestrator
    participant Guard as Guardrails
    participant Secrets as Secrets Manager
    participant KMS
    participant PII as PII Encryptor
    participant DDB
    participant Audit as Audit Writer
    participant S3

    User->>Gateway: Chat message
    Gateway->>Orch: Authenticated request
    Orch->>Secrets: Get downstream credentials if cache miss
    Secrets->>KMS: Decrypt secret version
    KMS-->>Secrets: Plaintext secret in service memory
    Secrets-->>Orch: Secret value
    Orch->>Guard: Build prompt with approved fields only
    Guard-->>Orch: Safe model request
    Orch->>PII: Encrypt fields that may persist
    PII->>KMS: GenerateDataKey(classification=pii, purpose=chat_storage)
    KMS-->>PII: Plaintext DEK + encrypted DEK
    PII-->>Orch: Ciphertext payload + metadata
    Orch->>DDB: Write session item with field ciphertext
    Orch-->>Gateway: User response
    Gateway-->>User: Final response
    Orch->>Audit: Async audit event
    Audit->>S3: Store immutable audit object with SSE-KMS

Synchronous vs Asynchronous Split

| Path Segment | Relationship to the Response | Optimization |
|---|---|---|
| Secret retrieval | Needed before calling downstream APIs | In-memory secret cache to reduce repeated reads |
| Field encryption before persistence | Required before writing sensitive data | Keep local AES work in process; avoid repeated KMS calls |
| Audit archival | Not needed to answer the user | Push to async writer and S3 |
| Re-encryption migrations | Correctness work, not part of the response | Always background |
The big practical lesson is that encryption belongs close to persistence boundaries, not in the middle of every intermediate transformation stage.


Low-Level Design (LLD)

1. Envelope Encryption for PII Fields

Why We Use Envelope Encryption

Directly calling KMS to encrypt every field is a poor fit for chat workloads:

  • KMS adds network latency.
  • KMS requests are metered and can throttle.
  • KMS has request-size limits.
  • The app often needs to encrypt many small fields in a single request.

Envelope encryption fixes that by generating a DEK once and using local AES-GCM for the actual payload.
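
A rough back-of-envelope comparison makes the motivation concrete. The traffic numbers below are illustrative assumptions, not measured MangaAssist figures:

```python
# Illustrative load assumptions (not measured values).
FIELDS_PER_REQUEST = 20      # sensitive fields encrypted per chat turn
REQUESTS_PER_SECOND = 100
DEK_TTL_SECONDS = 900        # reuse window from the DEK cache policy
WARM_INSTANCES = 10          # each warm runtime holds its own cached DEK

# Direct KMS: one network round trip per field.
direct_kms_calls_per_second = FIELDS_PER_REQUEST * REQUESTS_PER_SECOND  # 2000

# Envelope encryption: roughly one GenerateDataKey per instance per TTL
# window, regardless of request volume.
cached_kms_calls_per_second = WARM_INSTANCES / DEK_TTL_SECONDS  # ~0.011
```

Even under conservative assumptions, caching shifts KMS from a per-field dependency to a background refresh, which is why the throttling and latency concerns above mostly disappear.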

LLD Flow

sequenceDiagram
    participant App
    participant Cache as DEK Cache
    participant KMS
    participant DDB

    App->>Cache: Request DEK for pii/session_turn
    alt Cache hit and usage budget remains
        Cache-->>App: Plaintext DEK + encrypted DEK
    else Cache miss or expired
        App->>KMS: GenerateDataKey(AES_256, encryption context)
        KMS-->>App: Plaintext DEK + encrypted DEK
        App->>Cache: Store DEK in memory only
    end
    App->>App: AES-GCM encrypt field with nonce + AAD
    App->>DDB: Store ciphertext + encrypted DEK + metadata

Encryption Context Contract

We bind every KMS call to an encryption context so ciphertext cannot be trivially reused in a different workflow:

| Context Key | Example | Purpose |
|---|---|---|
| app | mangaassist | Prevent cross-application confusion |
| classification | pii | Distinguish sensitive data from generic app state |
| record_type | session_turn | Limit reuse across entity types |
| purpose | chat_storage | Separate storage from incident investigation or export |
| tenant_scope | jp-store | Useful if the system expands across stores or locales |
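
A small guard at the call site can enforce this contract before any KMS request is made. This is a sketch; `validate_encryption_context` is a hypothetical helper, not part of any AWS SDK:

```python
# Contract keys from the table above; every KMS call must carry all of them.
REQUIRED_CONTEXT_KEYS = {"app", "classification", "record_type", "purpose", "tenant_scope"}

def validate_encryption_context(context: dict) -> None:
    # Reject calls that omit contract keys or belong to another application.
    missing = REQUIRED_CONTEXT_KEYS - context.keys()
    if missing:
        raise ValueError(f"encryption context missing keys: {sorted(missing)}")
    if context.get("app") != "mangaassist":
        raise ValueError("ciphertext does not belong to this application")

validate_encryption_context({
    "app": "mangaassist",
    "classification": "pii",
    "record_type": "session_turn",
    "purpose": "chat_storage",
    "tenant_scope": "jp-store",
})  # passes silently
```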

Encrypted Field Shape

{
  "alg": "AES-256-GCM",
  "key_alias": "alias/mangaassist-pii",
  "encrypted_data_key": "base64-kms-ciphertext",
  "nonce": "base64-12-byte-nonce",
  "ciphertext": "base64-aes-gcm-output",
  "aad_sha256": "hex-digest-of-associated-data",
  "encryption_context": {
    "app": "mangaassist",
    "classification": "pii",
    "record_type": "session_turn",
    "purpose": "chat_storage",
    "tenant_scope": "jp-store"
  },
  "created_at": "2026-03-24T20:15:00Z"
}

Example Implementation

import base64
import hashlib
import json
import os
import time
from dataclasses import dataclass

import boto3
from cryptography.hazmat.primitives.ciphers.aead import AESGCM


@dataclass
class CachedDataKey:
    plaintext_key: bytes
    encrypted_key: bytes
    expires_at: float
    uses_left: int


class PIIFieldEncryptor:
    def __init__(self, kms_key_id: str, ttl_seconds: int = 900, max_uses: int = 1000):
        self.kms = boto3.client("kms")
        self.kms_key_id = kms_key_id
        self.ttl_seconds = ttl_seconds
        self.max_uses = max_uses
        self._cache: dict[str, CachedDataKey] = {}

    def _context(self, tenant_scope: str, record_type: str, purpose: str) -> dict[str, str]:
        return {
            "app": "mangaassist",
            "classification": "pii",
            "tenant_scope": tenant_scope,
            "record_type": record_type,
            "purpose": purpose,
        }

    def _cache_key(self, tenant_scope: str, record_type: str, purpose: str) -> str:
        return f"{tenant_scope}:{record_type}:{purpose}"

    def _get_data_key(self, tenant_scope: str, record_type: str, purpose: str) -> CachedDataKey:
        now = time.time()
        key = self._cache_key(tenant_scope, record_type, purpose)
        entry = self._cache.get(key)

        if entry and entry.expires_at > now and entry.uses_left > 0:
            entry.uses_left -= 1
            return entry

        response = self.kms.generate_data_key(
            KeyId=self.kms_key_id,
            KeySpec="AES_256",
            EncryptionContext=self._context(tenant_scope, record_type, purpose),
        )

        entry = CachedDataKey(
            plaintext_key=response["Plaintext"],
            encrypted_key=response["CiphertextBlob"],
            expires_at=now + self.ttl_seconds,
            uses_left=self.max_uses - 1,
        )
        self._cache[key] = entry
        return entry

    def encrypt_field(self, plaintext: str, tenant_scope: str, record_id: str) -> dict:
        record_type = "session_turn"
        purpose = "chat_storage"
        context = self._context(tenant_scope, record_type, purpose)
        entry = self._get_data_key(tenant_scope, record_type, purpose)

        aad = json.dumps(
            {
                "record_id": record_id,
                "tenant_scope": tenant_scope,
                "record_type": record_type,
            },
            separators=(",", ":"),
            sort_keys=True,
        ).encode()

        nonce = os.urandom(12)
        ciphertext = AESGCM(entry.plaintext_key).encrypt(nonce, plaintext.encode(), aad)

        return {
            "alg": "AES-256-GCM",
            "key_alias": self.kms_key_id,
            "encrypted_data_key": base64.b64encode(entry.encrypted_key).decode(),
            "nonce": base64.b64encode(nonce).decode(),
            "ciphertext": base64.b64encode(ciphertext).decode(),
            "aad_sha256": hashlib.sha256(aad).hexdigest(),
            "encryption_context": context,
            "record_id": record_id,
            "created_at": int(time.time()),
        }

    def decrypt_field(self, encrypted_blob: dict, tenant_scope: str, record_id: str) -> str:
        context = encrypted_blob["encryption_context"]
        aad = json.dumps(
            {
                "record_id": record_id,
                "tenant_scope": tenant_scope,
                "record_type": "session_turn",
            },
            separators=(",", ":"),
            sort_keys=True,
        ).encode()

        response = self.kms.decrypt(
            CiphertextBlob=base64.b64decode(encrypted_blob["encrypted_data_key"]),
            EncryptionContext=context,
        )

        plaintext_key = response["Plaintext"]
        plaintext = AESGCM(plaintext_key).decrypt(
            base64.b64decode(encrypted_blob["nonce"]),
            base64.b64decode(encrypted_blob["ciphertext"]),
            aad,
        )
        return plaintext.decode()

Implementation notes:

  • AES-GCM gives confidentiality plus integrity.
  • AAD binds the ciphertext to a record identifier so payload swapping becomes detectable.
  • The cache is in memory only. It is never shared across Lambda instances and never written to Redis or disk.
  • The app limits both time-based reuse and count-based reuse of a cached DEK.

DEK Cache Policy

| Parameter | Value | Why |
|---|---|---|
| Cache scope | Per warm runtime instance | Limits horizontal blast radius |
| TTL | 15 minutes | Low enough for containment, high enough to avoid KMS on every request |
| Max uses | 1,000 encryptions | Caps exposure if memory is compromised |
| Storage medium | Process memory only | No persistent plaintext key material |
| Cold start behavior | Fresh DEK request | Avoids stale or cross-instance reuse |
| Emergency mode | TTL forced to 0 | Lets security disable reuse during incident containment |

The exact values can move, but the policy needs both a time bound and a usage bound. Time alone is not enough during bursts, and usage alone is not enough during long-lived warm runtimes.
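
The dual-bound policy is small enough to state directly. A minimal sketch, with `DekCachePolicy` as an illustrative helper and emergency mode modeled as a plain flag:

```python
from dataclasses import dataclass

@dataclass
class DekCachePolicy:
    # Both bounds are required: time alone fails during bursts,
    # usage alone fails in long-lived warm runtimes.
    ttl_seconds: float = 900.0
    max_uses: int = 1000
    emergency_mode: bool = False   # security flips this to disable reuse

    def reusable(self, issued_at: float, uses: int, now: float) -> bool:
        if self.emergency_mode:                  # TTL effectively forced to 0
            return False
        if now - issued_at >= self.ttl_seconds:  # time bound
            return False
        return uses < self.max_uses              # usage bound

policy = DekCachePolicy()
assert policy.reusable(issued_at=0.0, uses=10, now=60.0)
assert not policy.reusable(issued_at=0.0, uses=1000, now=60.0)  # usage cap hit
assert not policy.reusable(issued_at=0.0, uses=10, now=901.0)   # TTL expired
policy.emergency_mode = True
assert not policy.reusable(issued_at=0.0, uses=0, now=1.0)      # containment
```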

2. Store-Level Encryption by Service

DynamoDB Session Store

DynamoDB table encryption is mandatory, but it is not the whole story. Sensitive fields are already field-encrypted before the item is written.

Resources:
  SessionsTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: mangaassist-sessions
      BillingMode: PAY_PER_REQUEST
      SSESpecification:
        SSEEnabled: true
        SSEType: KMS
        KMSMasterKeyId: alias/mangaassist-app

Recommended item pattern:

  • top-level non-sensitive attributes stay queryable
  • sensitive fields are moved into an encrypted_fields map
  • the app stores classification metadata so retention and deletion workflows know what exists without decrypting it
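
A hypothetical item builder following that pattern might look like this; the attribute names `encrypted_fields`, `field_classification`, and `retention_class` are illustrative, not an existing schema:

```python
def build_session_item(session_id: str, turn: int, encrypted_email: dict) -> dict:
    # Queryable metadata stays top-level; ciphertext lives only inside
    # encrypted_fields; classification tags let retention and deletion
    # jobs reason about the item without decrypting anything.
    return {
        "pk": f"SESSION#{session_id}",
        "sk": f"TURN#{turn:06d}",
        "status": "active",                # non-sensitive, queryable
        "encrypted_fields": {
            "email": encrypted_email,      # envelope-encrypted blob
        },
        "field_classification": {"email": "pii"},
        "retention_class": "chat_90d",
    }

item = build_session_item("abc123", 7, {"alg": "AES-256-GCM", "ciphertext": "..."})
assert item["sk"] == "TURN#000007"
assert "email" not in item                 # no plaintext PII at the top level
```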

S3 Audit Archive

Audit evidence has different requirements from chat history:

  • it must be hard to tamper with
  • it must be readable only by a narrow investigation path
  • it often lives longer than session memory

Resources:
  AuditBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: mangaassist-audit-logs
      ObjectLockEnabled: true
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: aws:kms
              KMSMasterKeyID: alias/mangaassist-audit
            BucketKeyEnabled: true

Key S3 controls:

  • deny PutObject unless aws:kms is used
  • deny writes without the expected audit key
  • enable object lock or retention policy for evidence classes
  • replicate to DR with destination-region key mapping
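
The first two controls can be expressed as a bucket policy deny statement. This is a sketch of the pattern, not the deployed policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAuditWritesWithoutKMS",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::mangaassist-audit-logs/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "aws:kms"
        }
      }
    }
  ]
}
```

A second statement using the s3:x-amz-server-side-encryption-aws-kms-key-id condition key can additionally pin writes to the expected audit key.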

OpenSearch

OpenSearch should not become a side door around encryption:

  • domain-level SSE-KMS enabled
  • node-to-node encryption enabled
  • fine-grained access control on index patterns
  • raw PII not indexed

If search on a sensitive identifier is unavoidable, store a blind index instead of plaintext. For example:

  • email_lookup_hmac = HMAC_SHA256(separate_lookup_key, normalized_email)
  • search by exact HMAC match
  • decrypt only the matching records later through the PII handler

This keeps exact-match lookups possible without turning the index into a plaintext leak.
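
A minimal blind-index sketch using only the standard library; `email_lookup_hmac` and the normalization choices are illustrative assumptions:

```python
import hashlib
import hmac
import unicodedata

def email_lookup_hmac(lookup_key: bytes, email: str) -> str:
    # Normalization must be deterministic, or lookups will silently miss.
    normalized = unicodedata.normalize("NFC", email).strip().lower()
    return hmac.new(lookup_key, normalized.encode(), hashlib.sha256).hexdigest()

# The lookup key must be a separate secret, never the PII DEK or CMK.
key = b"example-separate-lookup-key"
a = email_lookup_hmac(key, "  Reader@Example.com ")
b = email_lookup_hmac(key, "reader@example.com")
assert a == b          # same identity maps to the same blind index
assert len(a) == 64    # SHA-256 hex digest; no plaintext in the index
```

Because HMAC is keyed, an attacker who dumps the index cannot brute-force common emails without also stealing the separate lookup key.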

CloudWatch Logs

Application logs should be redacted before emission, but the log groups still use KMS because logs often contain operational metadata, error payloads, and incident breadcrumbs.

Resources:
  AppLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: /aws/lambda/mangaassist-orchestrator
      KmsKeyId: alias/mangaassist-audit
      RetentionInDays: 30
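
The redact-before-emission rule can be sketched as a pattern-based filter. A real pipeline needs broader coverage (phone numbers, addresses) and is better served by allow-list structured logging; this only illustrates the shape:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(message: str) -> str:
    # Replace anything email-shaped before the line reaches CloudWatch.
    return EMAIL_RE.sub("[REDACTED_EMAIL]", message)

assert redact("lookup failed for reader@example.com") == "lookup failed for [REDACTED_EMAIL]"
```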

Secrets Manager and Secure Parameters

Secrets are not configuration by another name. They have different rotation and access semantics.

from aws_secretsmanager_caching import SecretCache, SecretCacheConfig
import boto3

secrets_client = boto3.client("secretsmanager")
cache = SecretCache(
    config=SecretCacheConfig(max_cache_size=128, default_ttl=300),
    client=secrets_client,
)

def get_downstream_secret(secret_id: str) -> str:
    return cache.get_secret_string(secret_id)

The critical policy point is that application roles read secrets through the Secrets Manager API. They do not need broad kms:Decrypt on the secrets key outside that service path.

3. Transit Protection

Network Path View

flowchart LR
    subgraph VPC["Private VPC"]
        Lambda[Orchestrator / PII handler]
        VPCEKMS[KMS VPC endpoint]
        VPCES3[S3 VPC endpoint]
        VPCEDDB[DynamoDB VPC endpoint]
        VPCEBR[Bedrock VPC endpoint]
    end

    Lambda --> VPCEKMS --> KMS[KMS]
    Lambda --> VPCES3 --> S3[S3]
    Lambda --> VPCEDDB --> DDB[DynamoDB]
    Lambda --> VPCEBR --> BR[Bedrock]

Transit Controls by Connection

| Connection | Primary Controls | Notes |
|---|---|---|
| User to API Gateway | TLS 1.2+ | Browser and mobile clients terminate at the edge |
| API Gateway to Lambda | AWS-managed internal transport | Not user-visible; still within AWS boundary |
| Lambda to KMS | TLS + private VPC endpoint + SigV4 | No public internet path |
| Lambda to DynamoDB/S3 | TLS + gateway endpoint | Lower latency and smaller exposure surface |
| Lambda to internal platform API | TLS + mTLS if platform-owned service | Needed only where we control both endpoints |
| OpenSearch client to domain | HTTPS enforced, TLS policy pinned | VPC-only endpoint preferred |

Transit encryption by itself is not enough, but it matters for two reasons:

  • it prevents easy interception of payloads in motion
  • private endpoints reduce the reachable network surface even when TLS is already present

4. IAM and Key Policy Pattern

The most important access-control design is not "allow app role to use KMS." It is "allow only the exact runtime that needs this key, under the exact context that proves the intent."

Runtime Roles

| Role | Allowed Crypto Actions | Explicitly Not Allowed |
|---|---|---|
| mangaassist-orchestrator-role | Read app secrets through Secrets Manager, use store-side encryption indirectly | Direct decrypt of PII ciphertext |
| mangaassist-pii-handler-role | GenerateDataKey, Decrypt on PII key with context restrictions | Schedule deletion, disable key, read audit key |
| mangaassist-audit-writer-role | Put audit objects, use audit key for write path | Read audit objects, decrypt PII |
| mangaassist-security-investigator-role | Break-glass decrypt on PII or audit key with MFA and incident tag | Routine application use |
| mangaassist-kms-admin-role | Manage policies, rotation, aliases | Decrypt data payloads |

Restrictive PII Key Policy Pattern

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowPIIHandlerEncryptDecrypt",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/mangaassist-pii-handler-role"
      },
      "Action": [
        "kms:GenerateDataKey",
        "kms:Decrypt",
        "kms:DescribeKey"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "kms:EncryptionContext:app": "mangaassist",
          "kms:EncryptionContext:classification": "pii",
          "kms:EncryptionContext:purpose": "chat_storage"
        }
      }
    },
    {
      "Sid": "AllowBreakGlassDecryptWithMFAAndIncidentTag",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/mangaassist-security-investigator-role"
      },
      "Action": [
        "kms:Decrypt",
        "kms:DescribeKey"
      ],
      "Resource": "*",
      "Condition": {
        "Bool": {
          "aws:MultiFactorAuthPresent": "true"
        },
        "StringEquals": {
          "aws:PrincipalTag/incident_approved": "true"
        }
      }
    },
    {
      "Sid": "AllowKMSAdminsToManageButNotDecrypt",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/mangaassist-kms-admin-role"
      },
      "Action": [
        "kms:CreateAlias",
        "kms:UpdateAlias",
        "kms:EnableKeyRotation",
        "kms:PutKeyPolicy",
        "kms:TagResource",
        "kms:UntagResource",
        "kms:DescribeKey"
      ],
      "Resource": "*"
    },
    {
      "Sid": "DenyDisableOrDeleteWithoutDedicatedWorkflow",
      "Effect": "Deny",
      "Principal": "*",
      "Action": [
        "kms:DisableKey",
        "kms:ScheduleKeyDeletion"
      ],
      "Resource": "*"
    }
  ]
}

Why this matters:

  • the app runtime gets only the minimum data-plane actions
  • break-glass access becomes explicit and reviewable
  • key administration does not automatically imply data visibility

Performance and Hot-Path Tuning

Encryption is only a good design if it keeps the product usable. The hot path changed after field-level PII encryption was introduced.

flowchart LR
    subgraph Before["Before optimization"]
        B1[Input] --> B2[PII detect]
        B2 --> B3[Encrypt PII inline]
        B3 --> B4[Guardrails]
        B4 --> B5[Persist]
        B5 --> B6[Respond]
    end

    subgraph After["After optimization"]
        A1[Input] --> A2[PII detect]
        A2 --> A3[Guardrails]
        A3 --> A4[Respond]
        A4 --> A5[Async encrypt and persist]
    end

What Changed

| Stage | Before | After | Why It Improved |
|---|---|---|---|
| PII field encryption | Direct inline work for each sensitive write | Local AES-GCM with cached DEK | Fewer KMS calls |
| Persistence timing | Synchronous for all writes | Async for audit-heavy writes | Response no longer waits on long-tail persistence |
| KMS dependency | High fan-out | Controlled and cached | Lower throttle sensitivity |

The implementation rule is simple: encrypt before persistence, but not necessarily before the user gets a response if the persistence path itself can be made asynchronous and reliable.
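
The respond-first, persist-asynchronously split can be sketched with an in-process queue. `AsyncPersister` is a hypothetical stand-in for the real encrypt-and-write path, which in production would also need retry and dead-letter handling:

```python
import queue
import threading

class AsyncPersister:
    """Accepts records immediately; a worker encrypts and persists later."""

    def __init__(self, persist_fn):
        self._q: queue.Queue = queue.Queue()
        self._persist = persist_fn          # stands in for encrypt + DynamoDB write
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            record = self._q.get()
            self._persist(record)
            self._q.task_done()

    def submit(self, record: dict) -> None:
        self._q.put(record)                 # returns immediately; response not blocked

    def drain(self) -> None:
        self._q.join()                      # used in tests and shutdown paths

written = []
p = AsyncPersister(written.append)
p.submit({"session": "abc", "payload": "field ciphertext"})
p.drain()
assert written == [{"session": "abc", "payload": "field ciphertext"}]
```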


Rotation, Revocation, and Lifecycle

Lifecycle View

stateDiagram-v2
    [*] --> Created
    Created --> Reviewed: policy and ownership validated
    Reviewed --> Enabled
    Enabled --> Rotating: automatic rotation or alias migration
    Rotating --> Enabled
    Enabled --> Restricted: incident containment
    Restricted --> Enabled: risk cleared
    Enabled --> Retired: no new encrypts, old decrypt only
    Retired --> PendingDeletion: all ciphertext migrated or expired
    PendingDeletion --> [*]

Three Different "Rotation" Operations

| Operation | What Changes | Key ID Changes | Requires Re-encryption | Typical Use |
|---|---|---|---|---|
| Automatic KMS rotation | Backing key material | No | No | Routine annual rotation |
| Alias swap to new CMK | Underlying CMK behind alias | Yes | Eventually, for old ciphertext if key retirement is required | Policy issue, ownership change, containment |
| Full re-encryption | Ciphertext rewritten with new DEKs and key | Usually yes | Yes | Confirmed compromise or migration |

This distinction is one of the most common interview follow-ups. A lot of weak answers treat all three as the same thing. They are not.

Scenario 2: Planned Annual Rotation Without Downtime

Routine rotation is intentionally boring.

sequenceDiagram
    participant Sec as Security Platform
    participant KMS
    participant App as Application
    participant Data as Existing Ciphertext

    Sec->>KMS: Enable automatic rotation on CMK
    KMS-->>Sec: Rotation schedule active
    App->>KMS: Encrypt new data with same key ID
    KMS-->>App: New backing material used
    App->>KMS: Decrypt older ciphertext
    KMS->>Data: Match older backing material internally
    KMS-->>App: Plaintext returned

Operationally:

  1. Rotate in staging first and validate reads of pre-rotation ciphertext.
  2. Verify new writes still succeed with no code change.
  3. Confirm no custom code incorrectly pins a key version or key ARN that bypasses the alias.
  4. Enable automatic rotation in production.

Because the key ID does not change, the application and stores keep working. That is why routine rotation is low risk.
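
Step 3 can be partially automated with a configuration scan. This is a heuristic sketch; `pins_raw_key` is a hypothetical helper and the patterns assume the standard KMS key ID and key ARN shapes:

```python
import re

# Raw key references bypass alias-based rotation; alias references do not.
KEY_ARN_RE = re.compile(r"arn:aws:kms:[a-z0-9-]+:\d{12}:key/[0-9a-f-]{36}")
KEY_ID_RE = re.compile(r"[0-9a-f-]{36}")

def pins_raw_key(config_value: str) -> bool:
    if config_value.startswith("alias/") or ":alias/" in config_value:
        return False
    return bool(KEY_ARN_RE.search(config_value)) or bool(KEY_ID_RE.fullmatch(config_value))

assert not pins_raw_key("alias/mangaassist-pii")
assert pins_raw_key("arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab")
assert pins_raw_key("1234abcd-12ab-34cd-56ef-1234567890ab")
```

Running a check like this against deployed configuration before enabling rotation catches the one integration mistake that makes "boring" rotation exciting.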

Scenario 3: Emergency Alias Swap After Suspected Key Misuse

Emergency rotation is not the same as annual rotation.

flowchart TD
    Alert[Suspicious decrypt signal] --> Triage{False positive?}
    Triage -->|Yes| Close[Close incident]
    Triage -->|No| NewKey[Create new CMK]
    NewKey --> Alias[Move alias to new CMK for future encrypts]
    Alias --> Contain[Restrict old key to decrypt-only path]
    Contain --> Migrate[Background re-encrypt high-risk records]
    Migrate --> Validate[Validate no active writers use old key]
    Validate --> Retire[Retire old key after evidence window]

Important nuance:

  • new writes start using the new CMK immediately after the alias move
  • old ciphertext still needs the old key to decrypt until migration finishes
  • deleting or disabling the old key too early turns the incident into a self-inflicted outage
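
The alias move and its consequence for old ciphertext can be modeled with a toy key ring. This simulation only illustrates the bookkeeping, not real KMS behavior:

```python
class KeyRing:
    """Toy model: encrypt resolves through the alias, while decrypting
    existing ciphertext still requires the exact key that produced it."""

    def __init__(self):
        self.aliases = {"alias/mangaassist-pii": "key-old"}

    def encrypt(self, alias: str, payload: str) -> dict:
        return {"key_id": self.aliases[alias], "ciphertext": f"<{payload}>"}

    def swap_alias(self, alias: str, new_key_id: str) -> None:
        self.aliases[alias] = new_key_id   # future encrypts move immediately

ring = KeyRing()
old_blob = ring.encrypt("alias/mangaassist-pii", "address")
ring.swap_alias("alias/mangaassist-pii", "key-new")
new_blob = ring.encrypt("alias/mangaassist-pii", "address")

assert old_blob["key_id"] == "key-old"  # still needs the old key until migrated
assert new_blob["key_id"] == "key-new"  # new writes contained under the new CMK
```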

Scenario 4: Cached DEK Exposure in a Warm Runtime

This is a more realistic threat than "KMS master key stolen." The likely problem is a compromised process or role, not KMS itself.

```mermaid
flowchart LR
    Detect[GuardDuty / EDR / abnormal behavior] --> Freeze[Set reserved concurrency to 0 or drain workers]
    Freeze --> Revoke[Revoke role path or isolate function version]
    Revoke --> Flush[Flush in-memory DEK caches by replacing runtimes]
    Flush --> Harden[Set DEK cache TTL to 0 temporarily]
    Harden --> Assess[Estimate records encrypted during exposure window]
    Assess --> Reencrypt[Re-encrypt affected records if needed]
    Reencrypt --> Restore[Restore normal traffic]
```

Key insight: automatic CMK rotation does not help if the leaked asset is a plaintext DEK already in memory. The right response is runtime containment plus selective re-encryption of data written under the exposed DEK window.
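A minimal sketch of the cache policy this scenario relies on, under the assumption of a per-runtime cache that caps both age and reuse (this is an illustration of the policy shape, not the AWS Encryption SDK's caching API). TTL zero is the incident-mode "always fetch fresh" setting referenced in the flow above.

```python
import secrets
import time

class DekCache:
    """Per-runtime DEK cache sketch: never shared, never persisted,
    bounded by both TTL and use count (names and defaults are assumptions)."""

    def __init__(self, ttl_seconds=300, max_uses=100, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.max_uses = max_uses
        self.clock = clock
        self._dek = None
        self._born = 0.0
        self._uses = 0

    def _fresh(self):
        # Stands in for a KMS GenerateDataKey round trip.
        self._dek = secrets.token_bytes(32)
        self._born = self.clock()
        self._uses = 0

    def get(self):
        expired = (
            self._dek is None
            or self.ttl == 0                          # incident mode: no caching
            or self.clock() - self._born >= self.ttl  # age cap
            or self._uses >= self.max_uses            # reuse cap
        )
        if expired:
            self._fresh()
        self._uses += 1
        return self._dek
```

The exposure-window estimate in the assess step falls directly out of these bounds: a leaked DEK can only have protected records written within one TTL and one use budget.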


Disaster Recovery and Restore Design

Encryption adds a second restore dependency: you need the data and the ability to decrypt it in the recovery environment.

DR Dataflow

```mermaid
sequenceDiagram
    participant Primary as Primary Region
    participant PKMS as Primary Keys
    participant Replica as DR Region Bucket
    participant DKMS as DR Keys
    participant Restore as Restore Workflow

    Primary->>Replica: Replicate audit object or backup
    PKMS-->>Primary: Source encryption path
    Replica->>DKMS: Encrypt replica with destination-region key
    Restore->>DKMS: Validate decrypt grants
    Restore->>Replica: Read backup
    Replica-->>Restore: Encrypted object
    Restore->>DKMS: Decrypt in DR environment
    DKMS-->>Restore: Plaintext for controlled restore flow
```

Recommended DR rules:

  1. Do not assume copied data is useful until decryption and alias mapping are tested.
  2. Keep restore automation in a separate account or pipeline from the normal app path.
  3. Test a real restore at least quarterly, including key grants and break-glass approvals.
  4. Record which ciphertext classes stay decryptable in DR and which are intentionally region-bound.
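Rule 1 above can be reduced to a single predicate: a replicated object only counts as a backup once sample objects actually decrypt in the DR environment. The callables here are caller-supplied stand-ins for the replica read path and the DR-region decrypt path (assumptions, not a real restore API).

```python
def backup_is_recoverable(replica_read, dr_decrypt, sample_keys):
    """DR rule 1 as code: copied data is not a backup until it decrypts.
    replica_read(key) -> ciphertext; dr_decrypt(ciphertext) -> plaintext."""
    for key in sample_keys:
        try:
            ciphertext = replica_read(key)
            dr_decrypt(ciphertext)
        except Exception:
            return False   # replicated but not decryptable: restore would fail
    return True
```

Running this check in the quarterly restore drill catches the classic bug where replication works but the restore role was never granted decrypt on the DR key.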

If the system later requires active-active multi-region chat traffic, this design can evolve toward multi-Region keys for specific replicated datasets. Until then, active-passive is simpler and easier to reason about.


Searching and Analytics Without Broad Decryption

A common mistake is adding strong encryption in storage and then undoing it in analytics by decrypting entire tables into a data lake.

Safer Patterns

| Need | Pattern | What We Avoid |
| --- | --- | --- |
| Count PII detections | Use guardrail metadata counters | Decrypting raw conversations for reporting |
| Find a record by exact email | Store HMAC blind index using separate lookup key | Indexing plaintext email in OpenSearch |
| Investigate a single customer issue | Break-glass decrypt of just that record path | Exporting large decrypted datasets |
| Train operational metrics | Use redacted or tokenized fields | Copying full sensitive payloads into analytics |

Principle: analytics should operate on metadata, aggregates, tokens, or blind indexes. Decryption is for support or investigation workflows, not for routine dashboards.
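The blind-index pattern in the table is small enough to show in full. This sketch assumes the lookup key is a separate secret from any encryption key, and the lowercase/strip normalization is an assumption about the schema, not a stated requirement.

```python
import hashlib
import hmac

def blind_index(value: str, lookup_key: bytes) -> str:
    """HMAC-SHA256 blind index over a normalized value.
    Stored alongside the encrypted field; supports exact-match lookup only."""
    normalized = value.strip().lower()
    return hmac.new(lookup_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
```

To find a record by email, the support path recomputes `blind_index(email, lookup_key)` and queries on the digest; the index never contains plaintext, and without the lookup key the digests are not linkable to inputs.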


Monitoring, Audit, and Detection

Encryption without observability creates a false sense of safety. We monitor both success-path health and misuse-path signals.

| Signal | Source | Why It Matters | Trigger Example |
| --- | --- | --- | --- |
| kms:Decrypt AccessDenied on PII key | CloudTrail | Someone tried to read what they should not read | Any principal other than PII handler or break-glass role |
| Sudden spike in GenerateDataKey calls | CloudTrail / CloudWatch | Cache disabled, rollout bug, or abuse | 5x baseline for 10 minutes |
| KMS throttling | CloudWatch metrics | Hot path latency and request failures | P95 KMS latency above threshold |
| DisableKey or alias update event | CloudTrail | High-risk control-plane change | Any change outside approved pipeline window |
| Unencrypted or wrong-key S3 writes | S3 + CloudTrail | Audit evidence control drift | PutObject without expected SSE-KMS headers |
| Secret rotation failure | Secrets Manager | Aging secrets or broken clients | Rotation step fails twice |
| DEK cache miss ratio spike | App metrics | Performance regression or forced incident mode | Cache miss ratio doubles unexpectedly |

Example Detection Pipeline

```mermaid
flowchart LR
    CT[CloudTrail KMS events] --> EB[EventBridge rules]
    EB --> Lambda[Security detection Lambda]
    Lambda --> Pager[Pager / ticket]
    Lambda --> SIEM[SIEM timeline]
    AppMetrics[Application metrics] --> Alarm[CloudWatch alarm]
    Alarm --> Pager
```

Two rules matter a lot in practice:

  • denied decrypt attempts are alerts, not "all good" signals
  • alias changes are treated as security events even when they are legitimate, because they change the meaning of future encrypt operations
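Those two rules are simple enough to express as the core of the detection Lambda. The event names follow CloudTrail's KMS record format; the error-code string and the severity labels are assumptions for illustration.

```python
def classify_kms_event(event: dict) -> str:
    """Sketch of the two rules above over CloudTrail-shaped dicts.
    Returns 'alert', 'security-event', or 'ok'."""
    name = event.get("eventName", "")
    # Rule 1: denied decrypts are alerts, not "all good" signals.
    if name == "Decrypt" and "AccessDenied" in event.get("errorCode", ""):
        return "alert"
    # Rule 2: alias and control-plane changes are security events even
    # when legitimate, because they change the meaning of future encrypts.
    if name in ("UpdateAlias", "DisableKey", "ScheduleKeyDeletion"):
        return "security-event"
    return "ok"
```

A real pipeline would enrich these with principal, deployment version, and pipeline-window context before paging, but the classification boundary is the part reviewers probe.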

Scenario 5: Investigating Unauthorized Decrypt Attempts

This is the operational scenario interviewers often ask for after you explain the architecture.

Incident Flow

```mermaid
flowchart TD
    Event[CloudTrail event: denied kms:Decrypt on PII key] --> Alert[Security alert created]
    Alert --> Identify[Identify principal and deployment version]
    Identify --> Inspect[Inspect code path and recent change]
    Inspect --> Decision{Benign bug or malicious behavior?}
    Decision -->|Bug| Fix[Remove decrypt call and redesign data access]
    Decision -->|Malicious or unclear| Contain[Disable path, isolate role, preserve evidence]
    Fix --> Guardrail[Add detection rule and review guardrails]
    Contain --> Guardrail
    Guardrail --> Close[Close with post-incident actions]
```

Example Investigation Narrative

  1. CloudTrail shows a denied kms:Decrypt attempt against alias/mangaassist-pii.
  2. The caller is an analytics batch role, which should never decrypt raw PII.
  3. Recent code added a metric job that tried to inspect encrypted session payloads directly.
  4. IAM blocked it, but the attempt still reveals a broken design assumption.
  5. The correct fix is to move the metric to guardrail metadata or blind indexes rather than widening decrypt access.

The design lesson is that IAM denial is the last line of defense, not the primary detection strategy. If the system is trying to decrypt where it should not, the architecture or the code path needs correction.


Operational Runbooks

Break-Glass Decrypt

Use this path only for narrow support or security investigations:

  1. Investigator receives an approved incident or support case ID.
  2. Temporary role session is tagged with incident_approved=true.
  3. MFA is required.
  4. Decrypt is limited to the minimum records needed.
  5. Every decrypt event is correlated with ticket ID, principal, and record ID.
  6. Session expires automatically.
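The runbook's gate can be collapsed into one predicate that the decrypt path evaluates before touching KMS. The field names (`incident_approved`, `ticket_id`, `mfa`, `expires_at`) mirror the steps above, but the exact session schema is an assumption.

```python
import time

def break_glass_allowed(session: dict, now=None) -> bool:
    """All break-glass conditions must hold: approved incident tag,
    ticket reference, MFA, and an unexpired session."""
    now = time.time() if now is None else now
    tags = session.get("tags", {})
    return (
        tags.get("incident_approved") == "true"   # step 2: tagged session
        and bool(session.get("ticket_id"))        # steps 1 and 5: correlation
        and session.get("mfa") is True            # step 3
        and session.get("expires_at", 0) > now    # step 6: auto-expiry
    )
```

Step 4, minimum-record scope, lives in the query layer rather than this predicate, and step 5's audit correlation happens on every decrypt regardless of the gate's outcome.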

"KMS Unavailable" Behavior

If KMS is degraded:

  • new PII writes may fail closed if a fresh DEK is required and no cache is available
  • existing cached DEKs allow a short grace window for encrypt operations already in flight
  • decrypt-heavy support workflows should degrade before the customer chat path does
  • audit logging should queue rather than write plaintext fallback files

The system should prefer temporary reduced functionality over silently persisting plaintext.
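That fail-closed preference can be sketched as a single write path. The callables are caller-supplied stand-ins (assumptions), and the key property is structural: there is no branch in which the record is persisted unencrypted.

```python
class KmsUnavailable(Exception):
    """Raised when no DEK can be obtained, fresh or cached."""

def write_pii(record: bytes, get_dek, encrypt, queue_for_retry):
    """Fail-closed PII write sketch: if key material is unavailable,
    queue the write for retry rather than fall back to plaintext."""
    try:
        dek = get_dek()                 # cache hit, or KMS GenerateDataKey
    except KmsUnavailable:
        queue_for_retry(record)         # reduced functionality, not plaintext
        return None
    return encrypt(dek, record)         # the only path that persists anything
```

The same shape applies to audit logging during a KMS outage: the queue absorbs the degradation so no code path is tempted to write a plaintext fallback file.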

Deletion and Retention

Deletion logic needs metadata to know what must be erased:

  • primary session record deleted or TTL-expired
  • related blind indexes removed
  • audit evidence retained only if policy requires it, otherwise deleted on schedule
  • secrets rotated and old versions retired
  • backups handled through retention windows rather than piecemeal mutation

Encryption is not a substitute for deletion, but it can reduce exposure during retention windows.


Testing Strategy

  1. Unit tests verify that decrypt fails when encryption context or AAD does not match.
  2. Integration tests verify DynamoDB, S3, and log groups reject writes with the wrong key policy.
  3. Rotation tests verify old ciphertext remains readable after routine KMS rotation.
  4. Incident drills simulate unauthorized decrypt attempts and validate alerting, containment, and audit evidence.
  5. DR tests restore encrypted backups into a recovery environment and validate actual decrypt permissions.
  6. Performance tests compare P95 latency with cache enabled, cache disabled, and forced cache TTL zero.

Good encryption design is operational only when these tests are routine, not theoretical.
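Test 1 above, decrypt must fail on a context mismatch, is worth showing concretely. This is a stdlib-only encrypt-then-MAC toy standing in for AES-GCM with AAD; it is deliberately not production cryptography, but it demonstrates the property under test: the tag binds ciphertext and context together, so changing the context refuses the decrypt.

```python
import hashlib
import hmac
import secrets

def seal(key: bytes, plaintext: bytes, aad: bytes) -> dict:
    """Toy AEAD stand-in (illustration only): XOR keystream + HMAC tag
    computed over nonce, ciphertext, AND the associated data."""
    nonce = secrets.token_bytes(12)
    ks = hashlib.sha256(key + nonce).digest() * (len(plaintext) // 32 + 1)
    ct = bytes(a ^ b for a, b in zip(plaintext, ks))
    tag = hmac.new(key, nonce + ct + aad, hashlib.sha256).digest()
    return {"nonce": nonce, "ct": ct, "tag": tag}

def open_sealed(key: bytes, blob: dict, aad: bytes) -> bytes:
    """Refuses to decrypt unless the caller supplies the same AAD."""
    expected = hmac.new(key, blob["nonce"] + blob["ct"] + aad, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, blob["tag"]):
        raise ValueError("context mismatch: decrypt refused")
    ks = hashlib.sha256(key + blob["nonce"]).digest() * (len(blob["ct"]) // 32 + 1)
    return bytes(a ^ b for a, b in zip(blob["ct"], ks))
```

The unit test then asserts that sealing under `customer_id=123` and opening under any other context raises, which is the same assertion shape used against the real AES-GCM path.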


Architecture Decisions and Tradeoffs

| Decision | What We Chose | Alternative | Upside | Downside |
| --- | --- | --- | --- | --- |
| Key separation | Separate app, PII, audit, and secrets keys | One shared CMK | Smaller blast radius, better least privilege | More policy and lifecycle management |
| PII encryption style | Envelope encryption with AES-GCM and cached DEKs | Direct KMS per field | Lower latency and cost | In-memory DEKs require careful containment |
| Store encryption | SSE-KMS everywhere | Service-owned default encryption | Better auditability and key control | More operational overhead |
| Audit isolation | Separate audit key and S3 object lock | Reuse app key and mutable logs | Stronger forensic integrity | Harder access workflow |
| Search on sensitive IDs | Blind indexes only | Plaintext indexing | Reduces search-side leakage | Supports exact match only, not full-text |
| DR strategy | Active-passive with tested key mapping | Active-active from day one | Simpler operations | Slower failover and more manual preparation |

Follow-Up Questions and Deep-Dive Answers

Question 1: Why not use one CMK for everything? It is simpler.

Deep-dive answer: Simpler key inventory creates a more dangerous trust model. If the same CMK protects sessions, PII, audit logs, and secrets, then any policy error, overly broad grant, or investigative decrypt path becomes a cross-system exposure event. Separate keys let us align access with business purpose: app runtimes need app data, PII handlers need sensitive fields, investigators need audit evidence, and secrets access should flow through Secrets Manager. The extra key count is operational overhead, but it buys much tighter blast-radius control.

Question 2: Why not call KMS directly for every encrypt and decrypt?

Deep-dive answer: Because KMS is a trust anchor, not a per-field data plane. Direct KMS encryption increases latency, cost, and throttle sensitivity, especially in a chat product with many short fields. Envelope encryption keeps KMS in the key-distribution role while local AES-GCM handles bulk work efficiently. The control point stays strong because the DEK is still protected by KMS and bound to an encryption context. We accept limited in-memory DEK exposure in exchange for a practical hot path.

Question 3: What is the real risk of caching data keys in memory?

Deep-dive answer: The risk is not hypothetical. If a runtime is compromised, the attacker may extract a plaintext DEK from memory and decrypt data written under that DEK. That is why cache scope, TTL, and use count matter. We keep the cache per runtime instance, never share it, never persist it, and cap both age and reuse. During an incident we can drain workers and set TTL to zero. The design accepts a bounded local risk to avoid an unbounded availability and latency problem.

Question 4: How do you rotate keys without rewriting all old data?

Deep-dive answer: Routine KMS rotation does not change the key ID, so KMS retains older backing material and continues to decrypt old ciphertext transparently. No rewrite is needed. A rewrite becomes necessary only when we intentionally move to a brand-new CMK and want to retire the old one. That usually happens for policy changes, account moves, or incident containment, not for routine annual rotation.

Question 5: What if an analytics team wants to report on encrypted PII fields?

Deep-dive answer: The correct first answer is usually "they should not need raw PII." Analytics should consume metadata, counters, tokenized values, or blind indexes. If the ask is legitimate support lookup, we use a narrow decrypt workflow over a minimal dataset. If the ask is a dashboard, we redesign the data product. Pulling decrypted PII into an analytics environment defeats most of the point of encrypting it in the first place.

Question 6: How do you search encrypted data if a customer needs account help?

Deep-dive answer: We avoid full-text search on raw sensitive fields. For exact-match lookup cases like email or order reference, we store a blind index such as HMAC-SHA256 using a separate lookup key. That gives deterministic equality search without exposing plaintext in the index. Once a match is found, the support path can request controlled decrypt of the specific record if policy allows it. We trade search flexibility for a much tighter privacy posture.

Question 7: How do you stop key administrators from becoming data readers?

Deep-dive answer: By separating key administration from decrypt rights in both IAM and KMS policy. The KMS admin role can manage aliasing, tagging, and rotation settings, but it does not get kms:Decrypt. Break-glass investigation is a different role with MFA, ticket tagging, and short sessions. This is one of the clearest places to show mature separation of duties in a design review.

Question 8: What is your response if you see unauthorized decrypt attempts on the PII key?

Deep-dive answer: First confirm whether the attempts were denied or successful, but treat both as incidents. If denied, investigate the caller, deployment, and code path because something tried to cross a boundary it should not cross. If successful and unexpected, contain immediately: revoke or isolate the role, stop new traffic if needed, preserve logs, assess what records were accessible, and decide whether alias swap or re-encryption is required. The biggest mistake is dismissing denied attempts as harmless noise.

Question 9: What breaks first if KMS is slow or throttled?

Deep-dive answer: The first symptom is usually latency on workflows that need fresh DEKs or secret retrieval. With a healthy cache, the chat path can often tolerate a brief KMS degradation. Without cache, the system amplifies the problem into a customer-facing incident. That is why we measure KMS latency, DEK cache miss ratio, and secret cache miss ratio together. It is also why fail-closed logic needs queueing or retry behavior instead of plaintext fallback writes.

Question 10: How do backups and DR interact with encrypted data?

Deep-dive answer: Backups are only recoverable if the recovery environment can decrypt them. The DR design therefore includes destination-region keys, alias mapping, restore-role grants, and periodic restore tests. A common operational bug is to replicate data but forget to validate the decrypt path until a disaster. The DR story is incomplete unless you have demonstrated end-to-end restore of encrypted data in a separate environment.

Question 11: What would you change at 10x scale?

Deep-dive answer: I would not start by weakening encryption. I would first reduce avoidable decrypts, push more analytics to metadata, tune DEK cache policy carefully, add request batching where it preserves isolation, and consider broader use of async persistence. If the system becomes multi-region active-active, I would revisit which datasets need multi-Region keys and which should remain region-scoped. Scale pressure should push architectural clarity, not cryptographic shortcuts.

Question 12: What is the single most important implementation detail people forget?

Deep-dive answer: They forget that encryption context, alias policy, restore grants, and operational telemetry are part of the design, not afterthoughts. Many diagrams stop at "store uses KMS." Real systems fail on the edges: wrong role gets decrypt, old key disabled too soon, DR restore lacks grants, or a cache bug floods KMS. The implementation details around lifecycle and monitoring are where mature answers separate from surface-level ones.


Key Lessons

  1. Encryption architecture should mirror data sensitivity and access patterns, not just compliance labels.
  2. KMS-backed store encryption is necessary, but it does not replace field-level protection for truly sensitive data.
  3. Routine rotation, emergency alias swap, and full re-encryption are different operations with different risks.
  4. The practical risk surface is usually role misuse, logging mistakes, and restore failure, not broken AES.
  5. If the design cannot explain analytics, DR, and break-glass access, it is not a complete encryption design.

Cross-References