
US-MLE-03: Sentiment + ABSA (Aspect-Based Sentiment Analysis) Lifecycle

User Story

As an ML Engineer at Amazon-scale on the MangaAssist team, I want to own the lifecycle of a multi-task DeBERTa-v3 model that produces 3-class sentiment and 18-aspect ABSA scores over the 50M cumulative review corpus — including a quarterly relabel loop, a versioned aspect schema with non-disruptive aspect_v3 → aspect_v4 migrations, async serving for new reviews, nightly batch transform for full-corpus refresh, and bilingual EN/JP aspect-name canonicalisation, So that the Review-Sentiment MCP returns accurate aspect breakdowns to Claude (and downstream to US-MLE-06 recommendation), keeps pace with aspect-emergence (e.g., ai_art_authenticity, vertical_scroll_ux) tracked in Ground-Truth-Evolution/ML-Scenarios/06-absa-aspect-emergence.md, and never trains on spam-poisoned reviews (always reads the spam-cleaned corpus produced by US-MLE-07).

Acceptance Criteria

  • Sentiment macro-F1 ≥ 0.85 (3-class: positive/neutral/negative) on the frozen golden set.
  • Aspect-F1 (micro across the 18 active aspects in aspect_v3) ≥ 0.74 on the golden set.
  • Per-language sentiment macro-F1 ≥ 0.83 EN, ≥ 0.81 JP, ≥ 0.78 mixed.
  • Per-aspect F1 ≥ 0.70 for each of the 6 high-volume aspects (art_style, story_pacing, character_development, translation_quality, print_quality, value_for_money); ≥ 0.60 for the remaining 12.
  • Quarterly vendor relabel of 5K examples completed; cross-vendor IAA (Appen vs Sama on shared 10% subset) Cohen's κ ≥ 0.72 EN, ≥ 0.70 JP.
  • LLM-distilled augmentation labels (Claude Sonnet) capped at 25% of any class's training examples; precision audit on a 500-review vendor sample ≥ 0.90.
  • Async-inference p95 latency ≤ 5s; batch-transform full nightly run on ~5M edited reviews completes within a 4-hour window.
  • Aspect-schema migration from v3 → v4 (adding/deprecating aspects) does not require retraining the 18 stable aspects; only the delta is re-trained.
  • PII redaction (names, addresses, phone, email, order-number patterns) applied at ingestion before training, before embedding, and before nightly batch — verifiable via dual-write audit log.
  • Drift hub emits an SNS aspect-emergence signal when the uncovered_aspect_rate (per aspect-emergence detector) exceeds 4% sustained 14 days; signal triggers a planning-level schema-bump cycle.
  • Rollback to previous registry version via SageMaker async endpoint update + batch-transform job ARN flip ≤ 5 minutes.
  • All artifacts (training, evaluation, serving) reside in ap-northeast-1; review text never leaves residency boundary.

Architecture (HLD)

The Production Surface

The sentiment + ABSA model is the engine behind the Review-Sentiment MCP. It does not sit on the chatbot's hot synchronous path — the MCP reads pre-computed aspect scores ({art_style: 4.8, story_pacing: 4.6, ...}) from the aspects field of the OpenSearch review-corpus index. Those pre-computed scores come from this model, run in two modes:

  1. Async inference — every newly-submitted review (~80K/day) is enqueued to a SageMaker async endpoint (see the enqueue sketch after this list). Per-review latency is not critical (max ~30s is fine) because the user experience is "review submitted; sentiment shows up within a minute on the title page." The async endpoint absorbs bursts (review-bombing, new-volume-release spikes) without provisioning for peak.
  2. Batch transform — every night, all reviews flagged with edit_event (vote count changed, helpful flag toggled, edited text, ABSA-aspect-schema bumped) — typically ~5M reviews — are re-scored in a single SageMaker batch transform job. This refreshes the OpenSearch index's aspects field and the per-volume aggregate DynamoDB items consumed by get_volume_reception.
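A minimal producer sketch for the enqueue step in mode 1. invoke_endpoint_async and its InputLocation/OutputLocation semantics are the real SageMaker Runtime API; the bucket, key layout, and endpoint name here are illustrative, not from this repo:

# enqueue_review.py — hypothetical producer for the async path
import json
import uuid
import boto3

smr = boto3.client("sagemaker-runtime", region_name="ap-northeast-1")
s3 = boto3.client("s3", region_name="ap-northeast-1")

def enqueue_review(review_id: str, text: str, language: str) -> str:
    # Async inference reads its payload from S3, not from the request body.
    key = f"absa/in/{uuid.uuid4()}.json"
    s3.put_object(
        Bucket="manga-ml-async-in-apne1",  # assumed bucket name
        Key=key,
        Body=json.dumps({"review_id": review_id, "text": text, "language": language}),
    )
    resp = smr.invoke_endpoint_async(
        EndpointName="absa-async-v22",
        InputLocation=f"s3://manga-ml-async-in-apne1/{key}",
        ContentType="application/json",
        InvocationTimeoutSeconds=300,  # generous vs. the 5s p95 target
    )
    # The OpenSearch-update Lambda is notified via SNS and reads this output URI.
    return resp["OutputLocation"]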

The model is multi-task DeBERTa-v3-base (microsoft/deberta-v3-base, 140M parameters) with a shared encoder and two task heads:

Sentiment head:    3-class softmax over {positive, neutral, negative}
Aspect head:       18-class multi-label sigmoid over aspect_v3 schema
                   (each aspect outputs a polarity score in [-1, +1] when present)

The aspect_v3 schema (current; bumps to v4 mid-2026 per the aspect-emergence trigger):

art_style                  story_pacing               character_development
translation_quality        print_quality              value_for_money
shipping                   binding_durability         dialogue_naturalness
panel_layout               worldbuilding              emotional_impact
genre_authenticity         volume_consistency         cover_art
release_cadence            collector_value            content_warnings

The 18 aspects are versioned; the schema versioning is the architectural lever that makes v3 → v4 migrations non-disruptive (see §LLD-9).

End-to-End ML Lifecycle Diagram

flowchart TB
    subgraph CORPUS[Review Corpus Layer]
        RAW[Raw reviews<br/>~80K/day ingested<br/>50M cumulative]
        SPAM[US-MLE-07 Spam Filter<br/>weekly retrain<br/>cleans corpus]
        PII[PII Redactor<br/>Comprehend + regex]
        CLEAN[Spam-clean + PII-redacted<br/>review corpus<br/>Iceberg, ap-northeast-1]
        RAW --> SPAM --> PII --> CLEAN
    end

    subgraph LABELS[Label Layer]
        L1[Vendor Appen<br/>~5K/quarter<br/>primary]
        L2[Vendor Sama<br/>~500/quarter<br/>cross-vendor IAA only]
        L3[LLM-Distill Sonnet<br/>~50K/quarter<br/>augmentation]
        L4[Programmatic<br/>emoji + rating proxy<br/>low-confidence sentiment only]
        LP[Label Platform<br/>aspect_version pinned<br/>Iceberg]
        L1 --> LP
        L2 --> LP
        L3 --> LP
        L4 --> LP
    end

    subgraph SCHEMA[Aspect Schema Layer]
        TAX[aspect_taxonomy.yaml<br/>v3 effective_from 2025-Q4]
        DISC[Aspect-Discovery Loop<br/>quarterly]
        DET[Uncovered-Aspect<br/>Detector]
        TAX -.feeds.-> LP
        DISC --> TAX
        DET -.signals.-> DISC
    end

    subgraph TRAIN[Training Pipeline - SageMaker Pipelines]
        T1[1. Schema + Data Validation]
        T2[2. PIT Feature Materialization]
        T3[3. Stratified Split<br/>language x sentiment x aspect]
        T4[4. Multi-Task Training<br/>g5.12xlarge 1 node DDP<br/>~6h spot ~$45]
        T5[5. Offline Eval - 5 modes]
        T6[6. Slice + Per-Aspect Analysis]
        T7[7. Bias + PII Audit]
        T8[8. Model Registration]
        T1 --> T2 --> T3 --> T4 --> T5 --> T6 --> T7 --> T8
    end

    subgraph SERVE[Serving Layer]
        REG[Model Registry<br/>v22 prod, v23 candidate]
        ASYNC[SageMaker Async Endpoint<br/>g5.2xlarge AS<br/>~80K reviews/day]
        BATCH[SageMaker Batch Transform<br/>nightly ~5M reviews<br/>g5.12xlarge x 4 nodes]
        OS[(OpenSearch<br/>review-corpus index<br/>aspects field)]
        DDB[(DynamoDB<br/>per-volume aggregates)]
        REG --> ASYNC
        REG --> BATCH
        ASYNC --> OS
        BATCH --> OS
        BATCH --> DDB
    end

    subgraph CONS[Consumers]
        MCP[Review-Sentiment MCP<br/>RAG-MCP-Integration/04]
        REC[US-MLE-06 Recommendation<br/>aspect signal]
    end

    subgraph DRIFT[Drift Detection]
        D1[Drift Hub<br/>PSI/KS/aspect-emergence]
        D2[CloudWatch + SNS]
        D1 --> D2
        D2 -.triggers.-> T1
    end

    CLEAN --> T1
    LP --> T1
    TAX --> T1
    T8 --> REG
    OS --> MCP
    DDB --> MCP
    OS --> REC
    ASYNC -.predictions.-> D1
    BATCH -.predictions.-> D1
    DET -.uncovered rate.-> D1

    style CLEAN fill:#9cf,stroke:#333
    style LP fill:#9cf,stroke:#333
    style TAX fill:#fde68a,stroke:#92400e
    style REG fill:#fd2,stroke:#333
    style D1 fill:#fd2,stroke:#333
    style PII fill:#f66,stroke:#333

Data Contracts and Volume

| Asset | Schema Version | Snapshot Cadence | Volume | Owner |
|---|---|---|---|---|
| reviews_clean (Iceberg table) | review_v4 | Continuous (Kinesis → Iceberg) | 50M cumulative; ~80K/day added | US-MLE-07 (provider) |
| aspect_taxonomy.yaml | aspect_v3 (current) | On taxonomy bump (quarterly–biannually) | 18 active aspects | Content-Ops + ML Eng (this story) |
| absa_labels (Iceberg table) | label_v6, keyed on (review_id, aspect_version) | Quarterly batches + continuous LLM-distill | ~150K labeled cumulative | ML Eng + Annotation Vendor PM |
| absa_model (model package) | model_v22 in prod | Quarterly + on aspect-bump trigger | 140M params, ~560MB FP32 / ~280MB FP16 | ML Eng (this story) |
| review_aspect_scores (OpenSearch field) | aspects.v3.* | Per-review on async; full refresh nightly batch | ~50M docs | ML Eng + Search Platform |
| volume_reception (DynamoDB) | per (manga_id, volume_no, aspect_version) | Nightly | ~2M items | ML Eng (this story) |
| absa_drift_metrics (CloudWatch) | n/a | 5min input, daily concept, daily aspect-emergence | ~30K data points/day | Drift Hub |

Model Registry + Promotion Gates

flowchart LR
    R22[Registry v22<br/>prod<br/>aspect_v3 schema<br/>label_v=608]:::prod
    R23[Registry v23<br/>candidate<br/>aspect_v3 schema<br/>label_v=655]:::cand

    R23 --> G1{Stage 1<br/>Offline Gate}
    G1 -->|pass| G2{Stage 2<br/>Shadow async<br/>72h}
    G1 -.fail.-> RB1[Rollback v22]
    G2 -->|pass| G3{Stage 3<br/>Canary<br/>10% async + 5% batch}
    G2 -.fail.-> RB2[Rollback v22]
    G3 -->|pass| G4[Stage 4<br/>Full Promote v23<br/>incl. backfill 5M reviews]
    G3 -.fail.-> RB3[Rollback v22]

    classDef prod fill:#2d8,stroke:#333
    classDef cand fill:#fd2,stroke:#333
    style RB1 fill:#f66,stroke:#333
    style RB2 fill:#f66,stroke:#333
    style RB3 fill:#f66,stroke:#333
    style G4 fill:#2d8,stroke:#333

The four-stage promotion gate follows the standard from deep-dives/00-foundations-and-primitives-for-ml-engineering.md §5.1, with story-specific thresholds:

  • Stage 1 (offline): sentiment macro-F1 ≥ 0.85; aspect-F1 ≥ 0.74; per-aspect F1 ≥ 0.70 (top-6) and ≥ 0.60 (rest); per-language slice gates pass; PII-leak audit passes (zero PII tokens in any prediction explanation, on a 1K sample).
  • Stage 2 (shadow async): 72-hour shadow run on the async endpoint comparing v23 against v22 on incoming reviews; per-aspect agreement rate ≥ 70% (lower than intent's 75% because aspects are multi-label and minor disagreements are normal); p95 async latency ≤ 1.3× v22.
  • Stage 3 (canary): 10% of async traffic + 5% of batch transform run on v23 for 7 days; downstream Review-Sentiment MCP summary-quality spot-check (content-ops 30 summaries/week) shows ≥ 90% agreement.
  • Stage 4 (full): promote, kick off backfill batch-transform over the 5M most-recent reviews to refresh OpenSearch aspects field with v23 scores. v22 retained 30 days for rollback; OpenSearch keeps aspects.v3.* in two parallel sub-fields (one indexed by v22, one by v23) until backfill completes.

The cumulative-window pattern from primitive §7.2 applies here: the training set always includes the entire 150K labeled history, with importance weighting toward recent labels (decay τ = 18 months). This is the right choice because aspect signals are slow-moving; sliding-window training would discard rare-aspect examples.
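A minimal sketch of that importance weighting; τ and the month units come from the paragraph above, the helper name is ours:

# label_weights.py — recency weighting for the cumulative training window
import math

TAU_MONTHS = 18.0  # decay constant from the promotion-gate section

def label_weight(age_months: float) -> float:
    """Exponential-decay importance weight for one training example."""
    return math.exp(-age_months / TAU_MONTHS)

# A fresh label weighs 1.0; a 36-month-old label weighs exp(-2) ≈ 0.135.
# Rare-aspect examples are never dropped, only down-weighted.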


Low-Level Design

1. PII Redaction — Boundary Discipline

Review text is user-generated and routinely contains PII: shipping addresses ("shipped to 1-2-3 Shibuya..."), order numbers ("order #112-7654321"), names ("thanks Tanaka-san"), email, phone. Per the cross-cutting concern in the README and the foundations doc, PII redaction happens at three boundaries: ingestion (before storage in reviews_clean), embedding (before any vectorisation), and nightly batch (before the model sees the text). The redactor is run once at ingestion and the redacted form is what is stored — but a defensive re-pass runs at training-set load and at batch-transform input, because residency obligations require fail-closed behaviour even on stale data.

# pii_redactor.py
from dataclasses import dataclass
import re
import boto3

# Compiled once; module-level for performance.
_ORDER_RE = re.compile(r"\b\d{3}-\d{7}\b")
_PHONE_JP_RE = re.compile(r"\b0\d{1,4}-\d{1,4}-\d{4}\b")
_PHONE_EN_RE = re.compile(r"\b\+?\d{1,3}[-.\s]?\(?\d{1,4}\)?[-.\s]?\d{1,4}[-.\s]?\d{4}\b")
_EMAIL_RE = re.compile(r"\b[\w._%+-]+@[\w.-]+\.[A-Za-z]{2,}\b")
_JP_ADDRESS_HINT = re.compile(r"[都道府県市区町村]")


@dataclass
class RedactionReport:
    spans_redacted: int
    redaction_types: dict[str, int]


class ReviewPIIRedactor:
    """Three-layer redaction: regex (fast), Comprehend NER (deep), JP-specific.

    Runs in ap-northeast-1 only. Comprehend client is bound to the region;
    cross-region calls are denied at IAM level.
    """

    def __init__(self, region: str = "ap-northeast-1"):
        self.comprehend = boto3.client("comprehend", region_name=region)
        # Separate handle for the JP path (same region; kept distinct so
        # JP-specific configuration can diverge without touching the EN path).
        self.comprehend_jp = boto3.client("comprehend", region_name=region)

    def redact(self, text: str, language: str) -> tuple[str, RedactionReport]:
        report = RedactionReport(spans_redacted=0, redaction_types={})
        text = self._regex_pass(text, report)
        text = self._comprehend_pass(text, language, report)
        if language in ("ja", "mixed"):
            text = self._jp_specific_pass(text, report)
        return text, report

    def _regex_pass(self, text: str, report: RedactionReport) -> str:
        for re_obj, label in [
            (_ORDER_RE, "ORDER"),
            (_EMAIL_RE, "EMAIL"),
            (_PHONE_JP_RE, "PHONE_JP"),
            (_PHONE_EN_RE, "PHONE_EN"),
        ]:
            text, count = re_obj.subn(f"[{label}]", text)
            report.spans_redacted += count
            report.redaction_types[label] = report.redaction_types.get(label, 0) + count
        return text

    def _comprehend_pass(self, text: str, language: str, report: RedactionReport) -> str:
        if len(text) < 3 or len(text) > 5000:
            return text
        client = self.comprehend_jp if language == "ja" else self.comprehend
        resp = client.detect_pii_entities(Text=text, LanguageCode=language[:2])
        # Apply redactions back-to-front to avoid offset drift.
        for ent in sorted(resp["Entities"], key=lambda e: -e["BeginOffset"]):
            label = ent["Type"]
            text = text[: ent["BeginOffset"]] + f"[{label}]" + text[ent["EndOffset"]:]
            report.spans_redacted += 1
            report.redaction_types[label] = report.redaction_types.get(label, 0) + 1
        return text

    def _jp_specific_pass(self, text: str, report: RedactionReport) -> str:
        # Address tokens: redact spans containing 都/道/府/県/市/区/町/村
        # plus surrounding katakana/kanji within +/-12 chars.
        if _JP_ADDRESS_HINT.search(text):
            # In production this calls a SudachiPy-based span detector;
            # collapsed here for brevity.
            text = jp_address_span_redactor(text)
            report.spans_redacted += 1
            report.redaction_types["ADDRESS_JP"] = report.redaction_types.get("ADDRESS_JP", 0) + 1
        return text

The redactor's audit log writes (review_id, redaction_report, redactor_version) to a separate Iceberg table pii_redaction_audit. Quarterly audit reads a 1% sample and re-runs the redactor at the latest version; any review with new spans found is flagged and re-redacted across the corpus.

2. Aspect Schema as Versioned Artifact

The 18 aspects are not a constant — they are a product artifact per the aspect-emergence drift counterpart. The taxonomy lives in S3 and is signed off cross-functionally:

# s3://manga-ml-taxonomy-apne1/absa/v3/aspect_taxonomy.yaml
taxonomy:
  version: 3
  effective_from: 2025-10-01
  signed_off_by: ["content-ops-lead", "ml-eng-lead", "product-pm"]
  aspects:
    - id: art_style
      label_en: "Art / illustration style"
      label_jp: "作画"
      synonyms_en: ["illustration", "drawing", "artwork", "panels"]
      synonyms_jp: ["イラスト", "絵", "画風"]
      example_phrases_en: ["the art is gorgeous", "panels are clean", "linework is detailed"]
      example_phrases_jp: ["絵が綺麗", "作画が安定している"]
      effective_from: 2024-01-01
      stable: true
    - id: story_pacing
      label_en: "Story pacing"
      label_jp: "ストーリー展開"
      synonyms_en: ["pacing", "plot speed", "story flow"]
      synonyms_jp: ["ストーリー", "展開", "テンポ"]
      effective_from: 2024-01-01
      stable: true
    - id: translation_quality
      label_en: "Translation quality"
      label_jp: "翻訳品質"
      synonyms_en: ["translation", "TL", "localization"]
      synonyms_jp: ["翻訳", "ローカライズ"]
      effective_from: 2024-01-01
      stable: true
    # ... 15 more stable aspects ...
  candidate_aspects:    # Discovery-loop output, not yet promoted
    - id: ai_art_authenticity
      candidate_since: 2026-Q1
      promoted_in_version: null   # under review for v4
      sample_phrases_en: ["AI-generated colorist", "looks AI-touched", "filters look AI"]
      sample_phrases_jp: ["AI仕上げっぽい", "AIっぽい彩色"]
    - id: vertical_scroll_ux
      candidate_since: 2025-Q4
      promoted_in_version: null

Labels and predictions are schema-versioned: every absa_labels record is keyed on (review_id, aspect_version), and every prediction is written to aspects.v3.<aspect_id> (versioned sub-field) in OpenSearch. When v3 → v4 happens, OpenSearch keeps both aspects.v3.* and aspects.v4.* populated until the backfill completes. The MCP reads aspects.v3.* until the read-flag flips.
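Illustratively, a review document during the migration window carries both versioned sub-fields (values invented):

# opensearch_doc_shape.py — illustrative document during the v3 → v4 window
doc = {
    "review_id": "R-0042",
    "language": "ja",
    "aspects": {
        "v3": {"art_style": 0.92, "story_pacing": -0.31},          # written by v22
        "v4": {"art_style": 0.92, "ai_art_authenticity": -0.74},   # v4 backfill
    },
}
# The MCP reads aspects.v3.* or aspects.v4.* depending on the SSM read-flag.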

The bilingual aspect-name canonicalisation happens via the label_jp/synonyms_jp mapping. JP reviews mentioning 作画 map to the same internal aspect_id art_style; the sentiment polarity is computed in the JP context but the aspect is named consistently across languages for downstream aggregation.
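A sketch of that lookup, building one alias index from the taxonomy's bilingual labels and synonyms (the function name is ours):

# aspect_canonicaliser.py — bilingual alias → canonical aspect_id
import yaml

def build_alias_index(taxonomy_path: str) -> dict[str, str]:
    with open(taxonomy_path) as f:
        aspects = yaml.safe_load(f)["taxonomy"]["aspects"]
    index: dict[str, str] = {}
    for a in aspects:
        aliases = [a["label_en"], a["label_jp"]]
        aliases += a.get("synonyms_en", []) + a.get("synonyms_jp", [])
        for alias in aliases:
            index[alias.lower()] = a["id"]
    return index

# index["作画"] == index["illustration"] == "art_style"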

3. Multi-Task Training Code

Training runs on g5.12xlarge × 1 node (4 × A10G GPUs, DDP) for ~6 hours on spot, ~$45/run. The training script is train.py:

# train.py
import os
import json
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import (
    AutoModel,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
)
from sklearn.metrics import f1_score
import yaml


class AspectTaxonomy:
    """Parses aspect_taxonomy.yaml and provides ordering + canonical IDs."""

    def __init__(self, taxonomy_path: str):
        with open(taxonomy_path) as f:
            tax = yaml.safe_load(f)["taxonomy"]
        self.version = tax["version"]
        self.aspects = [a for a in tax["aspects"] if a.get("stable", False)]
        self.aspect_ids = [a["id"] for a in self.aspects]
        self.id_to_idx = {aid: i for i, aid in enumerate(self.aspect_ids)}
        self.num_aspects = len(self.aspect_ids)


class MultiTaskDeBERTa(nn.Module):
    """Shared DeBERTa-v3 encoder + sentiment head + per-aspect polarity head.

    Implements primitive §3.2 (multi-task training) and the schema-versioned
    aspect head from US-MLE-03 §LLD-9.
    """

    def __init__(
        self,
        model_name: str = "microsoft/deberta-v3-base",
        num_aspects: int = 18,
        sentiment_classes: int = 3,
        task_dropout: float = 0.15,
    ):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size  # 768

        # Sentiment head: 3-class
        self.sentiment_dropout = nn.Dropout(task_dropout)
        self.sentiment_head = nn.Linear(hidden, sentiment_classes)

        # Aspect head: per-aspect (presence_logit, polarity_score)
        # presence: sigmoid; polarity: tanh
        self.aspect_dropout = nn.Dropout(task_dropout)
        self.aspect_presence = nn.Linear(hidden, num_aspects)
        self.aspect_polarity = nn.Linear(hidden, num_aspects)

        self.num_aspects = num_aspects

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Pooled = mean over masked tokens (better than [CLS] for DeBERTa-v3)
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

        sent_logits = self.sentiment_head(self.sentiment_dropout(pooled))
        asp_pres = self.aspect_presence(self.aspect_dropout(pooled))
        asp_pol = torch.tanh(self.aspect_polarity(pooled))
        return sent_logits, asp_pres, asp_pol


class MultiTaskLoss(nn.Module):
    """Sum of weighted sentiment CE + aspect BCE-with-logits + polarity MSE.

    Weights:
      sentiment: 1.0 (anchor task; never below this weight)
      aspect_presence: 0.7 (multi-label; calibrated on val to balance with sentiment)
      aspect_polarity: 0.5 (only computed where presence label is positive)
    """

    def __init__(self, class_weights_sent: torch.Tensor):
        super().__init__()
        self.ce = nn.CrossEntropyLoss(weight=class_weights_sent)
        self.bce = nn.BCEWithLogitsLoss(reduction="none")
        self.mse = nn.MSELoss(reduction="none")

    def forward(self, outputs, targets):
        sent_logits, asp_pres, asp_pol = outputs
        sent_y, asp_pres_y, asp_pol_y, asp_pol_mask = targets

        loss_sent = self.ce(sent_logits, sent_y)
        loss_pres = self.bce(asp_pres, asp_pres_y).mean()
        # Polarity loss only where presence_y = 1 (mask)
        loss_pol_raw = self.mse(asp_pol, asp_pol_y)
        loss_pol = (loss_pol_raw * asp_pol_mask).sum() / asp_pol_mask.sum().clamp(min=1)

        return loss_sent + 0.7 * loss_pres + 0.5 * loss_pol


def train_one_epoch(model, loader, loss_fn, optimizer, scheduler, device):
    model.train()
    running = 0.0
    for batch in loader:
        input_ids = batch["input_ids"].to(device)
        attn = batch["attention_mask"].to(device)
        sent_y = batch["sentiment"].to(device)
        pres_y = batch["aspect_presence"].to(device)
        pol_y = batch["aspect_polarity"].to(device)
        pol_mask = batch["aspect_polarity_mask"].to(device)

        outputs = model(input_ids, attn)
        loss = loss_fn(outputs, (sent_y, pres_y, pol_y, pol_mask))

        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        running += loss.item()
    return running / max(1, len(loader))


def evaluate(model, loader, taxonomy: AspectTaxonomy, device) -> dict:
    model.eval()
    sent_preds, sent_truth = [], []
    pres_preds, pres_truth = [], []
    with torch.no_grad():
        for batch in loader:
            sent_logits, asp_pres, _ = model(
                batch["input_ids"].to(device),
                batch["attention_mask"].to(device),
            )
            sent_preds.extend(sent_logits.argmax(-1).cpu().tolist())
            sent_truth.extend(batch["sentiment"].cpu().tolist())
            pres_preds.extend((torch.sigmoid(asp_pres) > 0.5).cpu().int().tolist())
            pres_truth.extend(batch["aspect_presence"].cpu().int().tolist())

    sent_macro_f1 = f1_score(sent_truth, sent_preds, average="macro")
    aspect_micro_f1 = f1_score(pres_truth, pres_preds, average="micro")
    per_aspect_f1 = f1_score(pres_truth, pres_preds, average=None)
    return {
        "sentiment_macro_f1": float(sent_macro_f1),
        "aspect_micro_f1": float(aspect_micro_f1),
        "per_aspect_f1": dict(zip(taxonomy.aspect_ids, [float(x) for x in per_aspect_f1])),
    }


def main():
    # Hyperparameters from SageMaker channel
    args = json.loads(os.environ.get("SM_TRAINING_ENV", "{}")).get("hyperparameters", {})
    taxonomy = AspectTaxonomy(args["taxonomy_path"])
    tokenizer = AutoTokenizer.from_pretrained(args.get("model_name", "microsoft/deberta-v3-base"))

    # ABSAReviewDataset (defined alongside train.py; elided here) loads the
    # labeled parquet, tokenizes, and builds the per-task target tensors.
    train_ds = ABSAReviewDataset(
        s3_uri=args["train_uri"], tokenizer=tokenizer, taxonomy=taxonomy, max_len=320
    )
    val_ds = ABSAReviewDataset(
        s3_uri=args["val_uri"], tokenizer=tokenizer, taxonomy=taxonomy, max_len=320
    )

    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    model = MultiTaskDeBERTa(num_aspects=taxonomy.num_aspects).to(device)
    if int(os.environ.get("WORLD_SIZE", "1")) > 1:
        # torch_distributed launcher sets WORLD_SIZE/LOCAL_RANK; init NCCL before wrapping.
        torch.distributed.init_process_group(backend="nccl")
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[local_rank], output_device=local_rank
        )

    # SageMaker passes hyperparameters as strings; parse the JSON list.
    class_weights = torch.tensor(
        json.loads(args["sentiment_class_weights"]), dtype=torch.float32
    ).to(device)
    loss_fn = MultiTaskLoss(class_weights_sent=class_weights)

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
    # Under DDP each rank should wrap train_ds in a DistributedSampler; omitted for brevity.
    train_loader = DataLoader(train_ds, batch_size=32, shuffle=True, num_workers=4)
    val_loader = DataLoader(val_ds, batch_size=64, shuffle=False, num_workers=4)
    total_steps = len(train_loader) * int(args["epochs"])
    scheduler = get_linear_schedule_with_warmup(
        optimizer, int(0.1 * total_steps), total_steps
    )

    best_score = 0.0
    for epoch in range(int(args["epochs"])):
        train_loss = train_one_epoch(model, train_loader, loss_fn, optimizer, scheduler, device)
        metrics = evaluate(model, val_loader, taxonomy, device)
        score = 0.6 * metrics["sentiment_macro_f1"] + 0.4 * metrics["aspect_micro_f1"]
        print(f"epoch={epoch} loss={train_loss:.4f} score={score:.4f} metrics={metrics}")

        # Unwrap DDP before saving so checkpoints load without the 'module.' prefix.
        state = model.module.state_dict() if hasattr(model, "module") else model.state_dict()

        # Checkpoint every epoch (spot reclaim discipline; primitive §3.2)
        ckpt_path = f"/opt/ml/checkpoints/epoch_{epoch}.pt"
        torch.save({"model": state, "metrics": metrics}, ckpt_path)
        if score > best_score:
            best_score = score
            torch.save(
                {"model": state, "metrics": metrics, "taxonomy_version": taxonomy.version},
                "/opt/ml/model/best.pt",
            )

    # Save the tokenizer + taxonomy snapshot alongside model
    tokenizer.save_pretrained("/opt/ml/model/")
    with open("/opt/ml/model/taxonomy_snapshot.yaml", "w") as f:
        yaml.safe_dump({"version": taxonomy.version, "aspects": taxonomy.aspect_ids}, f)


if __name__ == "__main__":
    main()

The SageMaker Pipelines wrapper:

# training_pipeline.py
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.parameters import ParameterString, ParameterInteger
from sagemaker.huggingface import HuggingFaceProcessor
from sagemaker.huggingface.estimator import HuggingFace


def build_absa_pipeline(role: str, region: str = "ap-northeast-1") -> Pipeline:
    label_version = ParameterInteger(name="MaxLabelVersion", default_value=0)
    aspect_version = ParameterInteger(name="AspectVersion", default_value=3)
    taxonomy_uri = ParameterString(
        name="TaxonomyURI",
        default_value="s3://manga-ml-taxonomy-apne1/absa/v3/aspect_taxonomy.yaml",
    )

    estimator = HuggingFace(
        entry_point="train.py",
        source_dir="src/",
        instance_type="ml.g5.12xlarge",
        instance_count=1,
        role=role,
        transformers_version="4.36",
        pytorch_version="2.1",
        py_version="py310",
        use_spot_instances=True,
        max_wait=28800,
        max_run=21600,                              # 6h hard cap
        checkpoint_s3_uri="s3://manga-ml-checkpoints-apne1/absa/",
        checkpoint_local_path="/opt/ml/checkpoints",
        distribution={"torch_distributed": {"enabled": True}},
        hyperparameters={
            "model_name": "microsoft/deberta-v3-base",
            "epochs": 4,
            "taxonomy_path": "/opt/ml/input/data/taxonomy/aspect_taxonomy.yaml",
            "train_uri": "/opt/ml/input/data/train/",
            "val_uri": "/opt/ml/input/data/val/",
            "sentiment_class_weights": "[1.0, 1.4, 1.2]",  # neutral up-weighted, neg too
        },
    )
    train = TrainingStep(name="ABSAMultiTaskTrain", estimator=estimator)
    # ... Stage 1-7 processing steps from the diagram
    return Pipeline(
        name="ABSAQuarterlyRetrain",
        parameters=[label_version, aspect_version, taxonomy_uri],
        steps=[train],  # plus the validation/eval/bias/registration steps
    )

4. Why Async Inference (and Not Real-Time or Plain Batch)

The choice of SageMaker async inference over real-time-endpoint or pure batch transform is deliberate:

| Option | Why ruled out / chosen |
|---|---|
| Real-time endpoint | Ruled out. Per-review latency is not critical — the user submits a review and reads the title page later. Real-time would require provisioning for ~80K/day with peak bursts at ~600/min during volume drops, and would waste money idling between bursts; async absorbs bursts via its SQS-backed queue. |
| Pure batch transform | Ruled out. A nightly-only pipeline means new reviews wait up to 24 hours before appearing in the aspects.* field, breaking the user expectation that review submission affects the title-page summary "soon." |
| Async inference | Chosen. Per-review p95 ≤ 5s; the queue absorbs review-bombing bursts up to ~5K/min without provisioning for peak; scale-to-zero cuts cost during the JP off-peak window (07:00–18:00 JST, when review submission is light); the response payload is written to S3 and consumed by an OpenSearch-update Lambda that updates the document's aspects field. |

On top of async, a nightly batch transform covers the ~5M reviews with edit-events (see §LLD-5). The two modes share the same model artifact but use different SageMaker resources and different scaling profiles.

5. Async Endpoint Configuration

# endpoint_config.py
import boto3
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.async_inference.async_inference_config import AsyncInferenceConfig

def deploy_absa_async_endpoint(model_data_uri: str, version: str, role: str) -> str:
    model = HuggingFaceModel(
        model_data=model_data_uri,
        role=role,
        transformers_version="4.36",
        pytorch_version="2.1",
        py_version="py310",
        env={
            "HF_MODEL_ID": "absa_multitask",
            "TAXONOMY_URI": "s3://manga-ml-taxonomy-apne1/absa/v3/aspect_taxonomy.yaml",
            "MAX_INPUT_LEN": "320",                 # token cap; truncate long reviews
            "MODEL_VERSION": version,
            "PII_REDACTOR_VERSION": "v4",           # defensive re-redaction at inference
        },
    )
    async_cfg = AsyncInferenceConfig(
        output_path=f"s3://manga-ml-async-out-apne1/absa/v{version}/",
        max_concurrent_invocations_per_instance=8,
        notification_config={
            "SuccessTopic": "arn:aws:sns:ap-northeast-1:ACCT:absa-async-success",
            "ErrorTopic":   "arn:aws:sns:ap-northeast-1:ACCT:absa-async-error",
        },
    )
    predictor = model.deploy(
        initial_instance_count=1,
        instance_type="ml.g5.2xlarge",
        endpoint_name=f"absa-async-{version}",
        async_inference_config=async_cfg,
    )

    # Scale-to-zero on JP off-peak; scale up to 4 instances on peak
    autoscaling = boto3.client("application-autoscaling")
    autoscaling.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=f"endpoint/absa-async-{version}/variant/AllTraffic",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=0,                               # async allows 0
        MaxCapacity=4,
    )
    autoscaling.put_scaling_policy(
        PolicyName="ABSAAsyncBacklogTracking",
        ServiceNamespace="sagemaker",
        ResourceId=f"endpoint/absa-async-{version}/variant/AllTraffic",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 5.0,                      # 5 messages per instance backlog
            "CustomizedMetricSpecification": {
                "MetricName": "ApproximateBacklogSizePerInstance",
                "Namespace": "AWS/SageMaker",
                "Statistic": "Average",
            },
            "ScaleInCooldown": 300,
            "ScaleOutCooldown": 60,
        },
    )
    return predictor.endpoint_name

6. Nightly Batch Transform (5M Reviews)

# nightly_batch.py
from datetime import datetime

from sagemaker.transformer import Transformer

def run_nightly_batch(model_name: str, input_uri: str, output_uri: str):
    """Re-scores reviews that had an edit_event in the last 24h.

    Input: ~5M reviews/night sharded into 200 input files of ~25K reviews each.
    Compute: g5.12xlarge x 4 nodes; ~2.5h wall clock.
    """
    transformer = Transformer(
        model_name=model_name,
        instance_count=4,
        instance_type="ml.g5.12xlarge",
        strategy="MultiRecord",
        max_concurrent_transforms=8,
        max_payload=6,
        output_path=output_uri,
        assemble_with="Line",
        accept="application/jsonlines",
        env={"TAXONOMY_URI": "s3://manga-ml-taxonomy-apne1/absa/v3/aspect_taxonomy.yaml"},
    )
    transformer.transform(
        data=input_uri,
        content_type="application/jsonlines",
        split_type="Line",
        job_name=f"absa-nightly-{datetime.utcnow():%Y%m%d}",
    )
    transformer.wait()

The job is scheduled by EventBridge at 02:00 JST (after US-MLE-07's Friday spam retrain; on other days the spam-retrain check is a no-op). Output is a JSONL stream consumed by an OpenSearch-bulk Lambda that updates each review's aspects.v3.* field; the Lambda also recomputes the per-volume aggregate DynamoDB items consumed by get_volume_reception.
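A sketch of the bulk-update payload that Lambda constructs; the index name and record shape are assumed from the data-contracts table:

# os_bulk_update.py — building the OpenSearch _bulk body from batch output
import json

def to_bulk_body(predictions: list[dict], aspect_version: int = 3) -> str:
    """One JSONL prediction record → one partial-update action per review."""
    lines = []
    for p in predictions:
        lines.append(json.dumps(
            {"update": {"_index": "review-corpus", "_id": p["review_id"]}}
        ))
        # Partial doc update merges into the versioned sub-field only.
        lines.append(json.dumps(
            {"doc": {"aspects": {f"v{aspect_version}": p["aspects"]}}}
        ))
    return "\n".join(lines) + "\n"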

7. Quarterly Vendor Relabel — the 5K Loop

Every quarter, the annotation vendor (Appen primary; Sama for cross-vendor IAA) labels 5,000 reviews:

| Bucket | % of 5K | Selection |
|---|---|---|
| Active learning — high model uncertainty | 40% (2,000) | Predictions with sentiment-confidence < 0.6 OR aspect-presence-prob in [0.4, 0.6] |
| High-impact — popular titles | 25% (1,250) | Reviews on titles in the top-1,000 of the catalog by 90-day query volume |
| Random for distribution coverage | 20% (1,000) | Stratified sample by language × genre × format |
| Drift-flagged reviews | 10% (500) | Reviews flagged by the input-drift detector as far from the training centroid |
| Aspect-emergence candidates | 5% (250) | Reviews flagged by the uncovered-aspect detector (§LLD-9) |

The 40/25/20/10/5 mix mirrors the 50/30/20 active-learning mix from the domain-shift drift counterpart, adapted for ABSA. Cross-vendor IAA is computed on a 10% shared subset (500 reviews labeled by both Appen and Sama). Any aspect with cross-vendor κ < 0.65 EN or < 0.62 JP is flagged for guideline clarification before the labels enter training.

LLM-distilled labels (Claude Sonnet) supply ~50K/quarter for data augmentation. The capping rule from the foundations doc applies: LLM labels are at most 25% of any class's training examples. The LLM-distill prompt is templated and versioned; quarterly precision audit on a 500-review vendor sample must hit ≥ 0.90 or the LLM labels for that quarter are excluded.
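A sketch of the per-class cap, assuming each example record carries source and sentiment fields (names ours):

# llm_label_cap.py — enforce the ≤25%-per-class LLM-distill cap
import random
from collections import Counter

def cap_llm_labels(examples: list[dict], cap: float = 0.25, seed: int = 7) -> list[dict]:
    rng = random.Random(seed)
    out = [e for e in examples if e["source"] != "llm_distill"]
    vendor = Counter(e["sentiment"] for e in out)
    llm = [e for e in examples if e["source"] == "llm_distill"]
    rng.shuffle(llm)
    taken: Counter = Counter()
    for e in llm:
        cls = e["sentiment"]
        # llm ≤ cap/(1-cap) × vendor  ⇒  llm share of the class ≤ cap
        if taken[cls] < int(cap / (1 - cap) * vendor[cls]):
            out.append(e)
            taken[cls] += 1
    return out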

8. Bilingual JP/EN Handling

Four concerns specific to JP/EN:

Aspect lexicon canonicalisation. The taxonomy stores label_jp and synonyms_jp per aspect (§LLD-2). At label time, JP reviews are annotated against the JP labels (作画 → art_style, ストーリー → story_pacing); the canonical aspect_id is the same English ID, so downstream aggregation is bilingual-consistent. This is the platform-shared aspect ontology — US-MLE-06 recommendation reads aspects.v3.art_style regardless of the original review language.

Tokenization. DeBERTa-v3-base uses SentencePiece, which handles JP better than DistilBERT-multilingual's WordPiece (US-MLE-01's pain point). No external pre-tokenizer is needed for sentiment, but ABSA span detection benefits from a Sudachi pre-pass that surfaces aspect-bearing noun phrases — this is a feature engineered into the training data (aspect_anchor_tokens), not a serving-time external call. Inference input is capped at 320 tokens; longer reviews are truncated with the sentiment-bearing tail preserved (the closing summary line of a review is usually higher-signal than the opening setup).

Honorifics and politeness. Same as US-MLE-01's normalisation — 〜していただけますでしょうか style polite negative (rare, often satirical) versus クソ direct negative (common, blunt) — both encode the same sentiment but the formal register is rare in training data. The data-augmentation script generates politeness-shifted versions for sentiment-balanced training.

Ironic positives. "尊い" ("precious/holy") is used approvingly for emotionally devastating chapters — the JP analogue of "this kills me" / "no thoughts head empty" from the domain-shift counterpart. The model handles it with mixed success; the abstain path (§LLD-11) catches the worst cases.

9. Aspect-Schema Migration: v3 → v4 Without Retraining the 18 Stable Aspects

The architectural lever that makes schema bumps cheap is head-modular re-training. The shared encoder + sentiment head + aspect head structure means that adding a new aspect (e.g., ai_art_authenticity) is a delta operation:

flowchart LR
    V3[v3 model<br/>encoder + sent_head + aspect_head 18-dim]
    DELTA[Delta training:<br/>freeze encoder<br/>freeze sentiment_head<br/>freeze 18 stable aspect dims<br/>train ONLY new aspect dims]
    V4[v4 model<br/>encoder + sent_head + aspect_head 20-dim]

    V3 --> DELTA --> V4

    style V3 fill:#9cf,stroke:#333
    style DELTA fill:#fde68a,stroke:#92400e
    style V4 fill:#2d8,stroke:#333

Concretely:

  1. Content-ops + ML-eng promote 2 candidate aspects (e.g., ai_art_authenticity, vertical_scroll_ux) to taxonomy v4 with effective_from=2026-07-01.
  2. Differential re-labelling per the aspect-emergence counterpart §"How do you keep re-labeling cost bounded?": only re-label reviews where keyword heuristics suggest the new aspect ("AI", "AI仕上げ", "AI-touched" for ai_art_authenticity; "scroll", "スクロール" for vertical_scroll_ux). Roughly 8K reviews re-labelled per new aspect.
  3. Build the v4 model: load v3 weights, expand the aspect_presence and aspect_polarity linear layers from (768 → 18) to (768 → 20) by appending fresh random rows for the 2 new aspects. Freeze all other parameters (see the sketch after this list).
  4. Train for 1 epoch on the delta-labelled set (~16K examples) with a higher learning rate (5e-5) on the new rows only. ~30 minutes on g5.2xlarge, ~$3.

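A minimal sketch of steps 3–4; the helper names are ours, and in a real run the optimizer should use weight_decay=0 (or exclude the stable rows) so frozen rows stay bit-identical:

# head_expand.py — grow the aspect heads 18 → 20 dims, train only the new rows
import torch
import torch.nn as nn

def expand_head(old: nn.Linear, num_new: int) -> nn.Linear:
    """Append num_new randomly-initialised output rows to a linear head."""
    new = nn.Linear(old.in_features, old.out_features + num_new)
    with torch.no_grad():
        new.weight[: old.out_features] = old.weight  # copy the 18 stable rows
        new.bias[: old.out_features] = old.bias
    return new

def train_only_new_rows(head: nn.Linear, num_stable: int) -> None:
    """Zero the stable rows' gradients so their predictions never change."""
    def zero_stable(grad: torch.Tensor) -> torch.Tensor:
        grad = grad.clone()
        grad[:num_stable] = 0.0
        return grad
    head.weight.register_hook(zero_stable)
    head.bias.register_hook(zero_stable)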
This produces v4 with the 18 stable aspects' F1 unchanged (frozen weights → identical predictions) and the 2 new aspects' F1 measured against v4-only labels per the aspect-emergence counterpart's per-aspect-version F1 rule. A full re-train (full encoder unfrozen, all 20 aspects) runs every 2 schema-bumps to prevent the encoder drifting away from the new aspect distribution.

The OpenSearch index keeps aspects.v3.* and aspects.v4.* as parallel sub-fields during the migration window. The MCP read-flag is flipped via SSM Parameter Store after the v4 backfill completes (~3 hours batch-transform on the delta-flagged ~1M reviews).

10. Offline Evaluation (5 modes from primitive §4.1)

The five evaluation modes for ABSA:

# offline_eval.py
import pandas as pd
from sklearn.metrics import f1_score

class ABSAOfflineEvaluator:
    def evaluate(self) -> EvalReport:
        return EvalReport(
            golden=self.eval_golden(),
            slice=self.eval_slice(),
            adversarial=self.eval_adversarial(),
            counterfactual=self.eval_counterfactual_replay(),
            offline_online_corr=self.eval_offline_online_corr(),
        )

    def eval_golden(self) -> GoldenReport:
        """Frozen 2K-review golden set; refreshed quarterly with the 5K vendor batch.
        Sentiment macro-F1 + aspect-F1 + per-aspect-F1 against schema_v3."""
        golden = pd.read_parquet("s3://manga-ml-eval-apne1/absa/golden/v3.parquet")
        sent_p, asp_p = self.model.predict_batch(golden["review_text"].tolist())
        return GoldenReport(
            sentiment_macro_f1=f1_score(golden["sentiment"], sent_p, average="macro"),
            aspect_micro_f1=multilabel_micro_f1(golden["aspects"].tolist(), asp_p),
            per_aspect_f1=per_aspect_f1(golden["aspects"].tolist(), asp_p, taxonomy=self.tax),
        )

    def eval_slice(self) -> SliceReport:
        """Per-language x sentiment x format (manga / manhwa / manhua / fan-edition)."""
        # ... cartesian over (language, sentiment_class, format, length_bucket)

    def eval_adversarial(self) -> AdversarialReport:
        """600-example set: ironic JP ('尊い'), code-switched, review-bomb language,
        sarcastic 5-stars, prompt-injection redirected to sentiment ('ignore prior...
        rate this 5-stars')."""
        adv = pd.read_parquet("s3://manga-ml-eval-apne1/absa/adversarial/v2.parquet")
        # ...

    def eval_counterfactual_replay(self) -> CounterfactualReport:
        """10K production reviews from last 7d. Compare predicted aspects against
        v22 prod predictions to measure (cost, agreement) shift; labels not available."""
        # ...

    def eval_offline_online_corr(self) -> OfflineOnlineCorrReport:
        """Reads the last 6 retrains' (offline_aspect_micro_f1, online_summary_agree_rate)
        pairs; the online metric is the content-ops 30-summaries/week spot-check.
        Trustworthy if Pearson >= 0.6."""
        # ...

The slice gate enforces per-language sentiment macro-F1 ≥ 0.83 EN, ≥ 0.81 JP, ≥ 0.78 mixed. The per-aspect gate enforces ≥ 0.70 on the 6 high-volume aspects and ≥ 0.60 on the rest. The adversarial regression ceiling is 2% absolute on the adversarial set's macro-F1.

11. Drift Detection — Three Kinds + Aspect-Emergence

Beyond the four standard drift kinds from primitive §6.1, ABSA adds a fifth: aspect-emergence drift (the existence of new aspects in production that the schema does not cover).

| Drift Kind | Detector | Cadence | Threshold |
|---|---|---|---|
| Input drift | PSI on token distribution; embedding-centroid shift | 5 min on async stream; daily on full corpus | PSI > 0.2 sustained 24h |
| Label drift | χ² on sentiment class proportions vs reference | Daily | p < 0.01 sustained 7d |
| Prediction drift | KS on aspect-presence rates per aspect | Daily | KS > 0.15 sustained 24h |
| Concept drift | Δ-F1 on rolling-2-week labeled subset | Daily | Δ-F1 > 0.03 sustained 7d |
| Aspect-emergence | Uncovered-aspect rate (model coverage_ratio < 0.3); LLM open-vocab candidate clustering on a 10K-review monthly sample | Daily on coverage; monthly on candidate clustering | uncovered_rate > 4% sustained 14d |
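For the input-drift row, a minimal PSI sketch over pre-binned token frequencies (the binning itself is elided):

# psi.py — population stability index between reference and live distributions
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, eps: float = 1e-6) -> float:
    """PSI over identical bins; > 0.2 sustained 24h alarms per the table above."""
    e = expected / expected.sum() + eps
    a = actual / actual.sum() + eps
    return float(np.sum((a - e) * np.log(a / e)))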

The aspect-emergence detector implements the aspect-emergence counterpart's coverage-ratio path:

# aspect_emergence_detector.py
# Helpers (sample_reviews, batch_predict_with_coverage, cw_metric, sns,
# sustained_for_days, top_uncovered_phrases) come from the drift-hub library; elided.
def daily_coverage_check(date: str):
    sample = sample_reviews(date, n=20_000, stratified_by="format")
    spans, hints = batch_predict_with_coverage(sample)
    uncovered_rate = (
        sum(1 for h in hints if h is not None and h.coverage_ratio < 0.3) / len(sample)
    )
    cw_metric.put("UncoveredAspectRate", uncovered_rate, dimensions={"format": ...})
    if uncovered_rate > 0.04 and sustained_for_days(14):
        sns.publish(
            TopicArn="arn:...:absa-aspect-emergence",
            Message=json.dumps({"rate": uncovered_rate, "sample_phrases": top_uncovered_phrases(hints)}),
        )

The SNS message triggers a planning-level schema-bump cycle (not an automated retrain); content-ops + ML-eng convene, run the LLM open-vocab discovery on a 10K sample, and decide whether to promote candidates to v4. This explicitly ties to aspect-emergence drift counterpart: the trigger is in code; the response is human-in-the-loop.

12. Retraining Trigger Logic

Three retrain triggers:

  • Quarterly scheduled: EventBridge fires on the first Monday of each quarter. Pulls the new 5K vendor labels + accumulated LLM-distill (capped) and re-trains. Full unfreeze.
  • Aspect-emergence triggered: SNS topic absa-aspect-emergence fires when the uncovered-aspect rate exceeds 4% sustained 14 days. Triggers a delta retrain after the human review confirms new aspects to promote (§LLD-9).
  • Concept-drift triggered: SNS fires when Δ-F1 on the rolling-2-week labeled subset exceeds 0.03 sustained 7 days. Forces a quarterly retrain ahead of schedule.

Same check_promotion_eligibility gate as US-MLE-01 (global_ml_freeze, per-model promotion flag, no-concurrent-retrains).
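A sketch of that shared gate, assuming the SSM parameter names from the kill-switch section (the concurrency lock is elided):

# promotion_gate.py — shared eligibility check (parameter names assumed)
import boto3

ssm = boto3.client("ssm", region_name="ap-northeast-1")

def _flag(name: str) -> str:
    return ssm.get_parameter(Name=name)["Parameter"]["Value"]

def check_promotion_eligibility(model: str = "absa") -> bool:
    if _flag("/manga-ml/global_ml_freeze") == "true":
        return False  # global freeze beats everything
    if _flag(f"/manga-ml/{model}/promotion_enabled") != "true":
        return False  # per-model promotion flag
    # no-concurrent-retrains: acquire a lock (e.g. DynamoDB conditional write); elided
    return True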

13. Cross-Story Dependency: Reading the Spam-Clean Corpus

Per the README dependency graph, this story reads the spam-cleaned review corpus from US-MLE-07 — never the raw stream. The contract:

  • US-MLE-07's weekly retrain produces reviews_clean Iceberg snapshot every Friday 06:00 JST.
  • This story's quarterly training reads reviews_clean at the latest snapshot ≥ 7 days old (to give US-MLE-07 a buffer to roll back if its weekly retrain fails its own promotion gate).
  • This story's nightly batch transform reads reviews_clean at the most-recent snapshot.
  • A US-MLE-07 promotion-rollback (spam classifier rolled back) automatically re-triggers this story's drift hub to flag potential training-data poisoning; if the rollback affects > 5% of the last 24h's reviews, the nightly batch is delayed and re-run after reviews_clean is re-snapshotted.

This is the cross-story coordination that the README flagged as a sequencing requirement.
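The training-side read can be expressed as Iceberg time travel; a sketch assuming the table is queryable through Athena (database/table names assumed):

# snapshot_pinning.py — pin training reads to a ≥7-day-old snapshot
from datetime import datetime, timedelta, timezone

def training_corpus_query() -> str:
    """Athena time-travel query against the Iceberg reviews_clean table."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=7)
    return (
        "SELECT review_id, review_text, language "
        "FROM manga_ml.reviews_clean "
        f"FOR TIMESTAMP AS OF TIMESTAMP '{cutoff:%Y-%m-%d %H:%M:%S} UTC'"
    )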


Monitoring & Metrics

| Category | Metric | Target | Alarm Threshold |
|---|---|---|---|
| Online — Async | p50 latency | ≤ 1.5 s | > 3 s for 5 min |
| | p95 latency | ≤ 5 s | > 8 s for 5 min |
| | Backlog size | < 100 messages | > 500 sustained 10 min |
| | Error rate | < 0.1% | > 1% for 5 min |
| Online — Batch | Nightly wall-clock | ≤ 4 h | > 6 h |
| | Per-review cost | ≤ $0.00012 | > $0.0002 |
| Quality — Sentiment | Macro-F1 (3-class) | ≥ 0.85 | < 0.80 quarterly |
| | Per-language EN / JP / mixed | ≥ 0.83 / 0.81 / 0.78 | drop > 0.03 |
| Quality — Aspect | Aspect-F1 micro | ≥ 0.74 | < 0.70 quarterly |
| | Per-aspect F1 (top-6) | ≥ 0.70 each | < 0.65 any |
| | Per-aspect F1 (rest) | ≥ 0.60 each | < 0.55 any |
| Quality — MCP feedback | Content-ops 30-summary/wk agreement | ≥ 90% | < 85% over 4 weeks |
| Drift | PSI top-30 tokens | < 0.2 | > 0.2 sustained 24h |
| | Aspect-presence KS per aspect | < 0.15 | > 0.15 sustained 24h |
| | Uncovered-aspect rate | < 4% | > 4% sustained 14d → SNS |
| | Δ-F1 vs reference | < 0.03 | > 0.03 sustained 7d |
| Cost | Quarterly training $ | ≤ $50 | > $80 |
| | Async serve $/1K | ≤ $0.18 | > $0.30 |
| | Batch nightly $ | ≤ $90 | > $150 |
| Pipeline | Quarterly retrain success rate | ≥ 95% | < 90% (1y rolling) |
| PII | PII-leak audit (1K samples) | 0 unredacted spans | any positive |

Risks & Mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| Aspect schema falls behind cultural emergence | MCP summaries miss what readers actually say (per aspect-emergence counterpart) | Quarterly aspect-discovery loop; uncovered-aspect detector; SNS trigger; head-modular delta retrain |
| LLM-distill labels concentrate Sonnet's biases | Training set inherits LLM failure modes (irony blindness) | 25% per-class cap; quarterly precision audit ≥ 0.90 vs vendor; cap on JP-specific irony aspects |
| JP politeness register under-represented | JP polite-negative reviews mis-classified as neutral | Augmentation generates politeness-shifted versions; per-register slice gate |
| Cross-vendor IAA collapses mid-quarter | Subtle quality regression in the next training cycle | IAA computed on 10% shared subset; per-aspect κ tracked; vendor billed for re-annotation if κ < contractual floor |
| Spot-instance reclaim during 6h DDP run | Wasted compute; missed quarterly window | Checkpoint every epoch + every 500 steps to S3; max_wait=28800s; on-demand fallback if cumulative reclaim > 90 min |
| Aspect-schema bump breaks OpenSearch read consumers | MCP returns empty aspects.v3.* while v4 backfills | Parallel aspects.v3.* + aspects.v4.* sub-fields; SSM read-flag flips after backfill completes; rollback by flipping back |
| PII redactor regresses on JP addresses | Privacy incident | Quarterly audit re-runs the latest redactor on a 1% sample; any new spans found trigger a full re-redaction batch |
| Nightly batch overlaps with US-MLE-07 weekly retrain | Race condition on the reviews_clean snapshot | Cross-story snapshot pinning: training reads a snapshot ≥ 7 days old; nightly batch reads the latest |
| Review-bombing burst saturates async queue | Backlog grows; new reviews delayed > 10 min | Auto-scale to 4 instances on backlog > 5/instance; review-bomb anomaly detector pauses ingest; SEV-3 page |
| Aspect-emergence false positive (LLM hallucinates an aspect) | Wasted relabel budget | Two-filter discipline: frequency threshold across N reviews + lightweight semantic filter; humans validate all candidates |
| Schema mismatch between training and serving | Predictions written to the wrong sub-field | taxonomy_snapshot.yaml saved alongside the model; serving handler refuses to start if the endpoint's TAXONOMY_URI version != the model's snapshot |

Deep Dive — Why This Works at Amazon-Scale on the Manga Workload

The MangaAssist ABSA model is medium-sized (140M params) and serves an enormous corpus (50M cumulative reviews; ~5M re-scored nightly; ~80K new/day). Five workload properties make this design the right shape rather than the obvious "fine-tune BERT and call it done":

Workload property 1: aspect schema is alive, not constant. The catalog grows (manhwa, manhua, fan-editions, AI-art controversies); reader concerns evolve (vertical_scroll_ux was not an aspect in 2024; ai_art_authenticity was not an aspect in 2025). A frozen-taxonomy ABSA model is structurally blind to emergence — held-out F1 cannot detect missing aspects by construction. The discovery loop + uncovered-aspect detector + head-modular delta retrain is the only way to keep up without re-training the whole encoder every quarter. This is the single biggest divergence from US-MLE-01: intent has 14 stable classes; ABSA has 18+ that grow.

Workload property 2: latency budget is asymmetric across modes. New-review async tolerates ~5s p95; nightly batch tolerates ~4h wall-clock; the hot-path MCP read tolerates ~50ms (it reads pre-computed scores from OpenSearch). The choice of async + batch (not real-time) follows directly: each mode is provisioned for its actual latency budget, and the model is the same artifact. A naïve "deploy everything as real-time" would over-provision by 10× during JP off-peak.

Workload property 3: the 50M corpus is the asset, not the model. The OpenSearch review-corpus index with aspects.v3.* pre-computed is the production surface. The model exists to populate that index. This inverts the usual "model = product" framing — here, the model is a data-processing function that runs every 24 hours over millions of records. Every architectural decision (head-modular schema migration, parallel sub-field indexing, backfill orchestration) is in service of keeping the index fresh and consistent without query-time cost.

Workload property 4: bilingual aspect canonicalisation is a platform contract. The art_style aspect_id is the same whether the review was JP (作画) or EN (art_style). Downstream consumers (US-MLE-06 recommendation, the MCP's compare_sentiment tool) aggregate across languages and require consistent IDs. The taxonomy.yaml's bilingual aliases are not ergonomic sugar — they are the contract that makes cross-language summaries possible. Skipping this would force every consumer to re-implement aspect canonicalisation, a distributed-monolith failure pattern.

Workload property 5: spam-clean corpus is a hard upstream dependency. ABSA quality is bounded by spam recall. If US-MLE-07 misses 5% of spam, that 5% poisons training (spam reviews tend to be extreme-positive) and skews the polarity distribution toward false positive. The cross-story snapshot-pinning (read US-MLE-07's snapshot ≥ 7 days old for training; latest for nightly batch) is the structural protection that makes this story's quality contract independent of US-MLE-07's current quality and dependent only on US-MLE-07's promoted quality.

These five properties together explain why the head-modular schema migration + async/batch dual-mode + bilingual canonicalisation + cross-story snapshot pinning machinery exists. A naïve quarterly retrain over a frozen taxonomy with real-time serving would silently regress on emerging aspects, over-provision compute by 10×, and inherit US-MLE-07's transient quality bugs as training poisoning.


Real-World Validation

Industry analogues. Amazon's product-review aspect extraction team (the team behind the "Product features mentioned" section of product detail pages) uses an analogous multi-task architecture with a versioned aspect taxonomy and head-modular retraining for new aspects. Their published 2024 architecture overview describes the same uncovered-aspect surfacing pattern (a phrase like "buyers also discussed [phrases]") under a different name. Yelp's review-aspect team published a 2023 paper describing aspect-schema versioning with effective_from semantics very close to this story's aspect_taxonomy.yaml. Booking.com's review-NLP team uses a Sonnet-distilled augmentation loop with similar capping rules (their cap is 30%; this story's is 25%, slightly stricter because manga reviews are more domain-specific than hotel reviews).

Math validation — async serving cost. On ml.g5.2xlarge at $1.515/hr in ap-northeast-1, with auto-scale 0–4 averaging 1.2 instances: $1.515 × 1.2 × 24 × 30 = $1,309/month. Per-review cost: $1,309 / (80,000 × 30) = $0.000545 per review, or $0.55 per 1K reviews. On its own this is above the $0.18/1K serve target; the budget is assessed on async + batch combined (next paragraph).

Math validation — nightly batch cost. g5.12xlarge × 4 nodes at $5.672/hr × 4 × 2.5h × 30 nights/month = $1,701/month. Per-review batch cost: $1,701 / (5,000,000 × 30) = $0.0000113 per review, or $0.011 per 1K. Combined async + batch: $3,010/month total, against ~50M cumulative reviews touched/month → $0.060 per 1K reviews touched. Fits the target.

Math validation — training cost. g5.12xlarge at $5.672/hr × 6h spot (~70% spot discount) = $5.672 × 6 × 0.30 = $10.21/run + ~$35 in spot-reclaim recovery overhead → ~$45/run quarterly. Yearly: $180. Immaterial against the $36K/year combined async+batch.

Math validation — label volume. 5K vendor + 50K LLM-distill (capped to 25%) per quarter → effective training-set growth = 5K + min(50K, 0.25 × 150K) = 5K + 37.5K = ~42K per quarter, or ~28% per-quarter cumulative-set turnover. This is a healthy refresh rate for a quarterly cumulative-window model — it sees enough new signal to track drift without overwhelming the prior. The 18-month exponential decay τ on training weights means a 36-month-old label carries weight e⁻² ≈ 0.135, which gives the model effective "cumulative with forgetting" semantics.

Math validation — schema-migration delta retrain. Adding 2 new aspects: ~16K delta-labeled examples, 1 epoch on g5.2xlarge (1× A10G), batch size 32, ~500 steps, ~30 min × $1.515/hr = $0.76 + spot-reclaim buffer ~$3 total. Compared to a full retrain at $45, the delta is 15× cheaper, which is what makes the head-modular pattern operationally viable for quarterly schema bumps.


Cross-Story Interactions

| Edge | What Flows | Contract | Conflict Mode |
|---|---|---|---|
| US-MLE-07 (spam) → US-MLE-03 | reviews_clean Iceberg snapshot | This story trains on a snapshot ≥ 7 days old; reads the latest for nightly batch | If US-MLE-07's promotion rolls back affecting > 5% of the last 24h, this story delays the nightly batch |
| US-MLE-03 → US-MLE-06 (recommendation) | Aspect scores as features | aspects.v3.* bilingual-canonical IDs; pinned to absa_model_v22 in the cache key | If US-MLE-03 promotes v23 mid-week, US-MLE-06 reads stale v22 scores until the OpenSearch backfill completes; tolerable because aspect drift is slow |
| US-MLE-03 → RAG-MCP-Integration (Review-Sentiment MCP) | aspects.v3.* pre-computed in OpenSearch; per-volume aggregates in DynamoDB | MCP reads aspects.<active_version>.* controlled by the SSM flag | Schema migration v3 → v4: parallel sub-fields populated; flag flip is atomic |
| US-MLE-03 ↔ US-MLE-01 (intent) | Both consume cross-vendor IAA infra; both report bilingual slice metrics | IAA platform contract; per-language slice gates | None (parallel users of the same platform) |
| US-MLE-03 ← Ground-Truth-Evolution ML-02 (sentiment domain shift) | Drift detection patterns | Drift-eval set, regression set, active-learning 50/30/20 mix adapted to 40/25/20/10/5 | The drift counterpart is the "what breaks over time" lens for this story |
| US-MLE-03 ← Ground-Truth-Evolution ML-06 (aspect emergence) | Schema versioning + uncovered-aspect path | aspect_taxonomy.yaml with effective_from; uncovered-aspect detector | The aspect-emergence counterpart is the trigger source for the v3 → v4 migration |
| US-MLE-03 → Cost-Optimization US-04 (compute) | Async + batch SageMaker compute | Scale-to-zero on async; spot for batch + training | Async backlog SLA must hold even when the scale-to-zero policy is more aggressive |

Rollback & Experimentation

Shadow Mode Plan

  • Duration: 72 hours minimum on async (3× longer than US-MLE-01 because per-review aspect predictions are multi-label and noisier; need 3× more samples for power).
  • Sample size: ~80K/day async + ~5M/night batch over 3 days ≈ 15M shadow predictions; statistical power is overwhelming.
  • Pass criteria: per-aspect agreement with v22 between 60% and 90% (multi-label; lower than sentiment's 70-90% range); p99 async latency ≤ 1.3× v22; batch-transform wall-clock ≤ 1.2× v22.
  • Slice criteria: per-language agreement within ±5pp of global; per-format agreement (manga / manhwa / manhua) within ±7pp; uncovered-aspect rate ≤ v22's rate (regression here means the new model is worse at coverage, not better).

Canary Thresholds

  • Phase A (10% async, 3 days): per-aspect accuracy on the running fresh-label sample ≥ v22 baseline; sentiment macro-F1 not regressed.
  • Phase B (10% async + 5% batch, 4 days): same as Phase A plus content-ops 30-summary spot-check showing ≥ 90% agreement; per-volume aggregate DynamoDB items show no abrupt shifts (KS on per-volume avg sentiment ≤ 0.10).
  • Phase C (full promote with backfill): after the 5M-review backfill completes, MCP read-flag flips; v22's aspects.v3.* retained 30 days for rollback.

Kill-Switch Flags

  • absa_promotion_enabled (default: false; SSM Parameter Store /manga-ml/absa/promotion_enabled).
  • absa_async_paused — when true, async endpoint queue drains but no new invocations; new reviews stack in SQS DLQ for replay after unpause.
  • absa_batch_paused — when true, nightly batch skipped; OpenSearch aspects.* field staleness grows.
  • absa_taxonomy_version_active — SSM string; controls which aspects.vX.* sub-field the MCP reads. Atomic flip during v3 → v4 migration.
  • global_ml_freeze — overrides all of the above. Per README kill-switch precedence.
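A sketch of the precedence check, assuming boto3 against SSM Parameter Store. Only /manga-ml/absa/promotion_enabled is a documented path; the global-freeze path below is a guess at the same naming convention:

```python
import boto3

ssm = boto3.client("ssm", region_name="ap-northeast-1")

def flag(name: str) -> bool:
    return ssm.get_parameter(Name=name)["Parameter"]["Value"] == "true"

def absa_promotion_allowed() -> bool:
    # global_ml_freeze wins over every story-level flag, per README precedence.
    # The path below is hypothetical; only the absa/* path is documented.
    if flag("/manga-ml/global_ml_freeze"):
        return False
    return flag("/manga-ml/absa/promotion_enabled")
```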

Quality Regression Criteria (Hard Rollback)

A canary that trips any of these criteria is rolled back automatically:

  • Sentiment macro-F1 on the running fresh sample regresses by > 0.03 absolute on any 24h window.
  • Per-aspect F1 on any of the top-6 aspects regresses by > 0.05 absolute on any 24h window.
  • Async p99 latency exceeds 8s for any 1h window.
  • Async error rate exceeds 1% for any 5min window.
  • Batch-transform fails (incomplete output) for any single nightly run.
  • Content-ops 30-summary agreement drops below 80% for any 1-week window.
  • PII-leak audit detects any unredacted spans in 1K-sample inference output.

Rollback path: SageMaker async endpoint update (model package v22) ≤ 5 minutes; the nightly batch job ARN is flipped back to v22 in EventBridge; the MCP read-flag stays on aspects.v3.* if v23 was a same-schema candidate, or flips back from aspects.v4.* to aspects.v3.* if v23 was a v4-schema candidate.
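The async half of that rollback is a single endpoint-config swap; a sketch assuming boto3 and hypothetical endpoint/config names (the EventBridge ARN flip and MCP flag flip are omitted):

```python
import boto3

sm = boto3.client("sagemaker", region_name="ap-northeast-1")

def rollback_async(endpoint: str = "absa-async-prod",
                   v22_config: str = "absa-async-v22-config") -> None:
    """Repoint the async endpoint at the v22 endpoint config.

    update_endpoint swaps the backing variant without dropping in-flight
    async invocations -- comfortably inside the 5-minute rollback budget.
    """
    sm.update_endpoint(EndpointName=endpoint, EndpointConfigName=v22_config)
```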


Multi-Reviewer Validation Findings & Resolutions

S1 — Must Fix Before Production

ML Scientist lens: The multi-task loss weighting (1.0 / 0.7 / 0.5) is an opening guess, not a calibrated value. With sentiment as the anchor task and aspect as multi-label sigmoid, the gradient magnitudes are not naturally comparable — a fixed weighting will favour whichever head's gradients happen to be larger. Resolution: Implement GradNorm-style adaptive task weighting that rebalances each epoch based on per-head loss-decrease rate; treat the (1.0, 0.7, 0.5) values as initial conditions, not constants. Documented in train.py, with the GradNorm implementation behind an --adaptive-weights flag (default: true).
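An epoch-level sketch of that rebalancing on per-head loss-decrease rate. The helper and the third head name are illustrative (the doc gives three weights but names only the sentiment and aspect heads); the real logic lives behind train.py's --adaptive-weights flag:

```python
import numpy as np

ALPHA = 1.0  # rebalancing strength (GradNorm's asymmetry hyperparameter)
W_INIT = {"sentiment": 1.0, "aspect": 0.7, "aux": 0.5}  # initial conditions, not constants
# "aux" is a placeholder for the third head, which the doc does not name.

def rebalance(w_init: dict, loss_0: dict, loss_t: dict, alpha: float = ALPHA) -> dict:
    """GradNorm-style epoch rebalance on per-head loss-decrease rate.

    A head whose loss is falling slowly (high L_t / L_0) is training slowly
    relative to its peers and gets upweighted. Total weight mass is held
    constant so the effective learning rate stays comparable across epochs.
    """
    heads = list(w_init)
    inv_rate = np.array([loss_t[h] / loss_0[h] for h in heads])  # inverse training rate
    rel = inv_rate / inv_rate.mean()                             # relative rate r_i
    w = np.array([w_init[h] for h in heads]) * rel ** alpha
    w *= sum(w_init.values()) / w.sum()                          # renormalize mass
    return dict(zip(heads, w))
```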

Application Security / Privacy lens: PII redaction at ingestion is good, but training-set materialization re-reads from reviews_clean and trusts that redaction was already applied. If the redactor was buggy when a review was ingested, the bug propagates into training. Resolution: Defensive re-redaction at training-set load using the latest redactor version; a quarterly audit re-runs the latest redactor on the full corpus and flags any reviews with newly found spans, triggering a surgical re-write.
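A sketch of that defensive re-redaction at load time. redact_pii, REDACTOR_VERSION, and the parquet layout are hypothetical stand-ins for the in-house redactor and the reviews_clean snapshot reader:

```python
from datasets import load_dataset  # Hugging Face datasets; any Arrow reader works
from redaction import redact_pii, REDACTOR_VERSION  # hypothetical in-house module

def load_training_split(snapshot_files: str):
    """Re-run the *latest* redactor at load, trusting nothing upstream.

    reviews_clean rows were redacted at ingestion by whichever redactor
    version was live then; re-applying the current version here stops a
    historical redactor bug from leaking PII into this training run.
    """
    ds = load_dataset("parquet", data_files=snapshot_files, split="train")
    return ds.map(lambda row: {"text": redact_pii(row["text"]),
                               "redactor_version": REDACTOR_VERSION})
```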

SRE lens: A 6-hour DDP training job on spot is fine on average, but the worst-case reclaim profile is bad — three reclaims at hours 4, 5, and 5.5 waste ~90 minutes of compute and push the run outside the quarterly window. Resolution: On-demand fallback policy — if cumulative reclaim time exceeds 90 minutes, the next checkpoint resume runs on-demand. Documented in Runbooks/absa-spot-fallback.md. Same pattern as US-MLE-01.
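The policy reduces to one decision per checkpoint resume; a minimal sketch with the runbook's 90-minute budget, assuming the orchestrator supplies per-reclaim compute-loss figures:

```python
def capacity_for_resume(reclaim_minutes: list[float], budget_min: float = 90.0) -> str:
    """Spot until cumulative reclaim waste exceeds the budget, then on-demand.

    reclaim_minutes holds the compute lost to each reclaim so far (time since
    the last checkpoint, per event). Crossing 90 minutes puts the quarterly
    window at risk, so the next resume pays the on-demand premium.
    """
    return "on-demand" if sum(reclaim_minutes) > budget_min else "spot"

assert capacity_for_resume([25.0, 30.0]) == "spot"
assert capacity_for_resume([30.0, 30.0, 35.0]) == "on-demand"
```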

Data Engineering lens: The aspect-emergence detector samples 20K reviews/day for coverage-ratio computation — at 80K/day that's 25% of new reviews, or ~1.8M coverage-ratio computations per quarter (~7.3M/year), which is materially expensive. Resolution: Reduce the daily sample to 5K and stretch the sustained window from 14 to 21 days (more days × fewer samples preserves detection power — see the sanity check below); validated against two years of historical coverage data showing equivalent detection lag.
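A quick check that the smaller sample still resolves the 4% threshold cleanly — plain binomial standard error, ignoring day-to-day correlation:

```python
import math

p = 0.04  # uncovered_aspect_rate threshold
for label, n in (("old: 20K/day x 14d", 20_000 * 14),
                 ("new:  5K/day x 21d", 5_000 * 21)):
    se = math.sqrt(p * (1 - p) / n)
    print(f"{label}: n={n:>7,}  SE(p_hat)={se:.5f}  ({se / p:.1%} of threshold)")
# Both plans pin a 4% rate to within ~1-2% of its own value; detection lag is
# governed by the sustained-window length, not by sampling noise.
```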

S2 — Address Before Scale

Data Engineering lens (S2): The OpenSearch index with parallel aspects.v3.* + aspects.v4.* sub-fields doubles aspects storage during migration windows: 50M docs × ~4KB aspects payload = ~200GB duplicated, ~400GB total. Resolution: a schema-migration TTL policy — aspects.<old_version>.* is dropped 30 days post-flip and storage falls back. Documented in Runbooks/absa-schema-migration-storage.md.
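A sketch of the TTL drop, assuming the opensearch-py client; the index name and field layout follow the doc, while the host and helper are illustrative:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "reviews-os.internal", "port": 443}])  # hypothetical host

def drop_aspect_version(index: str = "review-corpus", old: str = "v3") -> dict:
    """30 days post-flip: strip aspects.<old>.* from every doc still carrying
    it, reclaiming the ~200GB duplicate."""
    return client.update_by_query(
        index=index,
        body={
            "query": {"exists": {"field": f"aspects.{old}"}},
            "script": {"lang": "painless",
                       "source": f"ctx._source.aspects.remove('{old}')"},
        },
        wait_for_completion=False,  # run as a background task on 50M docs
    )
```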

FinOps lens: Combined async + batch + training cost is ~$3,200/month — under budget but trending up as catalog grows. Resolution: ABSA cost participates in the Cost-Optimization quarterly review (Cost-Optimization US-04 compute optimization). Reserve the option to move nightly batch to Graviton-based instances if/when SageMaker offers a cost-effective Graviton path for DeBERTa-v3 (currently A10G is faster per dollar but the gap is closing).

ML Scientist lens (S2): Offline-online correlation between aspect-F1 and content-ops summary-agreement has been 0.52 over the last 4 retrains — below the 0.6 trustworthy floor. This means offline aspect-F1 is only a partial leading indicator of the online quality the MCP cares about. Resolution: Per the foundations doc §4.3, ABSA is flagged as "always extended canary" — Phase B runs at 5% batch coverage for 7 days instead of 4; the content-ops summary-agreement metric is the primary online gate. Tracked for re-evaluation each quarter.

S3 — Future Work

Principal Architect lens: The 18-aspect taxonomy is shared across manga, manhwa, and manhua, but aspects emerge differently per format (vertical scroll for webtoons; binding for collectors). A federated taxonomy — shared core aspects plus per-format extensions, per the aspect-emergence counterpart's architect-level escalation #1 — may be the right v2 architecture. Tracked as a v2 backlog item; revisit after fan-fiction support lands (catalog-expansion-driven).

ML Scientist lens (S3): Replace LLM-distill (Sonnet) with a self-distilled student model fine-tuned on the in-house labelled set. The student would be smaller, cheaper to run for label generation, and free of Sonnet's version-instability. Defer until the LLM-distill cost crosses ~$5K/quarter (currently ~$1.5K).

SRE lens (S3): Multi-region active-active. Currently this story serves only ap-northeast-1. EU/US expansion would require cross-region replication of reviews_clean, the model artifact, and the OpenSearch index — non-trivial residency engineering. Tracked under data-residency expansion roadmap.

FinOps lens (S3): Investigate whether the nightly ~5M-review re-score is actually necessary, or whether the subset of edit events that meaningfully shift aspect scores is much smaller (~500K). If so, the nightly batch shrinks 10× and its cost drops to ~$170/month. Requires building an aspect-impact-of-edit predictor; defer until the ROI is clear.

Content-Ops lens (S3): The uncovered-aspect detector currently fires SNS to ML-Eng on-call. As the discovery loop matures, route the signal to a content-ops dashboard with a built-in "promote to v4 candidate" workflow — cutting the 14-day-detection → human-review latency. Tracked alongside the content-ops tooling roadmap.