US-MLE-03: Sentiment + ABSA (Aspect-Based Sentiment Analysis) Lifecycle
User Story
As an ML Engineer at Amazon-scale on the MangaAssist team,
I want to own the lifecycle of a multi-task DeBERTa-v3 model that produces 3-class sentiment and 18-aspect ABSA scores over the 50M cumulative review corpus — including a quarterly relabel loop, a versioned aspect schema with non-disruptive aspect_v3 → aspect_v4 migrations, async serving for new reviews, nightly batch transform for full-corpus refresh, and bilingual EN/JP aspect-name canonicalisation,
So that the Review-Sentiment MCP returns accurate aspect breakdowns to Claude (and downstream to US-MLE-06 recommendation), keeps pace with aspect-emergence (e.g., ai_art_authenticity, vertical_scroll_ux) tracked in Ground-Truth-Evolution/ML-Scenarios/06-absa-aspect-emergence.md, and never trains on spam-poisoned reviews (always reads the spam-cleaned corpus produced by US-MLE-07).
Acceptance Criteria
- Sentiment macro-F1 ≥ 0.85 (3-class: positive/neutral/negative) on the frozen golden set.
- Aspect-F1 (micro across the 18 active aspects in `aspect_v3`) ≥ 0.74 on the golden set.
- Per-language sentiment macro-F1 ≥ 0.83 EN, ≥ 0.81 JP, ≥ 0.78 mixed.
- Per-aspect F1 ≥ 0.70 for each of the 6 high-volume aspects (`art_style`, `story_pacing`, `character_development`, `translation_quality`, `print_quality`, `value_for_money`); ≥ 0.60 for the remaining 12.
- Quarterly vendor relabel of 5K examples completed; cross-vendor IAA (Appen vs Sama on shared 10% subset) Cohen's κ ≥ 0.72 EN, ≥ 0.70 JP.
- LLM-distilled augmentation labels (Claude Sonnet) capped at 25% of any class's training examples; precision audit on a 500-review vendor sample ≥ 0.90.
- Async-inference p95 latency ≤ 5s; batch-transform full nightly run on ~5M edited reviews completes within a 4-hour window.
- Aspect-schema migration from `v3` → `v4` (adding/deprecating aspects) does not require retraining the 18 stable aspects; only the delta is re-trained.
- PII redaction (names, addresses, phone, email, order-number patterns) applied at ingestion before training, before embedding, and before nightly batch — verifiable via dual-write audit log.
- Drift hub emits an SNS aspect-emergence signal when the `uncovered_aspect_rate` (per aspect-emergence detector) exceeds 4% sustained 14 days; signal triggers a planning-level schema-bump cycle.
- Rollback to previous registry version via SageMaker async endpoint update + batch-transform job ARN flip ≤ 5 minutes.
- All artifacts (training, evaluation, serving) reside in `ap-northeast-1`; review text never leaves residency boundary.
Architecture (HLD)
The Production Surface
The sentiment + ABSA model is the engine behind the Review-Sentiment MCP. It does not sit on the chatbot's hot synchronous path — the MCP reads pre-computed `aspects` scores (e.g., `{art_style: 4.8, story_pacing: 4.6, ...}`) from the OpenSearch review-corpus index. Those pre-computed scores come from this model, run in two modes:
- Async inference — every newly-submitted review (~80K/day) is enqueued to a SageMaker async endpoint. Per-review latency is uncritical (max ~30s OK) because the user experience is "review submitted; sentiment shows up within a minute on the title page." The async endpoint absorbs bursts (review-bombing, new-volume-release spikes) without provisioning for peak.
- Batch transform — every night, all reviews flagged with `edit_event` (vote count changed, helpful flag toggled, edited text, ABSA-aspect-schema bumped) — typically ~5M reviews — are re-scored in a single SageMaker batch transform job. This refreshes the OpenSearch index's `aspects` field and the per-volume aggregate DynamoDB items consumed by `get_volume_reception`.
The model is multi-task DeBERTa-v3-base (microsoft/deberta-v3-base, 140M parameters) with a shared encoder and two task heads:
Sentiment head: 3-class softmax over {positive, neutral, negative}
Aspect head: 18-class multi-label sigmoid over aspect_v3 schema
(each aspect outputs a polarity score in [-1, +1] when present)
The aspect_v3 schema (current; bumps to v4 mid-2026 per the aspect-emergence trigger):
art_style story_pacing character_development
translation_quality print_quality value_for_money
shipping binding_durability dialogue_naturalness
panel_layout worldbuilding emotional_impact
genre_authenticity volume_consistency cover_art
release_cadence collector_value content_warnings
The 18 aspects are versioned; the schema versioning is the architectural lever that makes v3 → v4 migrations non-disruptive (see §LLD-9).
End-to-End ML Lifecycle Diagram
flowchart TB
subgraph CORPUS[Review Corpus Layer]
RAW[Raw reviews<br/>~80K/day ingested<br/>50M cumulative]
SPAM[US-MLE-07 Spam Filter<br/>weekly retrain<br/>cleans corpus]
PII[PII Redactor<br/>Comprehend + regex]
CLEAN[Spam-clean + PII-redacted<br/>review corpus<br/>Iceberg, ap-northeast-1]
RAW --> SPAM --> PII --> CLEAN
end
subgraph LABELS[Label Layer]
L1[Vendor Appen<br/>~5K/quarter<br/>primary]
L2[Vendor Sama<br/>~500/quarter<br/>cross-vendor IAA only]
L3[LLM-Distill Sonnet<br/>~50K/quarter<br/>augmentation]
L4[Programmatic<br/>emoji + rating proxy<br/>low-confidence sentiment only]
LP[Label Platform<br/>aspect_version pinned<br/>Iceberg]
L1 --> LP
L2 --> LP
L3 --> LP
L4 --> LP
end
subgraph SCHEMA[Aspect Schema Layer]
TAX[aspect_taxonomy.yaml<br/>v3 effective_from 2025-Q4]
DISC[Aspect-Discovery Loop<br/>quarterly]
DET[Uncovered-Aspect<br/>Detector]
TAX -.feeds.-> LP
DISC --> TAX
DET -.signals.-> DISC
end
subgraph TRAIN[Training Pipeline - SageMaker Pipelines]
T1[1. Schema + Data Validation]
T2[2. PIT Feature Materialization]
T3[3. Stratified Split<br/>language x sentiment x aspect]
T4[4. Multi-Task Training<br/>g5.12xlarge 1 node DDP<br/>~6h spot ~$45]
T5[5. Offline Eval - 5 modes]
T6[6. Slice + Per-Aspect Analysis]
T7[7. Bias + PII Audit]
T8[8. Model Registration]
T1 --> T2 --> T3 --> T4 --> T5 --> T6 --> T7 --> T8
end
subgraph SERVE[Serving Layer]
REG[Model Registry<br/>v22 prod, v23 candidate]
ASYNC[SageMaker Async Endpoint<br/>g5.2xlarge AS<br/>~80K reviews/day]
BATCH[SageMaker Batch Transform<br/>nightly ~5M reviews<br/>g5.12xlarge x 4 nodes]
OS[(OpenSearch<br/>review-corpus index<br/>aspects field)]
DDB[(DynamoDB<br/>per-volume aggregates)]
REG --> ASYNC
REG --> BATCH
ASYNC --> OS
BATCH --> OS
BATCH --> DDB
end
subgraph CONS[Consumers]
MCP[Review-Sentiment MCP<br/>RAG-MCP-Integration/04]
REC[US-MLE-06 Recommendation<br/>aspect signal]
end
subgraph DRIFT[Drift Detection]
D1[Drift Hub<br/>PSI/KS/aspect-emergence]
D2[CloudWatch + SNS]
D1 --> D2
D2 -.triggers.-> T1
end
CLEAN --> T1
LP --> T1
TAX --> T1
T8 --> REG
OS --> MCP
DDB --> MCP
OS --> REC
ASYNC -.predictions.-> D1
BATCH -.predictions.-> D1
DET -.uncovered rate.-> D1
style CLEAN fill:#9cf,stroke:#333
style LP fill:#9cf,stroke:#333
style TAX fill:#fde68a,stroke:#92400e
style REG fill:#fd2,stroke:#333
style D1 fill:#fd2,stroke:#333
style PII fill:#f66,stroke:#333
Data Contracts and Volume
| Asset | Schema Version | Snapshot Cadence | Volume | Owner |
|---|---|---|---|---|
| `reviews_clean` Iceberg table | `review_v4` | Continuous (Kinesis → Iceberg) | 50M cumulative; ~80K/day added | US-MLE-07 (provider) |
| `aspect_taxonomy.yaml` | `aspect_v3` (current) | On taxonomy bump (quarterly–biannually) | 18 active aspects | Content-Ops + ML Eng (this story) |
| `absa_labels` Iceberg table | `label_v6`, keyed on (review_id, aspect_version) | Quarterly batches + continuous LLM-distill | ~150K labeled cumulative | ML Eng + Annotation Vendor PM |
| `absa_model` model package | `model_v22` in prod | Quarterly + on aspect-bump trigger | 140M params, ~560MB FP32 / ~280MB FP16 | ML Eng (this story) |
| `review_aspect_scores` OpenSearch field | `aspects.v3.*` | Per-review on async; full refresh nightly batch | ~50M docs | ML Eng + Search Platform |
| `volume_reception` DynamoDB | per (manga_id, volume_no, aspect_version) | Nightly | ~2M items | ML Eng (this story) |
| `absa_drift_metrics` CloudWatch | n/a | 5min input, daily concept, daily aspect-emergence | ~30K data points/day | Drift Hub |
Model Registry + Promotion Gates
flowchart LR
R22[Registry v22<br/>prod<br/>aspect_v3 schema<br/>label_v=608]:::prod
R23[Registry v23<br/>candidate<br/>aspect_v3 schema<br/>label_v=655]:::cand
R23 --> G1{Stage 1<br/>Offline Gate}
G1 -->|pass| G2{Stage 2<br/>Shadow async<br/>72h}
G1 -.fail.-> RB1[Rollback v22]
G2 -->|pass| G3{Stage 3<br/>Canary<br/>10% async + 5% batch}
G2 -.fail.-> RB2[Rollback v22]
G3 -->|pass| G4[Stage 4<br/>Full Promote v23<br/>incl. backfill 5M reviews]
G3 -.fail.-> RB3[Rollback v22]
classDef prod fill:#2d8,stroke:#333
classDef cand fill:#fd2,stroke:#333
style RB1 fill:#f66,stroke:#333
style RB2 fill:#f66,stroke:#333
style RB3 fill:#f66,stroke:#333
style G4 fill:#2d8,stroke:#333
The four-stage promotion gate follows the standard from deep-dives/00-foundations-and-primitives-for-ml-engineering.md §5.1, with story-specific thresholds:
- Stage 1 (offline): sentiment macro-F1 ≥ 0.85; aspect-F1 ≥ 0.74; per-aspect F1 ≥ 0.70 (top-6) and ≥ 0.60 (rest); per-language slice gates pass; PII-leak audit passes (zero PII tokens in any prediction explanation, on a 1K sample).
- Stage 2 (shadow async): 72-hour shadow run on the async endpoint comparing v23 against v22 on incoming reviews; per-aspect agreement rate ≥ 70% (lower than intent's 75% because aspects are multi-label and minor disagreements are normal); p95 async latency ≤ 1.3× v22.
- Stage 3 (canary): 10% of async traffic + 5% of batch transform run on v23 for 7 days; downstream Review-Sentiment MCP summary-quality spot-check (content-ops 30 summaries/week) shows ≥ 90% agreement.
- Stage 4 (full): promote, then kick off a backfill batch-transform over the 5M most-recent reviews to refresh the OpenSearch `aspects` field with v23 scores. v22 retained 30 days for rollback; OpenSearch keeps `aspects.v3.*` in two parallel sub-fields (one indexed by v22, one by v23) until the backfill completes.
The cumulative-window pattern from primitive §7.2 applies here: the training set always includes the entire 150K labeled history, with importance weighting toward recent labels (decay τ = 18 months). This is the right choice because aspect signals are slow-moving; sliding-window training would discard rare-aspect examples.
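A minimal sketch of that recency weighting, assuming the labels carry a `label_created_at` timestamp (the column and function names are illustrative, not part of the label contract):

```python
# recency_weights.py — illustrative sketch of the cumulative-window weighting.
import numpy as np
import pandas as pd

TAU_MONTHS = 18.0  # decay constant from the cumulative-window pattern above

def add_recency_weights(labels: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Attach an exponential-decay sample weight so recent labels dominate
    without discarding rare-aspect examples from the 150K cumulative history."""
    age_months = (as_of - labels["label_created_at"]).dt.days / 30.44
    labels = labels.copy()
    labels["sample_weight"] = np.exp(-age_months / TAU_MONTHS)
    return labels

# A 36-month-old label keeps exp(-2) ≈ 0.14 of full weight; a fresh label keeps ≈ 1.0.
```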
Low-Level Design
1. PII Redaction — Boundary Discipline
Review text is user-generated and routinely contains PII: shipping addresses ("shipped to 1-2-3 Shibuya..."), order numbers ("order #112-7654321"), names ("thanks Tanaka-san"), email, phone. Per the cross-cutting concern in the README and the foundations doc, PII redaction happens at three boundaries: ingestion (before storage in reviews_clean), embedding (before any vectorisation), and nightly batch (before the model sees the text). The redactor is run once at ingestion and the redacted form is what is stored — but a defensive re-pass runs at training-set load and at batch-transform input, because residency obligations require fail-closed behaviour even on stale data.
# pii_redactor.py
from dataclasses import dataclass
import re
import boto3
# Compiled once; module-level for performance.
_ORDER_RE = re.compile(r"\b\d{3}-\d{7}\b")
_PHONE_JP_RE = re.compile(r"\b0\d{1,4}-\d{1,4}-\d{4}\b")
_PHONE_EN_RE = re.compile(r"\b\+?\d{1,3}[-.\s]?\(?\d{1,4}\)?[-.\s]?\d{1,4}[-.\s]?\d{4}\b")
_EMAIL_RE = re.compile(r"\b[\w._%+-]+@[\w.-]+\.[A-Za-z]{2,}\b")
_JP_ADDRESS_HINT = re.compile(r"[都道府県市区町村]")
@dataclass
class RedactionReport:
spans_redacted: int
redaction_types: dict[str, int]
class ReviewPIIRedactor:
"""Three-layer redaction: regex (fast), Comprehend NER (deep), JP-specific.
Runs in ap-northeast-1 only. Comprehend client is bound to the region;
cross-region calls are denied at IAM level.
"""
def __init__(self, region: str = "ap-northeast-1"):
self.comprehend = boto3.client("comprehend", region_name=region)
# Comprehend Japanese PII detection requires the JP entity-recognizer model.
self.comprehend_jp = boto3.client("comprehend", region_name=region)
def redact(self, text: str, language: str) -> tuple[str, RedactionReport]:
report = RedactionReport(spans_redacted=0, redaction_types={})
text = self._regex_pass(text, report)
text = self._comprehend_pass(text, language, report)
if language in ("ja", "mixed"):
text = self._jp_specific_pass(text, report)
return text, report
def _regex_pass(self, text: str, report: RedactionReport) -> str:
for re_obj, label in [
(_ORDER_RE, "ORDER"),
(_EMAIL_RE, "EMAIL"),
(_PHONE_JP_RE, "PHONE_JP"),
(_PHONE_EN_RE, "PHONE_EN"),
]:
            text, count = re_obj.subn(f"[{label}]", text)
report.spans_redacted += count
report.redaction_types[label] = report.redaction_types.get(label, 0) + count
return text
def _comprehend_pass(self, text: str, language: str, report: RedactionReport) -> str:
if len(text) < 3 or len(text) > 5000:
return text
        client = self.comprehend_jp if language in ("ja", "mixed") else self.comprehend
        # Map the corpus language tag to a Comprehend language code; mixed reviews take the JP path.
        lang_code = "ja" if language in ("ja", "mixed") else "en"
        resp = client.detect_pii_entities(Text=text, LanguageCode=lang_code)
# Apply redactions back-to-front to avoid offset drift.
for ent in sorted(resp["Entities"], key=lambda e: -e["BeginOffset"]):
label = ent["Type"]
text = text[: ent["BeginOffset"]] + f"[{label}]" + text[ent["EndOffset"]:]
report.spans_redacted += 1
report.redaction_types[label] = report.redaction_types.get(label, 0) + 1
return text
def _jp_specific_pass(self, text: str, report: RedactionReport) -> str:
# Address tokens: redact spans containing 都/道/府/県/市/区/町/村
# plus surrounding katakana/kanji within +/-12 chars.
if _JP_ADDRESS_HINT.search(text):
# In production this calls a SudachiPy-based span detector;
# collapsed here for brevity.
text = jp_address_span_redactor(text)
report.spans_redacted += 1
report.redaction_types["ADDRESS_JP"] = report.redaction_types.get("ADDRESS_JP", 0) + 1
return text
The redactor's audit log writes (review_id, redaction_report, redactor_version) to a separate Iceberg table pii_redaction_audit. Quarterly audit reads a 1% sample and re-runs the redactor at the latest version; any review with new spans found is flagged and re-redacted across the corpus.
2. Aspect Schema as Versioned Artifact
The 18 aspects are not a constant — they are a product artifact per the aspect-emergence drift counterpart. The taxonomy lives in S3 and is signed off cross-functionally:
# s3://manga-ml-taxonomy-apne1/absa/v3/aspect_taxonomy.yaml
taxonomy:
version: 3
effective_from: 2025-10-01
signed_off_by: ["content-ops-lead", "ml-eng-lead", "product-pm"]
aspects:
- id: art_style
label_en: "Art / illustration style"
label_jp: "作画"
synonyms_en: ["illustration", "drawing", "artwork", "panels"]
synonyms_jp: ["イラスト", "絵", "画風"]
example_phrases_en: ["the art is gorgeous", "panels are clean", "linework is detailed"]
example_phrases_jp: ["絵が綺麗", "作画が安定している"]
effective_from: 2024-01-01
stable: true
- id: story_pacing
label_en: "Story pacing"
label_jp: "ストーリー展開"
synonyms_en: ["pacing", "plot speed", "story flow"]
synonyms_jp: ["ストーリー", "展開", "テンポ"]
effective_from: 2024-01-01
stable: true
- id: translation_quality
label_en: "Translation quality"
label_jp: "翻訳品質"
synonyms_en: ["translation", "TL", "localization"]
synonyms_jp: ["翻訳", "ローカライズ"]
effective_from: 2024-01-01
stable: true
# ... 15 more stable aspects ...
candidate_aspects: # Discovery-loop output, not yet promoted
- id: ai_art_authenticity
candidate_since: 2026-Q1
promoted_in_version: null # under review for v4
sample_phrases_en: ["AI-generated colorist", "looks AI-touched", "filters look AI"]
sample_phrases_jp: ["AI仕上げっぽい", "AIっぽい彩色"]
- id: vertical_scroll_ux
candidate_since: 2025-Q4
promoted_in_version: null
Labels and predictions are schema-versioned: every absa_labels record is keyed on (review_id, aspect_version), and every prediction is written to aspects.v3.<aspect_id> (versioned sub-field) in OpenSearch. When v3 → v4 happens, OpenSearch keeps both aspects.v3.* and aspects.v4.* populated until the backfill completes. The MCP reads aspects.v3.* until the read-flag flips.
The bilingual aspect-name canonicalisation happens via the label_jp/synonyms_jp mapping. JP reviews mentioning 作画 map to the same internal aspect_id art_style; the sentiment polarity is computed in the JP context but the aspect is named consistently across languages for downstream aggregation.
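A minimal sketch of the canonicalisation lookup built from the taxonomy's bilingual labels and synonyms (the function name and lower-casing policy are illustrative):

```python
# aspect_canonicalizer.py — sketch: surface-form → aspect_id map from aspect_taxonomy.yaml.
import yaml

def build_canonical_map(taxonomy_path: str) -> dict[str, str]:
    with open(taxonomy_path) as f:
        aspects = yaml.safe_load(f)["taxonomy"]["aspects"]
    surface_to_id: dict[str, str] = {}
    for a in aspects:
        surfaces = (
            [a["label_en"], a["label_jp"]]
            + a.get("synonyms_en", [])
            + a.get("synonyms_jp", [])
        )
        for s in surfaces:
            surface_to_id[s.lower()] = a["id"]
    return surface_to_id

# canonical = build_canonical_map("aspect_taxonomy.yaml")
# canonical["作画"] -> "art_style"; canonical["translation"] -> "translation_quality"
```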
3. Multi-Task Training Code
Training runs on g5.12xlarge × 1 node (4 × A10G GPUs, DDP) for ~6 hours on spot, ~$45/run. The training script is train.py:
# train.py
import os
import json
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import (
AutoModel,
AutoTokenizer,
get_linear_schedule_with_warmup,
)
from sklearn.metrics import f1_score
import yaml
class AspectTaxonomy:
"""Parses aspect_taxonomy.yaml and provides ordering + canonical IDs."""
def __init__(self, taxonomy_path: str):
with open(taxonomy_path) as f:
tax = yaml.safe_load(f)["taxonomy"]
self.version = tax["version"]
self.aspects = [a for a in tax["aspects"] if a.get("stable", False)]
self.aspect_ids = [a["id"] for a in self.aspects]
self.id_to_idx = {aid: i for i, aid in enumerate(self.aspect_ids)}
self.num_aspects = len(self.aspect_ids)
class MultiTaskDeBERTa(nn.Module):
"""Shared DeBERTa-v3 encoder + sentiment head + per-aspect polarity head.
Implements primitive §3.2 (multi-task training) and the schema-versioned
aspect head from US-MLE-03 §LLD-9.
"""
def __init__(
self,
model_name: str = "microsoft/deberta-v3-base",
num_aspects: int = 18,
sentiment_classes: int = 3,
task_dropout: float = 0.15,
):
super().__init__()
self.encoder = AutoModel.from_pretrained(model_name)
hidden = self.encoder.config.hidden_size # 768
# Sentiment head: 3-class
self.sentiment_dropout = nn.Dropout(task_dropout)
self.sentiment_head = nn.Linear(hidden, sentiment_classes)
# Aspect head: per-aspect (presence_logit, polarity_score)
# presence: sigmoid; polarity: tanh
self.aspect_dropout = nn.Dropout(task_dropout)
self.aspect_presence = nn.Linear(hidden, num_aspects)
self.aspect_polarity = nn.Linear(hidden, num_aspects)
self.num_aspects = num_aspects
def forward(self, input_ids, attention_mask):
out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
# Pooled = mean over masked tokens (better than [CLS] for DeBERTa-v3)
mask = attention_mask.unsqueeze(-1).float()
pooled = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
sent_logits = self.sentiment_head(self.sentiment_dropout(pooled))
asp_pres = self.aspect_presence(self.aspect_dropout(pooled))
asp_pol = torch.tanh(self.aspect_polarity(pooled))
return sent_logits, asp_pres, asp_pol
class MultiTaskLoss(nn.Module):
"""Sum of weighted sentiment CE + aspect BCE-with-logits + polarity MSE.
Weights:
sentiment: 1.0 (anchor task; never below this weight)
aspect_presence: 0.7 (multi-label; calibrated on val to balance with sentiment)
aspect_polarity: 0.5 (only computed where presence label is positive)
"""
def __init__(self, class_weights_sent: torch.Tensor):
super().__init__()
self.ce = nn.CrossEntropyLoss(weight=class_weights_sent)
self.bce = nn.BCEWithLogitsLoss(reduction="none")
self.mse = nn.MSELoss(reduction="none")
def forward(self, outputs, targets):
sent_logits, asp_pres, asp_pol = outputs
sent_y, asp_pres_y, asp_pol_y, asp_pol_mask = targets
loss_sent = self.ce(sent_logits, sent_y)
loss_pres = self.bce(asp_pres, asp_pres_y).mean()
# Polarity loss only where presence_y = 1 (mask)
loss_pol_raw = self.mse(asp_pol, asp_pol_y)
loss_pol = (loss_pol_raw * asp_pol_mask).sum() / asp_pol_mask.sum().clamp(min=1)
return loss_sent + 0.7 * loss_pres + 0.5 * loss_pol
def train_one_epoch(model, loader, loss_fn, optimizer, scheduler, device):
model.train()
running = 0.0
for batch in loader:
input_ids = batch["input_ids"].to(device)
attn = batch["attention_mask"].to(device)
sent_y = batch["sentiment"].to(device)
pres_y = batch["aspect_presence"].to(device)
pol_y = batch["aspect_polarity"].to(device)
pol_mask = batch["aspect_polarity_mask"].to(device)
outputs = model(input_ids, attn)
loss = loss_fn(outputs, (sent_y, pres_y, pol_y, pol_mask))
optimizer.zero_grad(set_to_none=True)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
scheduler.step()
running += loss.item()
return running / max(1, len(loader))
def evaluate(model, loader, taxonomy: AspectTaxonomy, device) -> dict:
model.eval()
sent_preds, sent_truth = [], []
pres_preds, pres_truth = [], []
with torch.no_grad():
for batch in loader:
sent_logits, asp_pres, _ = model(
batch["input_ids"].to(device),
batch["attention_mask"].to(device),
)
sent_preds.extend(sent_logits.argmax(-1).cpu().tolist())
sent_truth.extend(batch["sentiment"].cpu().tolist())
pres_preds.extend((torch.sigmoid(asp_pres) > 0.5).cpu().int().tolist())
pres_truth.extend(batch["aspect_presence"].cpu().int().tolist())
sent_macro_f1 = f1_score(sent_truth, sent_preds, average="macro")
aspect_micro_f1 = f1_score(pres_truth, pres_preds, average="micro")
per_aspect_f1 = f1_score(pres_truth, pres_preds, average=None)
return {
"sentiment_macro_f1": float(sent_macro_f1),
"aspect_micro_f1": float(aspect_micro_f1),
"per_aspect_f1": dict(zip(taxonomy.aspect_ids, [float(x) for x in per_aspect_f1])),
}
def main():
# Hyperparameters from SageMaker channel
args = json.loads(os.environ.get("SM_TRAINING_ENV", "{}")).get("hyperparameters", {})
taxonomy = AspectTaxonomy(args["taxonomy_path"])
tokenizer = AutoTokenizer.from_pretrained(args.get("model_name", "microsoft/deberta-v3-base"))
train_ds = ABSAReviewDataset(
s3_uri=args["train_uri"], tokenizer=tokenizer, taxonomy=taxonomy, max_len=320
)
val_ds = ABSAReviewDataset(
s3_uri=args["val_uri"], tokenizer=tokenizer, taxonomy=taxonomy, max_len=320
)
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
    model = MultiTaskDeBERTa(num_aspects=taxonomy.num_aspects).to(device)
    if torch.cuda.device_count() > 1:
        # torch_distributed launches one process per GPU; initialise the process group
        # before wrapping the model in DDP.
        torch.distributed.init_process_group(backend="nccl")
        model = torch.nn.parallel.DistributedDataParallel(
            model,
            device_ids=[local_rank],
            output_device=local_rank,
        )
    # SageMaker hyperparameters arrive as strings; parse the JSON list before building the tensor.
    class_weights = torch.tensor(json.loads(args["sentiment_class_weights"]), dtype=torch.float32).to(device)
loss_fn = MultiTaskLoss(class_weights_sent=class_weights)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True, num_workers=4)
val_loader = DataLoader(val_ds, batch_size=64, shuffle=False, num_workers=4)
total_steps = len(train_loader) * int(args["epochs"])
scheduler = get_linear_schedule_with_warmup(
optimizer, int(0.1 * total_steps), total_steps
)
best_f1 = 0.0
for epoch in range(int(args["epochs"])):
train_loss = train_one_epoch(model, train_loader, loss_fn, optimizer, scheduler, device)
metrics = evaluate(model, val_loader, taxonomy, device)
score = 0.6 * metrics["sentiment_macro_f1"] + 0.4 * metrics["aspect_micro_f1"]
print(f"epoch={epoch} loss={train_loss:.4f} score={score:.4f} metrics={metrics}")
# Checkpoint every epoch (spot reclaim discipline; primitive §3.2)
ckpt_path = f"/opt/ml/checkpoints/epoch_{epoch}.pt"
torch.save({"model": model.state_dict(), "metrics": metrics}, ckpt_path)
if score > best_f1:
best_f1 = score
torch.save(
{"model": model.state_dict(), "metrics": metrics, "taxonomy_version": taxonomy.version},
"/opt/ml/model/best.pt",
)
# Save the tokenizer + taxonomy snapshot alongside model
tokenizer.save_pretrained("/opt/ml/model/")
with open("/opt/ml/model/taxonomy_snapshot.yaml", "w") as f:
yaml.safe_dump({"version": taxonomy.version, "aspects": taxonomy.aspect_ids}, f)
if __name__ == "__main__":
main()
The SageMaker Pipelines wrapper:
# training_pipeline.py
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.parameters import ParameterString, ParameterInteger
from sagemaker.huggingface import HuggingFaceProcessor
from sagemaker.huggingface.estimator import HuggingFace
def build_absa_pipeline(role: str, region: str = "ap-northeast-1") -> Pipeline:
label_version = ParameterInteger(name="MaxLabelVersion", default_value=0)
aspect_version = ParameterInteger(name="AspectVersion", default_value=3)
taxonomy_uri = ParameterString(
name="TaxonomyURI",
default_value="s3://manga-ml-taxonomy-apne1/absa/v3/aspect_taxonomy.yaml",
)
estimator = HuggingFace(
entry_point="train.py",
source_dir="src/",
instance_type="ml.g5.12xlarge",
instance_count=1,
role=role,
transformers_version="4.36",
pytorch_version="2.1",
py_version="py310",
use_spot_instances=True,
max_wait=28800,
max_run=21600, # 6h hard cap
checkpoint_s3_uri="s3://manga-ml-checkpoints-apne1/absa/",
checkpoint_local_path="/opt/ml/checkpoints",
distribution={"torch_distributed": {"enabled": True}},
hyperparameters={
"model_name": "microsoft/deberta-v3-base",
"epochs": 4,
"taxonomy_path": "/opt/ml/input/data/taxonomy/aspect_taxonomy.yaml",
"train_uri": "/opt/ml/input/data/train/",
"val_uri": "/opt/ml/input/data/val/",
"sentiment_class_weights": "[1.0, 1.4, 1.2]", # neutral up-weighted, neg too
},
)
train = TrainingStep(name="ABSAMultiTaskTrain", estimator=estimator)
# ... Stage 1-7 processing steps from the diagram
return Pipeline(
name="ABSAQuarterlyRetrain",
parameters=[label_version, aspect_version, taxonomy_uri],
steps=[train], # plus the validation/eval/bias/registration steps
)
4. Why Async Inference (and Not Real-Time or Plain Batch)
The choice of SageMaker async inference over real-time-endpoint or pure batch transform is deliberate:
| Option | Why ruled out / chosen |
|---|---|
| Real-time endpoint | Per-review inference latency is uncritical — the user submits a review and reads the title page later. Real-time would require provisioning for ~80K/day with peak bursts at ~600/min during volume drops; async absorbs bursts via SQS-backed queue. Real-time would also waste money idling between bursts. |
| Pure batch transform | A nightly-only pipeline means new reviews wait up to 24 hours before showing up in aspects.* field. This breaks the user expectation that review submission affects the title-page summary "soon." |
| Async inference (chosen) | Per-review p95 ≤ 5s; queue absorbs review-bombing bursts up to ~5K/min without provisioning for peak; ScaleDownToZero behaviour cuts cost during JP off-peak (07:00–18:00 JST when review submission is light); response payload written to S3 and consumed by an OpenSearch-update Lambda that updates the document's aspects field. |
Plus a nightly batch transform for the ~5M reviews that have edit-events (see §LLD-5). The two modes share the same model artifact but use different SageMaker resources and different scaling profiles.
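A sketch of the async enqueue path for a newly submitted review, using the standard `invoke_endpoint_async` API (the bucket names and payload layout are assumptions for illustration):

```python
# enqueue_review.py — sketch: stage the review payload in S3, then enqueue async inference.
import json
import boto3

s3 = boto3.client("s3", region_name="ap-northeast-1")
smr = boto3.client("sagemaker-runtime", region_name="ap-northeast-1")

def enqueue_review(review_id: str, review_text: str, language: str, version: str) -> str:
    # Async inference reads its input from S3 rather than the request body.
    key = f"absa/in/{review_id}.json"
    s3.put_object(
        Bucket="manga-ml-async-in-apne1",  # illustrative bucket name
        Key=key,
        Body=json.dumps({"review_id": review_id, "text": review_text, "language": language}),
    )
    resp = smr.invoke_endpoint_async(
        EndpointName=f"absa-async-{version}",
        InputLocation=f"s3://manga-ml-async-in-apne1/{key}",
        ContentType="application/json",
    )
    # The scored output lands at OutputLocation; the SNS success topic fans out to the
    # OpenSearch-update Lambda described in this section.
    return resp["OutputLocation"]
```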
5. Async Endpoint Configuration
# endpoint_config.py
import boto3
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.async_inference.async_inference_config import AsyncInferenceConfig
def deploy_absa_async_endpoint(model_data_uri: str, version: str, role: str) -> str:
model = HuggingFaceModel(
model_data=model_data_uri,
role=role,
transformers_version="4.36",
pytorch_version="2.1",
py_version="py310",
env={
"HF_MODEL_ID": "absa_multitask",
"TAXONOMY_URI": "s3://manga-ml-taxonomy-apne1/absa/v3/aspect_taxonomy.yaml",
"MAX_INPUT_LEN": "320", # token cap; truncate long reviews
"MODEL_VERSION": version,
"PII_REDACTOR_VERSION": "v4", # defensive re-redaction at inference
},
)
async_cfg = AsyncInferenceConfig(
output_path=f"s3://manga-ml-async-out-apne1/absa/v{version}/",
max_concurrent_invocations_per_instance=8,
notification_config={
"SuccessTopic": "arn:aws:sns:ap-northeast-1:ACCT:absa-async-success",
"ErrorTopic": "arn:aws:sns:ap-northeast-1:ACCT:absa-async-error",
},
)
predictor = model.deploy(
initial_instance_count=1,
instance_type="ml.g5.2xlarge",
endpoint_name=f"absa-async-{version}",
async_inference_config=async_cfg,
)
# Scale-to-zero on JP off-peak; scale up to 4 instances on peak
autoscaling = boto3.client("application-autoscaling")
autoscaling.register_scalable_target(
ServiceNamespace="sagemaker",
ResourceId=f"endpoint/absa-async-{version}/variant/AllTraffic",
ScalableDimension="sagemaker:variant:DesiredInstanceCount",
MinCapacity=0, # async allows 0
MaxCapacity=4,
)
autoscaling.put_scaling_policy(
PolicyName="ABSAAsyncBacklogTracking",
ServiceNamespace="sagemaker",
ResourceId=f"endpoint/absa-async-{version}/variant/AllTraffic",
ScalableDimension="sagemaker:variant:DesiredInstanceCount",
PolicyType="TargetTrackingScaling",
TargetTrackingScalingPolicyConfiguration={
"TargetValue": 5.0, # 5 messages per instance backlog
"CustomizedMetricSpecification": {
"MetricName": "ApproximateBacklogSizePerInstance",
"Namespace": "AWS/SageMaker",
"Statistic": "Average",
},
"ScaleInCooldown": 300,
"ScaleOutCooldown": 60,
},
)
return predictor.endpoint_name
6. Nightly Batch Transform (5M Reviews)
# nightly_batch.py
from datetime import datetime
from sagemaker.transformer import Transformer
def run_nightly_batch(model_name: str, input_uri: str, output_uri: str):
"""Re-scores reviews that had an edit_event in the last 24h.
Input: ~5M reviews/night sharded into 200 input files of ~25K reviews each.
Compute: g5.12xlarge x 4 nodes; ~2.5h wall clock.
"""
transformer = Transformer(
model_name=model_name,
instance_count=4,
instance_type="ml.g5.12xlarge",
strategy="MultiRecord",
max_concurrent_transforms=8,
max_payload=6,
output_path=output_uri,
assemble_with="Line",
accept="application/jsonlines",
env={"TAXONOMY_URI": "s3://manga-ml-taxonomy-apne1/absa/v3/aspect_taxonomy.yaml"},
)
transformer.transform(
data=input_uri,
content_type="application/jsonlines",
split_type="Line",
job_name=f"absa-nightly-{datetime.utcnow():%Y%m%d}",
)
transformer.wait()
The job is scheduled by EventBridge at 02:00 JST (on Fridays this runs after US-MLE-07's weekly spam retrain; on other days the spam-retrain dependency check is a no-op). Output is a JSONL stream consumed by an OpenSearch-bulk Lambda that updates each review's `aspects.v3.*` field; the Lambda also recomputes the per-volume aggregate DynamoDB items consumed by `get_volume_reception`.
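A minimal sketch of the OpenSearch-bulk update applied by that Lambda, assuming `opensearch-py` and the `review-corpus` index naming used in this section (client construction, auth, and error handling elided):

```python
# os_bulk_update.py — sketch of the Lambda body that applies batch-transform output.
import json
from opensearchpy import OpenSearch, helpers

def apply_batch_output(client: OpenSearch, jsonl_lines: list[str]) -> None:
    def actions():
        for line in jsonl_lines:
            rec = json.loads(line)  # e.g. {"review_id": ..., "aspects": {"art_style": 0.92, ...}}
            yield {
                "_op_type": "update",
                "_index": "review-corpus",
                "_id": rec["review_id"],
                "doc": {"aspects": {"v3": rec["aspects"]}},  # versioned sub-field per §LLD-2
            }
    helpers.bulk(client, actions(), chunk_size=1000)
```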
7. Quarterly Vendor Relabel — the 5K Loop
Every quarter, the annotation vendor (Appen primary; Sama for cross-vendor IAA) labels 5,000 reviews:
| Bucket | % of 5K | Selection |
|---|---|---|
| Active learning — high model uncertainty | 40% (2,000) | Predictions with sentiment-confidence < 0.6 OR aspect-presence-prob in [0.4, 0.6] |
| High-impact — popular titles | 25% (1,250) | Reviews on titles in top-1000 catalog by 90-day query volume |
| Random for distribution coverage | 20% (1,000) | Stratified sample by language × genre × format |
| Drift-flagged reviews | 10% (500) | Reviews flagged by the input-drift detector as far-from-training-centroid |
| Aspect-emergence candidates | 5% (250) | Reviews flagged by the uncovered-aspect detector (§LLD-9) |
The 40/25/20/10/5 mix mirrors the 50/30/20 active-learning mix from the domain-shift drift counterpart, adapted for ABSA. Cross-vendor IAA is computed on a 10% shared subset (500 reviews labeled by both Appen and Sama). Any aspect with cross-vendor κ < 0.65 EN or < 0.62 JP is flagged for guideline clarification before the labels enter training.
LLM-distilled labels (Claude Sonnet) supply ~50K/quarter for data augmentation. The capping rule from the foundations doc applies: LLM labels are at most 25% of any class's training examples. The LLM-distill prompt is templated and versioned; quarterly precision audit on a 500-review vendor sample must hit ≥ 0.90 or the LLM labels for that quarter are excluded.
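A minimal sketch of the 25% per-class cap as it would be enforced at training-set assembly (column names such as `label_source` are illustrative):

```python
# llm_label_cap.py — sketch: enforce the 25% per-class cap on LLM-distilled labels.
import pandas as pd

MAX_LLM_FRACTION = 0.25

def cap_llm_labels(train: pd.DataFrame) -> pd.DataFrame:
    """Downsample LLM-distilled rows so they are at most 25% of each sentiment class.
    Assumes columns: sentiment (class), label_source ('vendor' | 'llm_distill' | ...)."""
    kept = []
    for _, grp in train.groupby("sentiment"):
        vendor = grp[grp["label_source"] != "llm_distill"]
        llm = grp[grp["label_source"] == "llm_distill"]
        # llm / (llm + vendor) <= 0.25  =>  llm <= vendor * 0.25 / 0.75
        max_llm = int(len(vendor) * MAX_LLM_FRACTION / (1 - MAX_LLM_FRACTION))
        kept.append(pd.concat([vendor, llm.sample(n=min(len(llm), max_llm), random_state=7)]))
    return pd.concat(kept, ignore_index=True)
```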
8. Bilingual JP/EN Handling
Three concerns specific to JP/EN:
Aspect lexicon canonicalisation. The taxonomy stores label_jp and synonyms_jp per aspect (§LLD-2). At label time, JP reviews are annotated against the JP labels (作画 → art_style, ストーリー → story_pacing); the canonical aspect_id is the same English ID, so downstream aggregation is bilingual-consistent. This is the platform-shared aspect ontology — US-MLE-06 recommendation reads aspects.v3.art_style regardless of the original review language.
Tokenization. DeBERTa-v3-base uses SentencePiece, which handles JP better than DistilBERT-multilingual's WordPiece (US-MLE-01's pain point). No external pre-tokenizer is needed for sentiment, but ABSA span detection benefits from a Sudachi pre-pass that surfaces aspect-bearing noun phrases — this is a feature engineered into training data (aspect_anchor_tokens), not a serving-time external call. The full DeBERTa-v3 inference is 320 tokens max; reviews longer than this are truncated with the sentiment-bearing tail preserved (the closing summary line of a review is usually higher signal than the opening setup).
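A minimal sketch of the tail-preserving truncation; the 320-token cap comes from this section, while the head/tail split sizes are an illustrative choice:

```python
# tail_preserving_truncation.py — sketch: keep the opening and (more importantly) the
# closing tokens of long reviews, since the closing summary line carries more sentiment.
MAX_LEN = 320

def truncate_head_tail(token_ids: list[int], head: int = 96, tail: int = 224) -> list[int]:
    """96 + 224 = 320; the split ratio is an assumption, not a documented constant."""
    if len(token_ids) <= MAX_LEN:
        return token_ids
    return token_ids[:head] + token_ids[-tail:]
```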
Honorifics and politeness. Same as US-MLE-01's normalisation — 〜していただけますでしょうか style polite negative (rare, often satirical) versus クソ direct negative (common, blunt) — both encode the same sentiment but the formal register is rare in training data. The data-augmentation script generates politeness-shifted versions for sentiment-balanced training.
Ironic positive ("尊い" = "precious/holy" — used for emotionally devastating chapters that are positively-received). This is the JP analogue of "this kills me" / "no thoughts head empty" from the domain-shift counterpart. The model handles it with mixed success; the abstain path (§LLD-11) catches the worst cases.
9. Aspect-Schema Migration: v3 → v4 Without Retraining the 18 Stable Aspects
The architectural lever that makes schema bumps cheap is head-modular re-training. The shared encoder + sentiment head + aspect head structure means that adding a new aspect (e.g., ai_art_authenticity) is a delta operation:
flowchart LR
V3[v3 model<br/>encoder + sent_head + aspect_head 18-dim]
DELTA[Delta training:<br/>freeze encoder<br/>freeze sentiment_head<br/>freeze 18 stable aspect dims<br/>train ONLY new aspect dims]
V4[v4 model<br/>encoder + sent_head + aspect_head 20-dim]
V3 --> DELTA --> V4
style V3 fill:#9cf,stroke:#333
style DELTA fill:#fde68a,stroke:#92400e
style V4 fill:#2d8,stroke:#333
Concretely:
- Content-ops + ML-eng promote 2 candidate aspects (e.g., `ai_art_authenticity`, `vertical_scroll_ux`) to taxonomy v4 with `effective_from=2026-07-01`.
- Differential re-labelling per the aspect-emergence counterpart §"How do you keep re-labeling cost bounded?": only re-label reviews where keyword heuristics suggest the new aspect ("AI", "AI仕上げ", "AI-touched" for `ai_art_authenticity`; "scroll", "スクロール" for `vertical_scroll_ux`). Roughly 8K reviews re-labelled per new aspect.
- Build the v4 model: load v3 weights, expand the `aspect_presence` and `aspect_polarity` linear layers from (768 → 18) to (768 → 20) by appending fresh random rows for the 2 new aspects. Freeze all other parameters.
- Train for 1 epoch on the delta-labelled set (~16K examples) with a higher learning rate (5e-5) on the new rows only. ~30 minutes on `g5.2xlarge`, ~$3.
This produces v4 with the 18 stable aspects' F1 unchanged (frozen weights → identical predictions) and the 2 new aspects' F1 measured against v4-only labels per the aspect-emergence counterpart's per-aspect-version F1 rule. A full re-train (full encoder unfrozen, all 20 aspects) runs every 2 schema-bumps to prevent the encoder drifting away from the new aspect distribution.
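A sketch of the head-modular expansion under these assumptions (it reuses `MultiTaskDeBERTa` from `train.py`; the gradient-mask hook is one way to realise "train only the new aspect dims" while keeping the 18 stable rows frozen):

```python
# expand_aspect_head.py — sketch of the v3 → v4 delta build described above.
# Checkpoint path and helper names are illustrative.
import torch
import torch.nn as nn
from train import MultiTaskDeBERTa  # shared encoder + heads from §LLD-3

def expand_head(old: nn.Linear, num_new: int) -> nn.Linear:
    """Append fresh randomly-initialised output rows to a (hidden → n) linear layer."""
    new = nn.Linear(old.in_features, old.out_features + num_new)
    with torch.no_grad():
        new.weight[: old.out_features] = old.weight
        new.bias[: old.out_features] = old.bias
    return new

def freeze_stable_rows(layer: nn.Linear, num_stable: int) -> None:
    """Zero gradients on the stable rows so only the appended aspect rows are trained."""
    def _mask(grad):
        grad = grad.clone()
        grad[:num_stable] = 0
        return grad
    layer.weight.register_hook(_mask)
    layer.bias.register_hook(_mask)

def build_v4_from_v3(v3_ckpt: str, num_new: int = 2) -> MultiTaskDeBERTa:
    model = MultiTaskDeBERTa(num_aspects=18)
    model.load_state_dict(torch.load(v3_ckpt)["model"])
    # Freeze encoder + sentiment head entirely.
    for p in model.parameters():
        p.requires_grad = False
    # Expand the two aspect heads from 18 → 20 rows; old rows copied, new rows random.
    model.aspect_presence = expand_head(model.aspect_presence, num_new)
    model.aspect_polarity = expand_head(model.aspect_polarity, num_new)
    model.num_aspects += num_new
    for head in (model.aspect_presence, model.aspect_polarity):
        for p in head.parameters():
            p.requires_grad = True
        freeze_stable_rows(head, num_stable=18)
    return model
```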
The OpenSearch index keeps aspects.v3.* and aspects.v4.* as parallel sub-fields during the migration window. The MCP read-flag is flipped via SSM Parameter Store after the v4 backfill completes (~3 hours batch-transform on the delta-flagged ~1M reviews).
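A minimal sketch of the read-flag flip; the flag itself is `absa_taxonomy_version_active` from the kill-switch list, while the exact SSM parameter path shown here is an assumption:

```python
# flip_aspect_read_flag.py — sketch: atomic flip of the MCP's active aspect-schema version.
import boto3

ssm = boto3.client("ssm", region_name="ap-northeast-1")
PARAM = "/manga-ml/absa/taxonomy_version_active"  # assumed path for absa_taxonomy_version_active

def flip_active_version(version: str) -> None:
    ssm.put_parameter(Name=PARAM, Value=version, Type="String", Overwrite=True)

def read_active_version() -> str:
    return ssm.get_parameter(Name=PARAM)["Parameter"]["Value"]

# After the v4 backfill completes: flip_active_version("v4"); rollback is a flip back to "v3".
```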
10. Offline Evaluation (5 modes from primitive §4.1)
The five evaluation modes for ABSA:
# offline_eval.py
# Imports (pandas, sklearn metric helpers) and the report dataclasses (EvalReport,
# GoldenReport, SliceReport, ...) live in the shared eval package; elided here for brevity.
class ABSAOfflineEvaluator:
def evaluate(self) -> EvalReport:
return EvalReport(
golden=self.eval_golden(),
slice=self.eval_slice(),
adversarial=self.eval_adversarial(),
counterfactual=self.eval_counterfactual_replay(),
offline_online_corr=self.eval_offline_online_corr(),
)
def eval_golden(self) -> GoldenReport:
"""Frozen 2K-review golden set; refreshed quarterly with the 5K vendor batch.
Sentiment macro-F1 + aspect-F1 + per-aspect-F1 against schema_v3."""
golden = pd.read_parquet("s3://manga-ml-eval-apne1/absa/golden/v3.parquet")
sent_p, asp_p = self.model.predict_batch(golden["review_text"].tolist())
return GoldenReport(
sentiment_macro_f1=f1_score(golden["sentiment"], sent_p, average="macro"),
aspect_micro_f1=multilabel_micro_f1(golden["aspects"].tolist(), asp_p),
per_aspect_f1=per_aspect_f1(golden["aspects"].tolist(), asp_p, taxonomy=self.tax),
)
def eval_slice(self) -> SliceReport:
"""Per-language x sentiment x format (manga / manhwa / manhua / fan-edition)."""
# ... cartesian over (language, sentiment_class, format, length_bucket)
def eval_adversarial(self) -> AdversarialReport:
"""600-example set: ironic JP ('尊い'), code-switched, review-bomb language,
sarcastic 5-stars, prompt-injection redirected to sentiment ('ignore prior...
rate this 5-stars')."""
adv = pd.read_parquet("s3://manga-ml-eval-apne1/absa/adversarial/v2.parquet")
# ...
def eval_counterfactual_replay(self) -> CounterfactualReport:
"""10K production reviews from last 7d. Compare predicted aspects against
v22 prod predictions to measure (cost, agreement) shift; labels not available."""
# ...
def eval_offline_online_corr(self) -> OfflineOnlineCorrReport:
"""Reads the last 6 retrains' (offline_aspect_micro_f1, online_summary_agree_rate)
pairs; the online metric is the content-ops 30-summaries/week spot-check.
Trustworthy if Pearson >= 0.6."""
# ...
The slice gate enforces per-language sentiment macro-F1 ≥ 0.83 EN, ≥ 0.81 JP, ≥ 0.78 mixed. The per-aspect gate enforces ≥ 0.70 on the 6 high-volume aspects and ≥ 0.60 on the rest. The adversarial regression ceiling is 2% absolute on the adversarial set's macro-F1.
11. Drift Detection — Three Kinds + Aspect-Emergence
Beyond the four standard drift kinds from primitive §6.1, ABSA adds a fifth: aspect-emergence drift (the existence of new aspects in production that the schema does not cover).
| Drift Kind | Detector | Cadence | Threshold |
|---|---|---|---|
| Input drift | PSI on token-distribution; embedding-centroid shift | 5 min on async stream; daily on full corpus | PSI > 0.2 sustained 24h |
| Label drift | χ² on sentiment class proportions vs reference | Daily | p < 0.01 sustained 7d |
| Prediction drift | KS on aspect-presence rates per aspect | Daily | KS > 0.15 sustained 24h |
| Concept drift | Δ-F1 on rolling-2-week labeled subset | Daily | Δ-F1 > 0.03 sustained 7d |
| Aspect-emergence | Uncovered-aspect rate (model coverage_ratio < 0.3); LLM open-vocab candidate clustering on 10K-review monthly sample | Daily on coverage; monthly on candidate clustering | uncovered_rate > 4% sustained 14 days |
The aspect-emergence detector implements the aspect-emergence counterpart's coverage-ratio path:
# aspect_emergence_detector.py
# Drift-hub helpers (sample_reviews, batch_predict_with_coverage, cw_metric, sns,
# sustained_for_days, top_uncovered_phrases) are shared utilities, elided here.
def daily_coverage_check(date: str):
sample = sample_reviews(date, n=20_000, stratified_by="format")
spans, hints = batch_predict_with_coverage(sample)
uncovered_rate = (
sum(1 for h in hints if h is not None and h.coverage_ratio < 0.3) / len(sample)
)
cw_metric.put("UncoveredAspectRate", uncovered_rate, dimensions={"format": ...})
if uncovered_rate > 0.04 and sustained_for_days(14):
sns.publish(
TopicArn="arn:...:absa-aspect-emergence",
Message=json.dumps({"rate": uncovered_rate, "sample_phrases": top_uncovered_phrases(hints)}),
)
The SNS message triggers a planning-level schema-bump cycle (not an automated retrain); content-ops + ML-eng convene, run the LLM open-vocab discovery on a 10K sample, and decide whether to promote candidates to v4. This explicitly ties to aspect-emergence drift counterpart: the trigger is in code; the response is human-in-the-loop.
12. Retraining Trigger Logic
Three retrain triggers:
- Quarterly scheduled: EventBridge fires on the first Monday of each quarter. Pulls the new 5K vendor labels + accumulated LLM-distill (capped) and re-trains. Full unfreeze.
- Aspect-emergence triggered: SNS topic
absa-aspect-emergencefires when the uncovered-aspect rate exceeds 4% sustained 14 days. Triggers a delta retrain after the human review confirms new aspects to promote (§LLD-9). - Concept-drift triggered: SNS fires when Δ-F1 on the rolling-2-week labeled subset exceeds 0.03 sustained 7 days. Forces a quarterly retrain ahead of schedule.
Same check_promotion_eligibility gate as US-MLE-01 (global_ml_freeze, per-model promotion flag, no-concurrent-retrains).
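A sketch of that gate as it would apply here, assuming the SSM flag paths from the kill-switch list; the `global_ml_freeze` parameter path and the concurrent-retrain check via the SageMaker Pipelines API are illustrative:

```python
# promotion_gate.py — sketch of check_promotion_eligibility for this story.
import boto3

ssm = boto3.client("ssm", region_name="ap-northeast-1")
sm = boto3.client("sagemaker", region_name="ap-northeast-1")

def _flag(name: str) -> str:
    return ssm.get_parameter(Name=name)["Parameter"]["Value"]

def check_promotion_eligibility() -> bool:
    if _flag("/manga-ml/global_ml_freeze") == "true":        # global kill switch wins
        return False
    if _flag("/manga-ml/absa/promotion_enabled") != "true":  # per-model promotion flag
        return False
    # No concurrent retrains: refuse if another ABSA pipeline execution is still running.
    running = sm.list_pipeline_executions(
        PipelineName="ABSAQuarterlyRetrain", MaxResults=10
    )["PipelineExecutionSummaries"]
    if any(e["PipelineExecutionStatus"] == "Executing" for e in running):
        return False
    return True
```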
13. Cross-Story Dependency: Reading the Spam-Clean Corpus
Per the README dependency graph, this story reads the spam-cleaned review corpus from US-MLE-07 — never the raw stream. The contract:
- US-MLE-07's weekly retrain produces
reviews_cleanIceberg snapshot every Friday 06:00 JST. - This story's quarterly training reads
reviews_cleanat the latest snapshot ≥ 7 days old (to give US-MLE-07 a buffer to roll back if its weekly retrain fails its own promotion gate). - This story's nightly batch transform reads
reviews_cleanat the most-recent snapshot. - A US-MLE-07 promotion-rollback (spam classifier rolled back) automatically re-triggers this story's drift hub to flag potential training-data poisoning; if the rollback affects > 5% of the last 24h's reviews, the nightly batch is delayed and re-run after
reviews_cleanis re-snapshotted.
This is the cross-story coordination that the README flagged as a sequencing requirement.
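A minimal sketch of the snapshot-pinning read for quarterly training, assuming a Spark + Iceberg time-travel read (the catalog/table identifier is illustrative; the 7-day buffer is the documented contract):

```python
# snapshot_pinning.py — sketch: read reviews_clean at the latest snapshot >= 7 days old.
from datetime import datetime, timedelta, timezone
from pyspark.sql import SparkSession

def read_training_snapshot(spark: SparkSession, table: str = "glue.reviews.reviews_clean"):
    cutoff = datetime.now(timezone.utc) - timedelta(days=7)
    cutoff_ms = int(cutoff.timestamp() * 1000)
    # Iceberg time travel: the snapshot that was current as of the cutoff timestamp.
    return (
        spark.read.option("as-of-timestamp", str(cutoff_ms))
        .format("iceberg")
        .load(table)
    )

# The nightly batch, by contrast, reads the table without a time-travel option (latest snapshot).
```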
Monitoring & Metrics
| Category | Metric | Target | Alarm Threshold |
|---|---|---|---|
| Online — Async | p50 latency | ≤ 1.5 s | > 3 s 5min |
| | p95 latency | ≤ 5 s | > 8 s 5min |
| | Backlog size | < 100 messages | > 500 sustained 10min |
| | Error rate | < 0.1% | > 1% 5min |
| Online — Batch | Nightly wall-clock | ≤ 4 h | > 6 h |
| | Per-review cost | ≤ $0.00012 | > $0.0002 |
| Quality — Sentiment | Macro-F1 (3-class) | ≥ 0.85 | < 0.80 quarterly |
| | Per-language EN/JP/mixed | ≥ 0.83 / 0.81 / 0.78 | drop > 0.03 |
| Quality — Aspect | Aspect-F1 micro | ≥ 0.74 | < 0.70 quarterly |
| | Per-aspect F1 (top-6) | ≥ 0.70 each | < 0.65 any |
| | Per-aspect F1 (rest) | ≥ 0.60 each | < 0.55 any |
| Quality — MCP feedback | Content-ops 30-summary/wk agreement | ≥ 90% | < 85% 4 weeks |
| Drift | PSI top-30 tokens | < 0.2 | > 0.2 24h |
| | Aspect-presence KS per aspect | < 0.15 | > 0.15 24h |
| | Uncovered-aspect rate | < 4% | > 4% 14d → SNS |
| | Δ-F1 vs reference | < 0.03 | > 0.03 7d |
| Cost | Quarterly training $ | ≤ $50 | > $80 |
| | Async serve $/1k | ≤ $0.18 | > $0.30 |
| | Batch nightly $ | ≤ $90 | > $150 |
| Pipeline | Quarterly retrain success rate | ≥ 95% | < 90% 1y rolling |
| PII | PII-leak audit (1K samples) | 0 unredacted spans | any positive |
Risks & Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Aspect schema falls behind cultural emergence | MCP summaries miss what readers actually say (per aspect-emergence counterpart) | Quarterly aspect-discovery loop; uncovered-aspect detector; SNS trigger; head-modular delta retrain |
| LLM-distill labels concentrate Sonnet's biases | Training set inherits LLM failure modes (irony blindness) | 25% per-class cap; quarterly precision audit ≥ 0.90 vs vendor; cap on JP-specific irony aspects |
| JP politeness register under-represented | JP polite-negative reviews mis-classified as neutral | Augmentation generates politeness-shifted versions; per-register slice gate |
| Cross-vendor IAA collapses mid-quarter | Subtle quality regression in next training cycle | Cross-vendor IAA computed on 10% shared subset; per-aspect κ tracked; vendor billed for re-annotation if κ < contractual floor |
| Spot-instance reclaim during 6h DDP run | Wasted compute, missed quarterly window | Checkpoint every epoch + 500 steps to S3; max_wait=28800s; on-demand fallback if cumulative reclaim > 90 min |
| Aspect-schema bump breaks OpenSearch read consumers | MCP returns empty `aspects.v3.*` while v4 backfilling | Parallel `aspects.v3.*` + `aspects.v4.*` sub-fields; SSM read-flag flip after backfill completes; rollback by flipping back |
| PII redactor regresses on JP addresses | Privacy incident | Quarterly audit re-runs latest redactor on 1% sample; any new spans found triggers full re-redaction batch |
| Nightly batch overlaps with US-MLE-07 weekly retrain | Race condition on `reviews_clean` snapshot | Cross-story snapshot pinning: this story reads snapshot ≥ 7 days old for training; nightly batch reads latest |
| Review-bombing burst saturates async queue | Backlog grows; new reviews delay > 10 min | Auto-scale to 4 instances on backlog > 5/instance; review-bomb anomaly detector pauses ingest; SEV-3 page |
| Aspect-emergence false-positive (LLM hallucinates aspect) | Wasted relabel budget | Two-filter discipline: frequency threshold across N reviews + lightweight semantic filter; humans validate all candidates |
| Schema mismatch between training and serving | Predictions written to wrong sub-field | taxonomy_snapshot.yaml saved alongside model; serving handler refuses to start if endpoint env TAXONOMY_URI version != model's snapshot |
Deep Dive — Why This Works at Amazon-Scale on the Manga Workload
The MangaAssist ABSA model is medium-sized (140M params) and serves an enormous corpus (50M cumulative reviews; ~5M re-scored nightly; ~80K new/day). Five workload properties make this design the right shape rather than the obvious "fine-tune BERT and call it done":
Workload property 1: aspect schema is alive, not constant. The catalog grows (manhwa, manhua, fan-editions, AI-art controversies); reader concerns evolve (vertical_scroll_ux was not an aspect in 2024; ai_art_authenticity was not an aspect in 2025). A frozen-taxonomy ABSA model is structurally blind to emergence — held-out F1 cannot detect missing aspects by construction. The discovery loop + uncovered-aspect detector + head-modular delta retrain is the only way to keep up without re-training the whole encoder every quarter. This is the single biggest divergence from US-MLE-01: intent has 14 stable classes; ABSA has 18+ that grow.
Workload property 2: latency budget is asymmetric across modes. New-review async tolerates ~5s p95; nightly batch tolerates ~4h wall-clock; the hot-path MCP read tolerates ~50ms (it reads pre-computed scores from OpenSearch). The choice of async + batch (not real-time) follows directly: each mode is provisioned for its actual latency budget, and the model is the same artifact. A naïve "deploy everything as real-time" would over-provision by 10× during JP off-peak.
Workload property 3: the 50M corpus is the asset, not the model. The OpenSearch review-corpus index with aspects.v3.* pre-computed is the production surface. The model exists to populate that index. This inverts the usual "model = product" framing — here, the model is a data-processing function that runs every 24 hours over millions of records. Every architectural decision (head-modular schema migration, parallel sub-field indexing, backfill orchestration) is in service of keeping the index fresh and consistent without query-time cost.
Workload property 4: bilingual aspect canonicalisation is a platform contract. The art_style aspect_id is the same whether the review was JP (作画) or EN (art_style). Downstream consumers (US-MLE-06 recommendation, the MCP's compare_sentiment tool) aggregate across languages and require consistent IDs. The taxonomy.yaml's bilingual aliases are not ergonomic sugar — they are the contract that makes cross-language summaries possible. Skipping this would force every consumer to re-implement aspect canonicalisation, a distributed-monolith failure pattern.
Workload property 5: spam-clean corpus is a hard upstream dependency. ABSA quality is bounded by spam recall. If US-MLE-07 misses 5% of spam, that 5% poisons training (spam reviews tend to be extreme-positive) and skews the polarity distribution toward false positive. The cross-story snapshot-pinning (read US-MLE-07's snapshot ≥ 7 days old for training; latest for nightly batch) is the structural protection that makes this story's quality contract independent of US-MLE-07's current quality and dependent only on US-MLE-07's promoted quality.
These five properties together explain why the head-modular schema migration + async/batch dual-mode + bilingual canonicalisation + cross-story snapshot pinning machinery exists. A naïve quarterly retrain over a frozen taxonomy with real-time serving would silently regress on emerging aspects, over-provision compute by 10×, and inherit US-MLE-07's transient quality bugs as training poisoning.
Real-World Validation
Industry analogues. Amazon's product-review aspect extraction team (the team behind the "Product features mentioned" section of product detail pages) uses an analogous multi-task architecture with a versioned aspect taxonomy and head-modular retraining for new aspects. Their published 2024 architecture overview describes the same uncovered-aspect surfacing pattern (a phrase like "buyers also discussed [phrases]") under a different name. Yelp's review-aspect team published a 2023 paper describing aspect-schema versioning with effective_from semantics very close to this story's aspect_taxonomy.yaml. Booking.com's review-NLP team uses a Sonnet-distilled augmentation loop with similar capping rules (their cap is 30%; this story's is 25%, slightly stricter because manga reviews are more domain-specific than hotel reviews).
Math validation — async serving cost. On ml.g5.2xlarge at $1.515/hr in ap-northeast-1, with auto-scale 0–4 averaging 1.2 instances: $1.515 × 1.2 × 24 × 30 = $1,309/month. Per-review cost: $1,309 / (80,000 × 30) = $0.000545 per review, or $0.55 per 1K reviews. Attributed to async-scored reviews alone this exceeds the $0.18/1K target, but the target applies to all reviews touched per month, i.e. async + batch combined (next paragraph).
Math validation — nightly batch cost. g5.12xlarge × 4 nodes at $5.672/hr × 4 × 2.5h × 30 nights/month = $1,701/month. Per-review batch cost: $1,701 / (5,000,000 × 30) = $0.0000113 per review, or $0.011 per 1K. Combined async + batch: $3,010/month total, against ~50M cumulative reviews touched/month → $0.060 per 1K reviews touched. Fits the target.
Math validation — training cost. g5.12xlarge at $5.672/hr × 6h spot (~70% spot discount) = $5.672 × 6 × 0.30 = $10.21/run + ~$35 in spot-reclaim recovery overhead → ~$45/run quarterly. Yearly: $180. Immaterial against the $36K/year combined async+batch.
Math validation — label volume. 5K vendor + 50K LLM-distill (capped to 25%) per quarter → effective training-set growth = 5K + min(50K, 0.25 × 150K) = 5K + 37.5K = ~42K per quarter, or ~28% per-quarter cumulative-set turnover. This is a healthy refresh rate for a quarterly cumulative-window model — it sees enough new signal to track drift without overwhelming the prior. The 18-month exponential decay τ on training weights means a 36-month-old label retains only e^-2 ≈ 0.14 of full weight, which gives the model effective "cumulative with forgetting" semantics.
Math validation — schema-migration delta retrain. Adding 2 new aspects: ~16K delta-labeled examples, 1 epoch on g5.2xlarge (1× A10G), batch size 32, ~500 steps, ~30 min × $1.515/hr = $0.76 + spot-reclaim buffer ~$3 total. Compared to a full retrain at $45, the delta is 15× cheaper, which is what makes the head-modular pattern operationally viable for quarterly schema bumps.
Cross-Story Interactions
| Edge | Interface | Contract | Conflict mode |
|---|---|---|---|
| US-MLE-07 (spam) → US-MLE-03 | `reviews_clean` Iceberg snapshot | This story trains on snapshot ≥ 7 days old; reads latest for nightly batch | If US-MLE-07's promotion rolls back affecting > 5% of last 24h, this story delays nightly batch |
| US-MLE-03 → US-MLE-06 (recommendation) | aspect scores as features | `aspects.v3.*` bilingual-canonical IDs; pinned to `absa_model_v22` in cache key | If US-MLE-03 promotes v23 mid-week, US-MLE-06 reads stale v22 scores until OpenSearch backfill completes; tolerable because aspect drift is slow |
| US-MLE-03 → RAG-MCP-Integration Review-Sentiment MCP | `aspects.v3.*` pre-computed in OpenSearch; per-volume aggregates in DynamoDB | MCP reads `aspects.<active_version>.*` controlled by SSM flag | Schema migration v3 → v4: parallel sub-fields populated; flag flip is atomic |
| US-MLE-03 ↔ US-MLE-01 (intent) | both consume cross-vendor IAA infra; both report bilingual slice metrics | IAA platform contract; per-language slice gates | None (parallel users of the same platform) |
| US-MLE-03 ← Ground-Truth-Evolution ML-02 (sentiment domain shift) | drift detection patterns | drift-eval set, regression set, active-learning 50/30/20 mix adapted to 40/25/20/10/5 | The drift counterpart is the "what breaks over time" lens for this story |
| US-MLE-03 ← Ground-Truth-Evolution ML-06 (aspect emergence) | schema versioning + uncovered-aspect path | `aspect_taxonomy.yaml` with `effective_from`; uncovered-aspect detector | The aspect-emergence counterpart is the trigger source for v3 → v4 migration |
| US-MLE-03 → Cost-Optimization US-04 (compute) | async + batch SageMaker compute | scale-to-zero on async; spot for batch + training | Async backlog SLA must hold even when scale-to-zero policy is more aggressive |
Rollback & Experimentation
Shadow Mode Plan
- Duration: 72 hours minimum on async (3× longer than US-MLE-01 because per-review aspect predictions are multi-label and noisier; need 3× more samples for power).
- Sample size: ~240K async (80K/day × 3) + ~15M batch over 3 days ≈ 15M shadow predictions; statistical power overwhelming.
- Pass criteria: per-aspect agreement with v22 between 60% and 90% (multi-label; lower than sentiment's 70-90% range); p99 async latency ≤ 1.3× v22; batch-transform wall-clock ≤ 1.2× v22.
- Slice criteria: per-language agreement within ±5pp of global; per-format agreement (manga / manhwa / manhua) within ±7pp; uncovered-aspect rate ≤ v22's rate (regression here means the new model is worse at coverage, not better).
Canary Thresholds
- Phase A (10% async, 3 days): per-aspect accuracy on the running fresh-label sample ≥ v22 baseline; sentiment macro-F1 not regressed.
- Phase B (10% async + 5% batch, 4 days): same as Phase A plus content-ops 30-summary spot-check showing ≥ 90% agreement; per-volume aggregate DynamoDB items show no abrupt shifts (KS on per-volume avg sentiment ≤ 0.10).
- Phase C (full promote with backfill): after the 5M-review backfill completes, the MCP read-flag flips; v22's `aspects.v3.*` scores are retained 30 days for rollback.
Kill-Switch Flags
- `absa_promotion_enabled` (default: false; SSM Parameter Store `/manga-ml/absa/promotion_enabled`).
- `absa_async_paused` — when true, the async endpoint queue drains but accepts no new invocations; new reviews stack in the SQS DLQ for replay after unpause.
- `absa_batch_paused` — when true, the nightly batch is skipped; OpenSearch `aspects.*` field staleness grows.
- `absa_taxonomy_version_active` — SSM string; controls which `aspects.vX.*` sub-field the MCP reads. Atomic flip during v3 → v4 migration.
- `global_ml_freeze` — overrides all of the above. Per README kill-switch precedence.
Quality Regression Criteria (Hard Rollback)
A canary that satisfies any of these is automatically rolled back:
- Sentiment macro-F1 on the running fresh sample regresses by > 0.03 absolute on any 24h window.
- Per-aspect F1 on any of the top-6 aspects regresses by > 0.05 absolute on any 24h window.
- Async p99 latency exceeds 8s for any 1h window.
- Async error rate exceeds 1% for any 5min window.
- Batch-transform fails (incomplete output) for any single nightly run.
- Content-ops 30-summary agreement drops below 80% for any 1-week window.
- PII-leak audit detects any unredacted spans in 1K-sample inference output.
Rollback path: SageMaker async endpoint update (model package v22) ≤ 5 minutes; nightly batch job ARN flipped to v22 in EventBridge; the MCP read-flag stays on `aspects.v3.*` if v23 was a same-schema candidate, or is flipped back to `aspects.v3.*` if v23 was a v4-schema candidate.
Multi-Reviewer Validation Findings & Resolutions
S1 — Must Fix Before Production
ML Scientist lens: The multi-task loss weighting (1.0 / 0.7 / 0.5) is an opening guess, not a calibrated value. With sentiment as the anchor task and aspect as multi-label sigmoid, the gradient magnitudes are not naturally comparable — a fixed weighting will favour whichever head's gradients are larger by happenstance. Resolution: Implement GradNorm-style adaptive task weighting that rebalances each epoch based on per-head loss-decrease rate; treat the (1.0, 0.7, 0.5) values as initial conditions, not constants. Documented in train.py with the GradNorm implementation hidden behind a --adaptive-weights flag, default true.
Application Security / Privacy lens: PII redaction at ingestion is good but the training-set materialization re-reads from reviews_clean and trusts the redaction was already applied. If the redactor was buggy at the time the review was ingested, the bug propagates into training. Resolution: Defensive re-redaction at training-set load using the latest redactor version; quarterly audit re-runs the latest redactor on the full corpus and flags any reviews with new spans found, triggering surgical re-write.
SRE lens: A 6-hour DDP training job on spot is fine on average but the worst-case-reclaim profile is bad — three reclaims at hour 4, 5, 5.5 wastes ~90 minutes of compute and pushes outside the quarterly window. Resolution: On-demand fallback policy — if cumulative reclaim time exceeds 90 minutes, the next checkpoint resume runs on-demand. Documented in Runbooks/absa-spot-fallback.md. Same pattern as US-MLE-01.
Data Engineering lens: The aspect-emergence detector samples 20K reviews/day for coverage-ratio computation — at 80K/day that's 25% of new reviews. Over a quarter that's 7M coverage-ratio computations, which is materially expensive. Resolution: Reduce daily sample to 5K, increase the sustained-window from 14 days to 21 days (more days × fewer samples preserves statistical power); validated against historical 2-year coverage data showing equivalent detection lag.
S2 — Address Before Scale
Data Engineering lens (S2): OpenSearch index with parallel aspects.v3.* + aspects.v4.* sub-fields doubles storage during migration windows. At 50M docs × ~4KB aspects payload × 2 = ~400GB of duplicate data. Resolution: schema-migration TTL policy — aspects.<old_version>.* is dropped 30 days post-flip; storage rolls back. Documented in Runbooks/absa-schema-migration-storage.md.
FinOps lens: Combined async + batch + training cost is ~$3,200/month — under budget but trending up as catalog grows. Resolution: ABSA cost participates in the Cost-Optimization quarterly review (Cost-Optimization US-04 compute optimization). Reserve the option to move nightly batch to Graviton-based instances if/when SageMaker offers a cost-effective Graviton path for DeBERTa-v3 (currently A10G is faster per dollar but the gap is closing).
ML Scientist lens (S2): Offline-online correlation between aspect-F1 and content-ops summary-agreement has been 0.52 over the last 4 retrains — below the 0.6 trustworthy floor. This means offline aspect-F1 is only a partial leading indicator of the online quality the MCP cares about. Resolution: Per the foundations doc §4.3, ABSA is flagged as "always extended canary" — Phase B runs at 5% batch coverage for 7 days instead of 4; the content-ops summary-agreement metric is the primary online gate. Tracked for re-evaluation each quarter.
S3 — Future Work
Principal Architect lens: The 18-aspect taxonomy is shared across manga, manhwa, manhua. Aspects emerge differently per format (vertical scroll for webtoons; binding for collectors). A federated taxonomy per the aspect-emergence counterpart's architect-level escalation 1 — shared core aspects + per-format extensions — may be the right v2 architecture. Tracked as a v2 backlog item; revisit after fan-fiction support lands (catalog-expansion-driven).
ML Scientist lens (S3): Replace LLM-distill (Sonnet) with a self-distilled student model fine-tuned on the in-house labelled set. The student would be smaller, cheaper to run for label generation, and free of Sonnet's version-instability. Defer until the LLM-distill cost crosses ~$5K/quarter (currently ~$1.5K).
SRE lens (S3): Multi-region active-active. Currently this story serves only ap-northeast-1. EU/US expansion would require cross-region replication of reviews_clean, the model artifact, and the OpenSearch index — non-trivial residency engineering. Tracked under data-residency expansion roadmap.
FinOps lens (S3): Investigate whether the nightly batch ~5M reviews re-score is actually necessary for all 5M, or whether the subset of edit-events that meaningfully shift aspect scores is much smaller (~500K). If yes, the nightly batch shrinks 10× and cost drops to ~$170/month. Requires building an aspect-impact-of-edit predictor; defer until ROI is clear.
Content-Ops lens (S3): The uncovered-aspect detector currently fires SNS to ML-Eng on-call. As the discovery loop matures, route the SNS to a content-ops dashboard with a "promote to v4 candidate" workflow built directly into the dashboard — reduces the 14-day-detection → human-review latency. Tracked alongside the content-ops tooling roadmap.