
US-MLE-07: Spam/Abuse Classifier + Adversarial Labels

User Story

As an ML Engineer at Amazon-scale on the MangaAssist team, I want to own a stacked-ensemble spam/abuse classifier (LightGBM first-pass + distilled-Sonnet student model on the uncertain band) with a weekly adversarial-augmented retrain, all-four-source label aggregation, real-time pre-publish gating, and per-attack-pattern drift detection, So that the chatbot's downstream review corpus stays clean — the ABSA model (US-MLE-03), the review-summary RAG path, and the platform Trust & Safety dashboard all consume reviews that have been gated against an adversary that adapts its attack patterns weekly.

Acceptance Criteria

  • Real-time endpoint classifies every new review submission before publish; p95 ≤ 200 ms including the slow-path student-model branch.
  • Real-time endpoint p99 ≤ 350 ms; p50 ≤ 35 ms (LightGBM-only short-circuit on the confident bands).
  • Nightly batch transform re-classifies the rolling 30-day published-review window; finishes by 06:00 JST.
  • Precision @ 95% recall ≥ 0.93 on the frozen golden set (asymmetric: blocking a legit user >> missing spam).
  • False-positive rate on legitimate reviews ≤ 0.5% (the cost-of-blocking-a-real-user contract).
  • Per-attack-pattern recall ≥ 0.85 across all currently-active adversarial patterns; adversarial drift detection lag ≤ 7 days from first sighting to detection alarm.
  • Weekly retrain runs Tuesday 03:00 JST off-peak; finishes by 09:00 JST. Both LightGBM and student model promoted as a coordinated pair (never one without the other).
  • Spam-cleaned review corpus is published by 09:30 JST so US-MLE-03 ABSA's weekly retrain (Wed 03:00 JST) reads a 30-day cleaned window.
  • All four label sources (vendor + implicit + programmatic + LLM-distilled) feed the label platform; aggregation rules resolve disagreements deterministically and are audit-logged.
  • Adversarial training set entries carry attack_pattern_id and a higher loss weight (default 3.0×) to combat catastrophic forgetting on rare attack classes.
  • Promoted model can be rolled back to the previous version via SageMaker traffic shift in ≤ 60 seconds.
  • Per-language adversarial test sets (EN, JP) gate promotion separately; either language regressing > 1% absolute on adversarial is hard-fail.
  • Per-request cost ≤ $0.0044 ($4.40 per 1K requests) at the contracted traffic mix.

Architecture (HLD)

The Production Surface

The spam classifier sits on the review-submission write path between the application gateway and the review-store write API. Every review (~80K/day at the contracted volume — same flow that US-MLE-03 ABSA reads downstream) goes through this model before the review is published. A review classified spam is held in the moderation queue; a review classified legitimate is written to the review store and queued for ABSA enrichment.

There are two serving surfaces:

  • Real-time endpoint (spam-classifier-realtime) — the pre-publish gate. p95 ≤ 200 ms is the contract; this is conservative because the slow-path student-model branch (Claude-Sonnet distilled, 350M params) only fires on the LightGBM-uncertain band [0.2, 0.8]. p50 stays well under the 35 ms contract because ~75% of reviews short-circuit on the LightGBM confident bands.
  • Nightly batch transform (spam-classifier-batch) — re-classifies the rolling 30-day published-review window. Catches drift on previously-published reviews when a new attack pattern emerges that the v(t-7d) classifier missed at write-time but the v(t) classifier catches. Reviews newly classified spam are auto-suppressed in the moderation queue with a re-classified post-publish audit reason.

The model is a stacked ensemble:

| Layer | Model | Role | Cost |
|---|---|---|---|
| L1 (fast first-pass) | LightGBM | TF-IDF + URL features + user-behavior features (account age, post velocity, IP entropy) | ~3 ms / review on ml.m5.2xlarge |
| L2 (slow precision) | Distilled-Claude-Sonnet student (350M params) | Text classifier; fires only on LightGBM score ∈ [0.2, 0.8] | ~110 ms / review on ml.g5.xlarge |
| L3 (stacker) | Logistic regression | Fuses (L1_score, L2_score?, behavioral_features) → final spam probability | < 0.1 ms |

The taxonomy is bilingual JP/EN with seven attack classes:

bot_5star               copy_paste_promo        ai_generated_review
upvote_ring             subtle_shill            indirect_injection
off_topic_political     // and "legitimate" as the 8th terminal label

JP-specific subtypes include cosmetic_vendor_zenkaku (cosmetic-vendor spam in 全角 / full-width characters bypassing English regex) and kanji_throwaway_account_farm (throwaway accounts with kanji name patterns from a discovered farm). EN-specific subtypes include crypto_pump_promo and competitor_smear. Per-language adversarial sets gate promotion separately.

Cross-Story Position

flowchart LR
    SUB[Review Submission<br/>~80K/day<br/>JP 78%, EN 22%]
    GATE[US-MLE-07 RT Endpoint<br/>Pre-publish gate]
    HOLD[Moderation Queue<br/>spam-flagged]
    STORE[Review Store<br/>Aurora + Iceberg<br/>50M cumulative]
    BATCH[US-MLE-07 Batch<br/>Nightly 30d rolling]
    ABSA[US-MLE-03 ABSA<br/>weekly retrain<br/>Wed 03:00 JST]
    REC[US-MLE-06 Recommendation<br/>aspect signal]
    TS[Trust &amp; Safety Dashboard<br/>per-class metrics]

    SUB --> GATE
    GATE -- spam --> HOLD
    GATE -- legitimate --> STORE
    STORE --> BATCH
    BATCH -- newly-flagged --> HOLD
    STORE -- spam-cleaned 30d --> ABSA
    ABSA -- aspect signal --> REC
    GATE -.metrics.-> TS
    BATCH -.metrics.-> TS

    style GATE fill:#fd2,stroke:#333
    style BATCH fill:#fd2,stroke:#333
    style ABSA fill:#9cf,stroke:#333
    style TS fill:#f66,stroke:#333

End-to-End ML Lifecycle Diagram

flowchart TB
    subgraph DATA[Data Layer - all four label sources]
        L1[Vendor Labels<br/>Appen + Sama<br/>~5K/month primary]
        L2[Implicit Labels<br/>User flag rate<br/>soft positive signal]
        L3[Programmatic Labels<br/>URL allow/denylist<br/>throwaway-email patterns<br/>profanity dict]
        L4[LLM-Distilled Labels<br/>Claude Sonnet<br/>on uncertain band]
        L5[Label Platform<br/>Iceberg on S3<br/>multi-source aggregation]
        L1 --> L5
        L2 --> L5
        L3 --> L5
        L4 --> L5
    end

    subgraph ADV[Adversarial Augmentation]
        A1[Red-team weekly<br/>~500 new patterns/wk]
        A2[Threat-intel feeds]
        A3[Production triage<br/>FP/FN corrections]
        A4[Adversarial Catalog<br/>versioned, class-stratified]
        A1 --> A4
        A2 --> A4
        A3 --> A4
    end

    subgraph FEAT[Feature Layer]
        F1[Feature Store<br/>schema_v3.4]
        F2[user-behavior FG<br/>account age, post velocity, IP entropy]
        F3[review-text FG<br/>TF-IDF, URL count, lang flag]
        F1 -.-> F2
        F1 -.-> F3
    end

    subgraph TRAIN[Training Pipeline - SageMaker Pipelines]
        T1[Step 1<br/>Data Validation<br/>+ 4-source aggregation]
        T2[Step 2<br/>Feature Materialization<br/>PIT join]
        T3[Step 3<br/>Adversarial Augmentation<br/>loss-weight 3.0x]
        T4[Step 4a<br/>LightGBM Train<br/>m5.4xlarge ~30min]
        T5[Step 4b<br/>Student Model FT<br/>g5.2xlarge ~90min]
        T6[Step 5<br/>Stacker LR fit]
        T7[Step 6<br/>Offline Eval - 5 modes<br/>+ per-attack-pattern]
        T8[Step 7<br/>Threshold Calibration<br/>P at 95R]
        T9[Step 8<br/>Model Registration<br/>coordinated pair]
        T1 --> T2 --> T3
        T3 --> T4
        T3 --> T5
        T4 --> T6
        T5 --> T6
        T6 --> T7 --> T8 --> T9
    end

    subgraph SERVE[Serving Layer]
        S1["Model Registry<br/>v(LGBM)47 + v(student)47"]
        S2[Shadow Endpoint<br/>v48 parallel]
        S3[Canary 1% to 5% to 25%]
        S4[RT Endpoint<br/>m5.2xlarge LGBM<br/>g5.xlarge student MME]
        S5[Batch Transform<br/>nightly 30d rolling]
        S6[Auto-Abort Daemon]
        S1 --> S2 --> S3 --> S4
        S1 --> S5
        S6 -.monitors.-> S3
    end

    subgraph DRIFT[Drift Detection]
        D1[Drift Hub<br/>PSI/KS/Chi-sq]
        D2[Adversarial Drift<br/>per-attack-pattern recall]
        D3[CloudWatch Alarms]
        D1 --> D3
        D2 --> D3
        D3 -.triggers.-> T1
    end

    L5 --> T1
    A4 --> T3
    F1 --> T2
    T9 --> S1
    S4 -.predictions.-> D1
    S4 -.per-pattern logs.-> D2

    style L5 fill:#9cf,stroke:#333
    style A4 fill:#fde68a,stroke:#92400e
    style S1 fill:#fd2,stroke:#333
    style S6 fill:#f66,stroke:#333
    style D2 fill:#f66,stroke:#333

Data Contracts and Volume

| Asset | Type | Schema Version | Snapshot Cadence | Volume | Owner |
|---|---|---|---|---|---|
| spam_labels | Iceberg table (4-source) | label_v4 | Continuous (label platform) | ~5M cumulative; ~70K/week added (5K vendor + 50K programmatic + 13K implicit + 2K LLM-distilled) | Label Platform PM |
| adversarial_catalog | class-stratified catalog | catalog_v9 | Daily | ~28K cumulative; ~500/week red-team additions | T&S Lead |
| spam_features | feature group | schema_v3.4 | 1 h batch + online realtime | ~80K serving rows/day; 50M cumulative | ML Eng (this story) |
| spam_classifier_lgbm | model package | model_v47 in prod | Weekly | LightGBM ~22 MB + TF-IDF vocab ~80 MB | ML Eng (this story) |
| spam_classifier_student | model package | model_v47 in prod | Weekly | 350M-param student ~700 MB | ML Eng (this story) |
| spam_predictions_log | Iceberg | log_v3 | Continuous | ~80K rows/day; retained 90 d hot, 18 mo Glacier | Data Platform |
| spam_drift_metrics | CloudWatch | n/a | 5 min (input/pred); daily (per-attack-pattern recall) | ~85K data points/day | Drift Hub |

Model Registry + Coordinated Promotion Gates

The headline novelty here vs US-MLE-01: there are two model artifacts (LightGBM and student) that promote together as a coordinated pair. The stacker LR is a third artifact but is small and re-fit at every retrain so it never drifts independently.

flowchart LR
    P47[Pair v47<br/>LGBM_v47 + Student_v47<br/>prod, label_v=2840]:::prod
    P48[Pair v48<br/>LGBM_v48 + Student_v48<br/>candidate, label_v=2905]:::cand

    P48 --> G1{Stage 1<br/>Offline Gate<br/>P at 95R per-language<br/>per-attack-pattern recall}
    G1 -->|pass| G2{Stage 2<br/>Shadow 24h}
    G1 -.fail.-> RB1[Rollback v47]
    G2 -->|pass| G3{Stage 3<br/>Canary 7d/3d/7d<br/>FPR contract}
    G2 -.fail.-> RB2[Rollback v47]
    G3 -->|pass| G4[Stage 4<br/>Full Promote v48]
    G3 -.fail.-> RB3[Rollback v47]

    classDef prod fill:#2d8,stroke:#333
    classDef cand fill:#fd2,stroke:#333
    style RB1 fill:#f66,stroke:#333
    style RB2 fill:#f66,stroke:#333
    style RB3 fill:#f66,stroke:#333
    style G4 fill:#2d8,stroke:#333

The four-stage promotion gate is the standard from deep-dives/00-foundations-and-primitives-for-ml-engineering.md §5.1. Story-specific thresholds:

  • Stage 1 (offline): P @ 95R ≥ 0.93 on golden set (per-language); FPR ≤ 0.5% on legitimate-review holdout; per-attack-pattern recall ≥ 0.85 for every currently-active pattern.
  • Stage 2 (shadow): per-request agreement with v47 between 92% and 99% (a fast-pass classifier should agree with prior version on the easy band; disagreements concentrate on the uncertain band where the slow path fires). Shadow latency p95 ≤ 1.2× v47.
  • Stage 3 (canary): live FPR ≤ 0.5% (measured via appeal rate); held-back-corpus per-attack-pattern recall stable; downstream ABSA Friday-eval doesn't regress.
  • Stage 4 (full): traffic shift to 100%; old pair v47 retained for 21 days as rollback target (longer than US-MLE-01's 14 days because adversarial drift can mask itself for up to 7 days).
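The "never one without the other" invariant from the acceptance criteria is enforced twice: at registration (the pair-version tag) and again at deploy, where the gateway refuses a mismatched pair. A minimal sketch of the deploy-time check, assuming pair versions are carried as a pair_version tag on each endpoint (the tag name and helper are illustrative, not a registry API):

# pair_guard.py - illustrative deploy-time coordinated-pair check
import boto3

sm = boto3.client("sagemaker", region_name="ap-northeast-1")


def _pair_tag(endpoint_name: str) -> str:
    arn = sm.describe_endpoint(EndpointName=endpoint_name)["EndpointArn"]
    tags = sm.list_tags(ResourceArn=arn)["Tags"]
    tag = next((t["Value"] for t in tags if t["Key"] == "pair_version"), None)
    if tag is None:
        raise RuntimeError(f"{endpoint_name} carries no pair_version tag")
    return tag


def assert_pair_consistent(lgbm_endpoint: str, student_endpoint: str) -> str:
    """Refuse to serve if LightGBM and student carry different pair versions."""
    lgbm_pair = _pair_tag(lgbm_endpoint)
    student_pair = _pair_tag(student_endpoint)
    if lgbm_pair != student_pair:
        raise RuntimeError(
            f"coordinated-pair violation: LGBM={lgbm_pair}, student={student_pair}"
        )
    return lgbm_pair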

Low-Level Design

1. Feature / Data Pipeline

The spam classifier reads three feature groups from the SageMaker Feature Store, all pinned to schema_v3.4:

  • Review-text features (computed online): TF-IDF over 50K vocab (separate JP / EN vocabs, joined via language detector), URL count, has-money-symbol flag, character set distribution, full-width-character ratio (catches cosmetic_vendor_zenkaku), perplexity from a small N-gram LM (soft AI-generated signal), language detector output.
  • User-behavior features (computed via 1-hour batch + 5-min Redis aggregation): account age (days since registration), post velocity (reviews / hour over last 24h, last 7d), IP entropy (Shannon entropy over recent IPs the account has posted from — low entropy = burst from one IP, suggests automation), email-domain throwaway-flag (regex against a maintained list), prior moderation-strike count.
  • Review-context features (computed online): title-id, time-since-title-release (organic burst window vs late-night-shill window), reviewer-vs-purchaser flag (from order-store join), session_id.
# feature_pipeline.py
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

import boto3
import pandas as pd
import sagemaker


@dataclass
class SpamTrainingExample:
    example_id: str               # sha256(canonical_review_text + reviewer_id + captured_at)
    reviewer_id: str              # hashed
    title_id: str
    captured_at: datetime
    label_value: str              # "spam" | "legitimate"
    attack_pattern_id: Optional[str]  # e.g., "ai_generated_v6", null if legitimate
    label_version: int
    label_source: str             # "vendor" | "implicit" | "programmatic" | "llm_distill"
    label_loss_weight: float      # 3.0 for adversarial, 1.0 for routine
    review_text: str              # canonical, NFKC-normalized
    language_detected: str        # "en" / "ja" / "mixed"


class SpamFeaturePipeline:
    """PIT-correct feature read for spam-classifier training.

    Implements primitive §2.1 from the foundations doc. user-behavior features
    have non-zero serving lag (1h) — critical because account_age and
    post_velocity at training time must reflect what the serving system saw at
    review-submit time, not what they look like a week later when the label is
    finalized by a vendor.
    """

    def __init__(self, region: str = "ap-northeast-1"):
        self.session = sagemaker.Session(boto3.Session(region_name=region))
        # PITFeatureReader and FeatureCatalog are project-internal helpers; the
        # reader wraps the Feature Store offline store (Athena time travel) to
        # serve as-of reads, which the online GetRecord API does not support.
        self.fs = PITFeatureReader(self.session)
        self.catalog = FeatureCatalog.load("schema_v3.4")
        self.serving_lag = self.catalog.serving_lag_per_group()

    def materialize(
        self,
        examples: list[SpamTrainingExample],
        feature_groups: list[str],
    ) -> "pd.DataFrame":
        """Read PIT-correct features for every example."""
        rows = []
        for ex in examples:
            row = {
                "example_id": ex.example_id,
                "label": ex.label_value,
                "loss_weight": ex.label_loss_weight,
                "attack_pattern_id": ex.attack_pattern_id,
                "language": ex.language_detected,
            }
            for fg_name in feature_groups:
                lag = self.serving_lag[fg_name]
                as_of = ex.captured_at - lag
                features = self.fs.get_record(
                    feature_group_name=fg_name,
                    record_identifier_value_as_string=ex.reviewer_id,
                    as_of_timestamp=as_of,
                )
                row.update(features)
            rows.append(row)
        return pd.DataFrame(rows)

The leak detector (the standard primitive §2.1 implementation) caught one production-impacting issue on this pipeline: prior_moderation_strike_count was being computed against the moderation table at training-snapshot time rather than captured_at time, which leaks future strikes into past examples. The fix was to read the feature with as_of=captured_at - 1h (the documented serving lag) and add a deploy-time lint preventing direct table reads.
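A minimal sketch of what the probe measures, assuming the leak score is a label-predictiveness gap between the snapshot-time read and the PIT read on a 1% sample (the pipeline's --leak-score-abort-threshold is on its own calibrated scale; the function name is illustrative):

# leak_probe.py - illustrative leak probe on a sampled feature column
import numpy as np
from sklearn.metrics import roc_auc_score


def leak_gap(labels: np.ndarray, pit_values: np.ndarray,
             snapshot_values: np.ndarray) -> float:
    """How much more label-predictive the snapshot-time read is than the
    PIT (captured_at - serving_lag) read of the same feature.

    labels: 1 for spam, 0 for legitimate. Near-zero gap is safe; a large
    positive gap means the snapshot read 'knows the future' relative to
    what serving could have seen at review-submit time."""
    return roc_auc_score(labels, snapshot_values) - roc_auc_score(labels, pit_values)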

2. The Four-Source Label Aggregation (Headline Pattern)

This is the only story on the platform that uses all four label sources from primitive §1.1. Disagreement between sources is structural, not noise — vendor labels are slow but high-quality, implicit labels are fast but soft, programmatic is fast but rule-set-biased, LLM-distilled is medium but inherits LLM biases. The aggregation rules are deterministic and audit-logged:

# label_aggregator.py
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class LabelSource(Enum):
    VENDOR = "vendor"
    IMPLICIT = "implicit"
    PROGRAMMATIC = "programmatic"
    LLM_DISTILL = "llm_distill"


@dataclass
class LabelRecord:
    """One source's opinion on a single example_id."""
    label_source: str                    # a LabelSource value
    label_value: str                     # "spam" | "legitimate"
    review_status: Optional[str] = None  # vendor-only, e.g. "qa_passed"


@dataclass
class AggregatedLabel:
    value: str
    source_used: str
    confidence: float
    audit_disagreements: list = field(default_factory=list)
    low_confidence: bool = False


# Source confidence weights — calibrated quarterly against vendor-relabeled gold.
SOURCE_WEIGHTS = {
    LabelSource.VENDOR:       0.85,   # high quality but slow
    LabelSource.PROGRAMMATIC: 0.70,   # rule-set-biased but precise on rule-set's coverage
    LabelSource.LLM_DISTILL:  0.55,   # medium; for uncertain-band coverage only
    LabelSource.IMPLICIT:     0.30,   # noisy; positive signal from user-flag rate only
}


def aggregate_labels(records: list[LabelRecord]) -> AggregatedLabel:
    """Resolve disagreement across four sources for a single example_id.

    Rules (in order):
    1. If vendor label exists and review_status="qa_passed", VENDOR WINS. Period.
       The other sources are recorded for audit but the vendor label is canonical.
    2. Otherwise take a confidence-weighted vote; accept when 2+ sources voted
       and the winner holds >= 0.65 of the total weight.
    3. Otherwise if only one source has a label, use it but flag as
       low_confidence — the example will receive lower training weight (0.5x)
       and is excluded from holdout evaluation.
    4. Disagreements are logged to spam_label_disagreements Iceberg table for
       quarterly calibration audit.
    """
    by_source = {r.label_source: r for r in records}

    # Rule 1: vendor wins on qa_passed
    vendor = by_source.get(LabelSource.VENDOR.value)
    if vendor and vendor.review_status == "qa_passed":
        # Audit log if any other source disagreed
        disagreements = [
            (s, r.label_value)
            for s, r in by_source.items()
            if s != LabelSource.VENDOR.value and r.label_value != vendor.label_value
        ]
        if disagreements:
            log_disagreement(vendor, disagreements)  # project helper: writes spam_label_disagreements
        return AggregatedLabel(
            value=vendor.label_value,
            source_used=LabelSource.VENDOR.value,
            confidence=1.0,
            audit_disagreements=disagreements,
        )

    # Rule 2: weighted majority
    weighted = {}
    for source, record in by_source.items():
        w = SOURCE_WEIGHTS[LabelSource(source)]
        weighted[record.label_value] = weighted.get(record.label_value, 0) + w
    if not weighted:
        return None
    winner = max(weighted, key=weighted.get)
    total = sum(weighted.values())
    confidence = weighted[winner] / total
    if confidence >= 0.65 and len(by_source) >= 2:
        return AggregatedLabel(
            value=winner,
            source_used="weighted_majority",
            confidence=confidence,
            audit_disagreements=[],
        )

    # Rule 3: low confidence (single source, or a multi-source vote below 0.65)
    return AggregatedLabel(
        value=winner,
        source_used="single_source_low_conf",
        confidence=confidence,
        audit_disagreements=[],
        low_confidence=True,
    )

The disagreement audit table has caught two real production issues: a programmatic rule that flagged any review with a URL as spam (catching legitimate reviewer-blog citations 3% of the time, contradicting vendor labels), and an LLM-distillation prompt that under-flagged subtle_shill because Sonnet was instructed to be "conservative on borderline cases." Both were fixed via prompt-update / rule-update; the disagreement-rate dashboard catches future regressions of the same shape.

3. Training Pipeline (SageMaker Pipelines)

# training_pipeline.py
from sagemaker.estimator import Estimator
from sagemaker.huggingface.estimator import HuggingFace
from sagemaker.processing import ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.functions import JsonGet
from sagemaker.workflow.parameters import ParameterString, ParameterInteger
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.steps import ProcessingStep, TrainingStep


def build_pipeline(role: str, region: str = "ap-northeast-1") -> Pipeline:
    label_version = ParameterInteger(name="MaxLabelVersion", default_value=0)
    feature_schema = ParameterString(name="FeatureSchemaVersion", default_value="schema_v3.4")
    adv_catalog_version = ParameterString(name="AdversarialCatalogVersion", default_value="v9")

    # Step 1: Data validation + 4-source aggregation
    data_val = ProcessingStep(
        name="SpamDataValidation",
        processor=SKLearnProcessor(
            framework_version="1.3-1",
            role=role,
            instance_type="ml.m5.4xlarge",
            instance_count=1,
        ),
        code="src/data_validation.py",
        job_arguments=[
            "--label-version", label_version.to_string(),
            "--min-iaa-en", "0.75",
            "--min-iaa-jp", "0.75",
            "--required-attack-class-coverage", "7",
            "--min-examples-per-attack-class-jp", "150",
            "--min-examples-per-attack-class-en", "150",
            "--source-weights-config", "s3://manga-ml-config-apne1/spam/source_weights_v4.yaml",
            "--disagreement-audit-table", "spam_label_disagreements",
        ],
    )

    # Step 2: PIT-correct feature materialization
    feature_mat = ProcessingStep(
        name="SpamFeatureMaterialization",
        processor=SKLearnProcessor(
            framework_version="1.3-1",
            role=role,
            instance_type="ml.m5.4xlarge",
            instance_count=1,
        ),
        code="src/feature_materialize.py",
        depends_on=[data_val],
        job_arguments=[
            "--feature-schema", feature_schema.to_string(),
            "--leak-detector-pct", "0.01",
            "--leak-score-abort-threshold", "0.85",
        ],
    )

    # Step 3: Adversarial augmentation with loss-weight upweighting
    adv_aug = ProcessingStep(
        name="SpamAdversarialAugmentation",
        processor=SKLearnProcessor(
            framework_version="1.3-1",
            role=role,
            instance_type="ml.m5.4xlarge",
            instance_count=1,
        ),
        code="src/adversarial_augment.py",
        depends_on=[feature_mat],
        job_arguments=[
            "--catalog-version", adv_catalog_version.to_string(),
            "--adversarial-loss-weight", "3.0",     # combats catastrophic forgetting
            "--regression-set-loss-weight", "2.0",  # old attacks must keep being caught
            "--seed", "42",
        ],
    )

    # Step 4a: LightGBM training
    lgbm_estimator = Estimator(
        image_uri=lightgbm_training_image_uri(region),  # project helper over sagemaker.image_uris.retrieve
        role=role,
        instance_type="ml.m5.4xlarge",
        instance_count=1,
        use_spot_instances=True,
        max_wait=7200,
        max_run=3600,
        hyperparameters={
            "objective": "binary",
            "metric": "binary_logloss",
            "num_leaves": 127,
            "learning_rate": 0.05,
            "feature_fraction": 0.85,
            "bagging_fraction": 0.85,
            "bagging_freq": 5,
            "num_boost_round": 800,
            "early_stopping_rounds": 50,
            "is_unbalance": "true",      # 1.2% spam baseline
            "lambda_l2": 0.1,
            "seed": 42,
        },
    )
    lgbm_train = TrainingStep(name="SpamLGBMTrain", estimator=lgbm_estimator, depends_on=[adv_aug])

    # Step 4b: Distilled student model fine-tuning (parallel to LGBM)
    student_estimator = HuggingFace(
        entry_point="train_student.py",
        source_dir="src/",
        instance_type="ml.g5.2xlarge",
        instance_count=1,
        role=role,
        transformers_version="4.36",
        pytorch_version="2.1",
        py_version="py310",
        use_spot_instances=True,
        max_wait=10800,
        max_run=7200,
        checkpoint_s3_uri="s3://manga-ml-checkpoints-apne1/spam/student/",
        checkpoint_local_path="/opt/ml/checkpoints",
        hyperparameters={
            "model_name_or_path": "spam-student-350m-base",
            "num_train_epochs": 3,
            "per_device_train_batch_size": 32,
            "learning_rate": 2e-5,
            "warmup_ratio": 0.1,
            "weight_decay": 0.01,
            "fp16": True,
            "load_best_model_at_end": True,
            "metric_for_best_model": "p_at_95r",
            "use_loss_weights": True,    # honors adversarial loss-weight column
            "seed": 42,
        },
    )
    student_train = TrainingStep(
        name="SpamStudentTrain",
        estimator=student_estimator,
        depends_on=[adv_aug],
    )

    # Step 5: Stacker (logistic regression on (lgbm_score, student_score, behavioral))
    stacker = ProcessingStep(
        name="SpamStackerFit",
        processor=SKLearnProcessor(
            framework_version="1.3-1",
            role=role,
            instance_type="ml.m5.2xlarge",
            instance_count=1,
        ),
        code="src/fit_stacker.py",
        depends_on=[lgbm_train, student_train],
    )

    # Step 6: Offline evaluation + per-attack-pattern + threshold calibration at P@95R
    eval_report = PropertyFile(
        name="SpamEvalReport", output_name="metrics", path="metrics.json"
    )
    offline_eval = ProcessingStep(
        name="SpamOfflineEval",
        processor=SKLearnProcessor(
            framework_version="1.3-1",
            role=role,
            instance_type="ml.g5.xlarge",
            instance_count=1,
        ),
        code="src/offline_eval.py",
        outputs=[ProcessingOutput(output_name="metrics", source="/opt/ml/processing/metrics")],
        property_files=[eval_report],
        depends_on=[stacker],
        job_arguments=[
            "--golden-set-uri", "s3://manga-ml-eval-apne1/spam/golden/v4.parquet",
            "--adversarial-set-uri", "s3://manga-ml-eval-apne1/spam/adversarial/v9.parquet",
            "--regression-set-uri", "s3://manga-ml-eval-apne1/spam/regression/all-time.parquet",
            "--counterfactual-set-uri", "s3://manga-ml-eval-apne1/spam/replay/last7d.parquet",
            "--baseline-pair-version", "47",
            "--target-recall", "0.95",
            "--max-fpr", "0.005",
        ],
    )

    # Stage 1 offline gate - reads p_at_95r out of the eval report via JsonGet
    promote_gate = ConditionStep(
        name="SpamOfflineGate",
        conditions=[
            ConditionGreaterThanOrEqualTo(
                left=JsonGet(
                    step_name=offline_eval.name,
                    property_file=eval_report,
                    json_path="p_at_95r",
                ),
                right=0.93,
            ),
        ],
        if_steps=[register_pair_step(lgbm_estimator, student_estimator)],  # project helper
        else_steps=[abort_step()],  # project helper
        depends_on=[offline_eval],
    )

    return Pipeline(
        name="SpamClassifierWeeklyRetrain",
        parameters=[label_version, feature_schema, adv_catalog_version],
        steps=[data_val, feature_mat, adv_aug, lgbm_train, student_train, stacker, offline_eval, promote_gate],
    )

4. The Stacked Ensemble (Architecture Detail)

The serving-time decision flow:

# stacked_inference.py
import joblib
import lightgbm as lgb
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# ReviewSubmission, SpamPrediction, TfidfVocab and url_feature_vec are
# project-internal types/helpers from the feature pipeline.

class StackedSpamClassifier:
    """L1 LightGBM + L2 distilled student + L3 stacker LR.

    Latency budget on the hot path:
      - LightGBM:           ~3 ms always
      - Student (cond):    ~110 ms only on uncertain band [0.2, 0.8]
      - Stacker LR:        <0.1 ms always
      - p95 contract:       <=200 ms (slow path included)
      - p50 expected:        ~35 ms (75% short-circuit on confident bands)
    """

    UNCERTAIN_LO = 0.20
    UNCERTAIN_HI = 0.80

    def __init__(self, lgbm_path: str, student_path: str, stacker_path: str, vocab_path: str):
        self.lgbm = lgb.Booster(model_file=lgbm_path)
        self.tokenizer = AutoTokenizer.from_pretrained(student_path)
        self.student = AutoModelForSequenceClassification.from_pretrained(student_path).eval()
        self.student.to("cuda").half()
        self.stacker = joblib.load(stacker_path)
        self.tfidf_vocab = TfidfVocab.load(vocab_path)

    def predict(self, review: ReviewSubmission) -> SpamPrediction:
        # L1 features: TF-IDF + URL + behavioral
        text_feats = self.tfidf_vocab.transform(review.text, lang=review.language)
        behavioral = np.array([
            review.account_age_days,
            review.post_velocity_24h,
            review.ip_entropy,
            review.email_throwaway_flag,
            review.prior_moderation_strikes,
            review.full_width_char_ratio,
        ])
        url_feats = url_feature_vec(review.text)
        l1_input = np.concatenate([text_feats, behavioral, url_feats])

        # L1: LightGBM fast pass
        l1_score = float(self.lgbm.predict(l1_input.reshape(1, -1))[0])

        # Short-circuit on confident bands
        if l1_score < self.UNCERTAIN_LO:
            final = float(self.stacker.predict_proba(
                np.array([[l1_score, 0.0, *behavioral]])
            )[0, 1])
            return SpamPrediction(score=final, l1=l1_score, l2=None, path="fast_legit")
        if l1_score > self.UNCERTAIN_HI:
            final = float(self.stacker.predict_proba(
                np.array([[l1_score, 1.0, *behavioral]])
            )[0, 1])
            return SpamPrediction(score=final, l1=l1_score, l2=None, path="fast_spam")

        # L2: distilled student on uncertain band
        with torch.no_grad():
            inputs = self.tokenizer(
                review.text, return_tensors="pt", truncation=True, max_length=512
            ).to("cuda")
            logits = self.student(**inputs).logits
            l2_score = float(torch.softmax(logits, dim=-1)[0, 1].cpu())

        # L3: stacker LR fuses L1 + L2 + behavioral
        stacker_input = np.array([[l1_score, l2_score, *behavioral]])
        final = float(self.stacker.predict_proba(stacker_input)[0, 1])
        return SpamPrediction(score=final, l1=l1_score, l2=l2_score, path="slow_uncertain")

The uncertain band [0.2, 0.8] is itself a tuned hyperparameter. Empirically ~25% of reviews fall in this band and get the slow-path treatment. Tightening to [0.3, 0.7] would drop slow-path coverage to ~15% (saving ~30% of student-model GPU cost) but raises FPR by 0.15 percentage points, pushing it outside the 0.5% contract. Tuning the band is part of the weekly retrain's threshold-calibration step; a sketch of what that step optimizes follows.
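A minimal sketch of the band search, assuming a labeled calibration set and a helper that re-runs the stack for a candidate band (evaluate_band is hypothetical, standing in for the calibration step's replay harness):

# band_tuning.py - illustrative grid search over (lo, hi), run inside Step 7
import numpy as np


def tune_band(evaluate_band, max_fpr: float = 0.005):
    """evaluate_band(lo, hi) -> (fpr, slow_path_pct) on the calibration set.

    Pick the cheapest band (lowest slow-path %) whose resulting FPR still
    meets the 0.5% contract."""
    best = None
    for lo in np.arange(0.10, 0.35, 0.05):
        for hi in np.arange(0.65, 0.95, 0.05):
            fpr, slow_pct = evaluate_band(float(lo), float(hi))
            if fpr <= max_fpr and (best is None or slow_pct < best[2]):
                best = (float(lo), float(hi), slow_pct)
    return best  # None means no candidate band met the FPR contract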

5. Threshold Calibration: P@95R, Not F1 (Asymmetric Cost)

The cost of a false positive (blocking a legitimate user's review) is dramatically higher than the cost of a false negative (letting one spam review through). A blocked legitimate review:

  • Frustrates a real customer who took the time to write feedback.
  • Triggers an appeal flow with human-moderator cost.
  • If repeated, leads to reviewer churn — the customer stops reviewing.
  • In aggregate, kills the legitimate-review flow that the entire ABSA + recommendation system depends on.

Whereas a missed spam review:

  • Pollutes ABSA training for one example out of ~80K/day.
  • Gets caught by the nightly batch transform within 24 hours (defense in depth).
  • Potentially gets caught by user reports → human moderator → manual suppression.

So instead of optimizing F1 (symmetric), the threshold is calibrated for Precision @ 95% Recall — pin recall at 0.95 (95% of true spam is caught at threshold τ), then maximize precision (P @ 95R ≥ 0.93 means no more than 7% of predicted spam are false positives). The FPR ≤ 0.5% contract is a separate, looser bound: at the contract spam baseline of 1.2%, P@95R = 0.93 implies FPR ≈ 0.09% (false positives = true positives × 0.07/0.93 ≈ 0.086% of all reviews), so a model that passes the precision gate sits well inside the FPR contract. The threshold τ is fit on a held-out calibration set and re-fit at every retrain.

# threshold_calibration.py
import numpy as np


class ThresholdNotAchievable(Exception):
    """No threshold reaches the target recall on the calibration set."""


def calibrate_threshold_at_recall(
    scores: np.ndarray,
    labels: np.ndarray,
    target_recall: float = 0.95,
) -> tuple[float, float, float]:
    """Pin recall at target, return (threshold, precision_at_target, fpr)."""
    sorted_idx = np.argsort(-scores)
    sorted_scores = scores[sorted_idx]
    sorted_labels = labels[sorted_idx]

    total_positives = sorted_labels.sum()
    cum_tp = np.cumsum(sorted_labels)
    cum_fp = np.cumsum(1 - sorted_labels)
    recalls = cum_tp / total_positives
    precisions = cum_tp / (cum_tp + cum_fp + 1e-9)

    # Find smallest threshold (largest k) where recall >= target
    eligible = recalls >= target_recall
    if not eligible.any():
        raise ThresholdNotAchievable(f"max recall = {recalls.max():.3f}")
    k = np.where(eligible)[0][0]
    threshold = float(sorted_scores[k])
    precision_at = float(precisions[k])
    fpr = float(cum_fp[k] / (labels == 0).sum())
    return threshold, precision_at, fpr

6. Why Online Learning Was Considered and Rejected

A continuous-online-update pipeline (gradient updates from each labeled example as it arrives) was evaluated and rejected. The reasoning:

  • Volume doesn't justify it. 80K reviews/day with ~70K labels/week = ~1.4% data-set turnover per week. Online learning's headline benefit is freshness on high-velocity signal; weekly retrain is sufficient to capture this turnover.
  • Adversarial drift is bursty, not continuous. Attackers don't ramp; they drop a new attack pattern overnight. Online learning can't out-react a weekly red-team adversarial-augmentation cycle, because the training-set augmentation is the latency bottleneck, not the gradient-update step.
  • Catastrophic forgetting risk is higher with online. A noisy implicit-label burst (e.g., a brigade of false flags from a competitor's user base) would shift the online model in seconds. With weekly retrain + 4-source aggregation, the brigade is visible in the disagreement-audit table and can be mitigated via source-weight tuning before it touches the model.
  • Rollback semantics break. SageMaker traffic-shift rollback assumes a discrete "previous version." Online learning has no clean rollback target; the previous version is whatever the model was 1 hour ago, and that may be already poisoned.
  • Operational cost is higher. A continuous-update pipeline needs its own feature-store online-write path, gradient-server, and audit log. The marginal benefit at this volume doesn't pay for the marginal complexity.

The decision is reviewed every 6 months. Re-evaluation triggers: review submission volume crosses 500K/day (6× current), or adversarial drift detection lag falls below 7 days for two quarters in a row (suggesting weekly is bottleneck).

7. Online Serving (Real-Time + Batch)

The real-time endpoint uses a two-instance hybrid: LightGBM on ml.m5.2xlarge (no GPU needed, fast TF-IDF + tree inference), student model on ml.g5.xlarge Multi-Model Endpoint (MME). The application gateway calls LightGBM first, decides to skip-or-call student based on the uncertain-band check, then calls the stacker LR (in-process, microseconds).

# endpoint_config.py
from sagemaker.model import Model
from sagemaker.model_monitor import DataCaptureConfig
from sagemaker.multidatamodel import MultiDataModel


def deploy_spam_endpoints(
    lgbm_pkg_arn: str, student_pkg_arn: str, stacker_pkg_arn: str,
    version: str, role: str,
) -> tuple[str, str]:
    # LightGBM endpoint - ml.m5.2xlarge, autoscale 2-8
    lgbm_endpoint = Model(
        model_data=lgbm_pkg_arn,
        role=role,
        env={
            "SM_MODEL_VERSION": version,
            "MAX_BATCH_SIZE": "16",
            "MAX_BATCH_DELAY_MS": "5",
        },
    ).deploy(
        initial_instance_count=2,
        instance_type="ml.m5.2xlarge",
        endpoint_name=f"spam-lgbm-{version}",
        data_capture_config=DataCaptureConfig(
            enable_capture=True,
            sampling_percentage=10,
            destination_s3_uri="s3://manga-ml-capture-apne1/spam/",
            capture_options=["REQUEST", "RESPONSE"],
        ),
    )

    # Student model on MME for the uncertain-band branch
    student_endpoint = MultiDataModel(
        name=f"spam-student-mme-{version}",
        model_data_prefix="s3://manga-ml-models-apne1/spam/student/",
        role=role,
    ).deploy(
        initial_instance_count=1,
        instance_type="ml.g5.xlarge",
        endpoint_name=f"spam-student-{version}",
    )
    return lgbm_endpoint.endpoint_name, student_endpoint.endpoint_name

The nightly batch transform runs at 02:00 JST and processes the rolling 30-day published-review window (~2.4M reviews):

# batch_transform.py
from datetime import datetime, timedelta

from sagemaker.transformer import Transformer

# compute_classification_diff, push_to_moderation_queue and
# publish_spam_cleaned_corpus are project-internal helpers.


def run_nightly_batch(version: str, date: datetime):
    transformer = Transformer(
        model_name=f"spam-stacked-batch-{version}",
        instance_count=4,
        instance_type="ml.g5.2xlarge",
        max_concurrent_transforms=8,
        max_payload=6,    # MB
        strategy="MultiRecord",
        output_path=f"s3://manga-ml-spam-batch-apne1/{date.strftime('%Y%m%d')}/",
    )
    input_uri = f"s3://manga-reviews-apne1/published/last30d/{date.strftime('%Y%m%d')}/"
    transformer.transform(input_uri, content_type="application/jsonlines", split_type="Line")
    transformer.wait()

    # Newly-flagged reviews (current model says spam, prior was legit)
    diff = compute_classification_diff(
        new_uri=transformer.output_path,
        prior_uri=f"s3://manga-ml-spam-batch-apne1/{(date - timedelta(days=1)).strftime('%Y%m%d')}/",
    )
    push_to_moderation_queue(diff.newly_flagged, reason="re-classified post-publish")
    publish_spam_cleaned_corpus(
        date=date,
        corpus_uri="s3://manga-reviews-clean-apne1/last30d/",
        # this is the input to US-MLE-03 ABSA's weekly retrain
    )

8. Drift Detection — Adversarial Drift as the Dominant Kind

The four standard drift kinds from primitive §6.1 (input, label, prediction, concept) all run, but the dominant one for this story is adversarial drift — a fifth kind specific to this story. Adversarial drift is detected as a sustained drop in per-attack-pattern recall on a continuously-refreshed evaluation set.

| Drift Kind | Detector | Cadence | Threshold |
|---|---|---|---|
| Input drift | PSI on each top-30 feature; KL on token distribution | 5 min | PSI > 0.2 sustained 24 h |
| Label drift | χ² on attack-pattern proportions vs reference | Daily | p < 0.01 sustained 7 d |
| Prediction drift | KS on confidence distribution; per-pattern call rate | 5 min | KS > 0.15 sustained 24 h |
| Concept drift | Rolling-holdout P@95R on most-recent-2-week labeled subset | Daily | Δ-P@95R > 0.03 sustained 7 d |
| Adversarial drift | Per-attack-pattern recall on red-team + threat-intel daily samples | Daily | Per-pattern recall < 0.85 for any currently-active pattern, 24 h |
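For reference, the input-drift row's PSI is the standard population-stability index over quantile bins of the reference distribution; a minimal implementation:

# psi.py - Population Stability Index, as used by the input-drift detector
import numpy as np


def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI over quantile bins of the reference window.

    Rule of thumb: < 0.1 stable, 0.1-0.2 watch, > 0.2 alarm (the table's threshold)."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf            # catch out-of-range drift
    ref_pct = np.histogram(reference, edges)[0] / len(reference)
    cur_pct = np.histogram(current, edges)[0] / len(current)
    ref_pct = np.clip(ref_pct, 1e-6, None)           # avoid log(0) on empty bins
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))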

The adversarial-drift signal is what makes this story's ≤ 7 day detection-lag contract achievable. The red team injects ~500 new attack patterns/week into a daily-refreshed evaluation set; if any single pattern's recall drops below 0.85, an alarm fires within 24 hours of first sighting. Combined with the heuristic-shipping path (same-day blocklist deploy, within 24 hours) and the weekly retrain cycle, the end-to-end "first sighting → classifier learns the pattern" loop closes in at most 7 days.

# adversarial_drift.py
from datetime import datetime

import numpy as np
import pandas as pd

from stacked_inference import StackedSpamClassifier

# Alarm and AdversarialCatalog are project-internal types. model.predict takes
# a ReviewSubmission; eval-set rows are assumed convertible via itertuples().


def compute_adversarial_drift(
    daily_eval_set: pd.DataFrame,
    model: StackedSpamClassifier,
    threshold: float,
) -> dict[str, float]:
    """Per-attack-pattern recall on the daily-refreshed eval set."""
    recalls = {}
    for pattern_id, group in daily_eval_set.groupby("attack_pattern_id"):
        if pattern_id is None or pattern_id == "legitimate":
            continue
        scores = np.array([model.predict(r).score for r in group.itertuples()])
        preds = scores >= threshold
        truths = (group["expected_label"] == "spam").values
        if truths.sum() == 0:
            continue
        recall = (preds & truths).sum() / truths.sum()
        recalls[pattern_id] = float(recall)
    return recalls


def evaluate_adversarial_drift_alarms(recalls: dict[str, float]) -> list[Alarm]:
    alarms = []
    for pattern_id, recall in recalls.items():
        if recall < 0.85:
            pattern_meta = AdversarialCatalog.get(pattern_id)
            alarms.append(Alarm(
                pattern_id=pattern_id,
                pattern_class=pattern_meta.attack_class,
                recall=recall,
                first_sighting_at=pattern_meta.first_sighting,
                age_days=(datetime.utcnow() - pattern_meta.first_sighting).days,
                severity="SEV-3" if pattern_meta.is_active else "SEV-4",
            ))
    return alarms

9. Retraining Trigger Logic

Three retrain triggers (one more than US-MLE-01):

  • Scheduled: EventBridge rule fires every Tuesday 03:00 JST. Triggers the pipeline with MaxLabelVersion = max(label_version) and AdversarialCatalogVersion = current.
  • Drift-triggered: drift hub publishes when concept_drift > 0.03 for 7d OR per-attack-pattern recall < 0.85 for any active pattern for 24h.
  • Adversarial-burst-triggered: red team + threat-intel pipeline publishes when >= 50 new attack patterns/24h are seen (a coordinated attack wave). This bypasses the weekly cadence.

All three go through the standard promotion-eligibility gate (no concurrent retrains, no global_ml_freeze), sketched below.
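A minimal sketch of that shared eligibility check, assuming the freeze flag lives at an SSM path parallel to the kill-switch flags documented later (the exact parameter name is an assumption):

# retrain_trigger.py - eligibility gate shared by all three triggers
import boto3

ssm = boto3.client("ssm", region_name="ap-northeast-1")
sm = boto3.client("sagemaker", region_name="ap-northeast-1")


def retrain_eligible() -> bool:
    # platform-wide freeze wins over everything (SSM path assumed)
    freeze = ssm.get_parameter(Name="/manga-ml/global_ml_freeze")["Parameter"]["Value"]
    if freeze == "true":
        return False
    # no concurrent retrains: refuse if any execution is still in flight
    summaries = sm.list_pipeline_executions(
        PipelineName="SpamClassifierWeeklyRetrain"
    )["PipelineExecutionSummaries"]
    return not any(s["PipelineExecutionStatus"] == "Executing" for s in summaries)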

10. Multilingual Handling (JP/EN-Specific)

Per-language vocabularies and TF-IDF. EN and JP have separate TF-IDF vocabs (50K each), joined into a sparse concatenated feature space. The language detector output is also a feature. The student model is multilingual (mBERT-distilled), so it natively handles JP/EN/mixed.

JP-specific spam patterns. Two JP attack classes that EN-only models miss:

  1. cosmetic_vendor_zenkaku — cosmetic-vendor spam written in 全角 (full-width) characters. The same English words rendered in full-width characters ("BUY NOW" becomes "ＢＵＹ　ＮＯＷ") bypass naive English regex blocklists. Mitigation: NFKC normalization in the feature pipeline collapses full-width to half-width before TF-IDF, and a full_width_char_ratio feature explicitly captures the original ratio (a high ratio is itself a soft signal).
  2. kanji_throwaway_account_farm — throwaway accounts with kanji name patterns from a discovered farm (random-kanji concatenation that no real Japanese name would have). Mitigation: a per-character-bigram model trained on real JP names assigns a name-naturalness score; below a threshold, the account is flagged as "suspect throwaway." A sketch of the scorer follows this list.
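A minimal sketch of that name-naturalness scorer, assuming an offline corpus of real JP account names (the class and its smoothing constant are illustrative):

# name_naturalness.py - character-bigram scorer for kanji_throwaway_account_farm
import math
from collections import Counter


class NameNaturalness:
    def __init__(self, real_names: list[str], alpha: float = 0.5):
        self.bigrams = Counter()
        self.unigrams = Counter()
        for name in real_names:
            padded = f"^{name}$"            # boundary markers
            self.unigrams.update(padded)
            self.bigrams.update(zip(padded, padded[1:]))
        self.alpha = alpha
        self.vocab = len(self.unigrams)

    def score(self, name: str) -> float:
        """Mean log-probability per bigram under add-alpha smoothing;
        random-kanji concatenations score far below real names."""
        padded = f"^{name}$"
        pairs = list(zip(padded, padded[1:]))
        logp = 0.0
        for a, b in pairs:
            num = self.bigrams[(a, b)] + self.alpha
            den = self.unigrams[a] + self.alpha * self.vocab
            logp += math.log(num / den)
        return logp / len(pairs)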

Per-language adversarial test sets. Promotion gates evaluate adversarial regression separately for EN and JP. A model that improves EN adversarial recall by +3% while regressing JP adversarial by -2% is rejected. This is the same pattern as US-MLE-01's per-language slice gate.

Per-language source weights. The SOURCE_WEIGHTS constants in the label aggregator are pinned globally, but the quarterly calibration job re-fits them per language — the JP vendor team has historically had higher κ than the EN team on subtle_shill (the JP team has more cultural context for shill patterns), so JP vendor labels are weighted slightly higher (0.88) than EN vendor labels (0.85).

11. PII Redaction

Review text is UGC and frequently contains PII (email addresses, phone numbers, order numbers, real names). Same redaction primitive as US-MLE-03:

  • Pre-feature-extraction redaction pass replaces detected PII with sentinel tokens (<EMAIL>, <PHONE>, <ORDER_ID>, <PERSON>); a sketch of this pass follows the list.
  • Redaction happens in the feature pipeline before the example is written to the feature store, so the feature store never holds raw PII.
  • Vendor labelers see redacted text only; if a vendor needs to escalate an example for true-PII context, the escalation goes through a separate access-controlled path (audit-logged).
  • The adversarial catalog's red-team examples carry a synthetic_pii=true flag (red-team-fabricated PII for realism) and are excluded from any cross-corpus reuse — same pattern as US-MLE-01's adversarial set.
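A minimal sketch of the sentinel-replacement pass, showing regex detectors for three of the four token types (the production pass also runs a NER model for <PERSON>, elided here; the order-id shape is an assumption):

# pii_redaction.py - sentinel replacement, applied before feature extraction
import re

PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\+?\d[\d\-\s]{8,}\d"), "<PHONE>"),
    (re.compile(r"\b\d{3}-\d{7}-\d{7}\b"), "<ORDER_ID>"),  # assumed order-id shape
]


def redact(text: str) -> str:
    """Runs before the example is written anywhere, so the feature store
    and vendor labelers never see raw PII."""
    for pattern, sentinel in PATTERNS:
        text = pattern.sub(sentinel, text)
    return text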

Monitoring & Metrics

| Category | Metric | Target | Alarm Threshold |
|---|---|---|---|
| Online — Latency | p50 inference (fast path) | ≤ 35 ms | > 60 ms 5 min |
| | p95 inference (slow path included) | ≤ 200 ms | > 280 ms 5 min |
| | p99 inference | ≤ 350 ms | > 500 ms 5 min |
| Online — Slow-path % | % requests routed to student model | ≤ 30% | > 40% 1 h (cost) / < 10% (precision risk) |
| Online — Throughput | Reviews/day | match traffic (~80K) | scale-out lag > 60 s |
| Quality — Headline | Precision @ 95% Recall (golden) | ≥ 0.93 | < 0.90 24 h |
| | False-positive rate (legit holdout) | ≤ 0.5% | > 0.7% 24 h |
| | Appeal rate per 1K suppressions | ≤ 30 | > 60 24 h |
| Quality — Per-attack-pattern | Recall per active pattern | ≥ 0.85 | < 0.80 24 h |
| Quality — Per-language | P@95R EN / JP | ≥ 0.93 each | < 0.90 24 h |
| Drift — Adversarial | Detection lag from first sighting | ≤ 7 days | > 10 days for any pattern |
| Drift — Standard | PSI per top-30 feature | < 0.2 | > 0.2 24 h |
| | Δ-P@95R vs reference | < 0.03 | > 0.03 7 d |
| Cost | $/1K inferences | ≤ $4.40 | > $6.00 24 h |
| | Training $/run (LGBM + student combined) | ≤ $35 | > $55 |
| Pipeline | Weekly retrain success rate | ≥ 95% | < 90% 30 d rolling |
| | Pipeline wall-clock | ≤ 6 h | > 8 h |
| Cross-story | Spam-cleaned corpus delivery to ABSA | by 09:30 JST Tue | missed by 1 h → SEV-4 |
| Source-aggregation | Per-source disagreement rate (vendor vs other 3) | < 12% | > 20% 7 d |

Risks & Mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| Adversary identifies the LightGBM uncertain band [0.2, 0.8] and crafts attacks scoring just outside it | Slow-path student model never fires; precision drops | Band is itself a moving threshold re-calibrated weekly; randomized band-jitter (±0.02) on a small sample to detect band-evasion; alarm fires if l1_score density at exactly 0.19 / 0.81 spikes |
| FPR explodes silently → legitimate-reviewer churn | Reviewer trust collapses; ABSA training corpus shrinks | Appeal-rate dashboard + per-cohort suppression-rate; deploy-time hard gate on FPR ≤ 0.5%; canary auto-abort on appeal-rate spike |
| New attack class emerges that's not in the taxonomy | Per-attack-pattern recall metrics don't catch it (the class doesn't exist yet) | Triage process: 24 h heuristic ship + 7 d catalog seed + next-week classifier coverage; threat-intel feed adds external classes the platform hasn't seen yet |
| Vendor IAA on JP subtle_shill drops below 0.75 | Subtle-shill class regresses on next retrain | Per-annotator running κ dashboards; reject-and-resend at vendor expense; per-batch κ gate refuses training if < 0.75 |
| LightGBM and student model promoted out of sync | Stacker LR fed mismatched score distributions; production prediction quality collapses | Coordinated promotion enforced at registry level; pair-version tags; deploy-time refusal if pair tags don't match |
| Catastrophic forgetting on rare adversarial classes | Old attack patterns succeed again after a retrain | Adversarial loss-weight = 3.0×; regression set retained forever; per-pattern recall on regression set is a hard gate |
| Online learning re-evaluation is forgotten | At higher volume, weekly retrain becomes insufficient | 6-month re-evaluation cadence; explicit triggers (500K/day or 2 quarters of detection-lag < 7 d) documented in registry metadata |
| Programmatic rule update silently changes label distribution | Training set drifts; model learns the rule rather than the underlying pattern | Disagreement-audit table; quarterly calibration of source weights; a programmatic-rule change is a Coordinated Change Request with sign-off |
| Indirect-injection attacks bypass spam classifier into LLM context | Cost / safety risk in downstream RAG | indirect_injection is a first-class attack class; coordination with GenAI guardrails team for paired threat-intel; defense-in-depth at retrieval too |
| Honeypot accounts get burned (spammers learn them) | Honeypot stops catching attackers | Honeypot accounts/titles rotate quarterly; rotation schedule is itself a secret artifact in registry metadata |
| Whitelist requests from marketing for "trusted publishers" | Compromised publisher account → undetected spam at scale | Blanket whitelist refused; instead publisher_verified is a soft positive feature; verified accounts still scored, just with a confidence boost |

Deep Dive — Why This Works at Amazon-Scale on the Manga Workload

Workload property 1: adversarial cadence beats most retrain cadences. Spam attackers adapt continuously; a defender retraining quarterly trails by 90 days. The chosen weekly retrain plus daily heuristic ship plus 24h adversarial-drift detection together create a defender cadence that beats the attacker for ~90% of attack patterns. The remaining ~10% (zero-day attacks before the catalog catches them) are mitigated by the re-classified post-publish batch loop — even if a payload slips through at write-time, it's caught within 24 hours. This three-cadence defense (heuristic / classifier / batch) is the structural answer to "the test set is a photograph of a world that no longer exists."

Workload property 2: asymmetric error costs require P@95R, not F1. The per-event cost of a false positive (blocking a real reviewer, with churn risk) is at least 10× the per-event cost of a false negative (one polluted review out of 80K/day, caught the next night by batch). F1 weighs both equally — wrong for this domain. P@95R explicitly says "first hit the recall floor, then minimize false positives," which matches the cost structure. This is the same lesson the Trust & Safety industry has learned across 20 years of spam classifier evolution; F1-tuned classifiers consistently over-block.

Workload property 3: bilingual JP/EN with JP-specific attack surface. The cosmetic_vendor_zenkaku and kanji_throwaway_account_farm attack classes are JP-specific and would be invisible to an EN-only model. A model that achieves global P@95R = 0.93 by sacrificing JP-only attack recall to 0.70 is not acceptable; the per-language adversarial gate prevents this. The bilingual store's 78% JP / 22% EN traffic mix means JP regression has 3× the impact of EN regression in raw-volume terms, doubly justifying the per-language gate.

Workload property 4: the four label sources are needed because each one alone is insufficient. Vendor labels are accurate but slow (5K/month, 24-72h latency); programmatic is fast but rule-set-biased; LLM-distilled is medium-quality and inherits LLM biases; implicit (user flags) is noisy and easily brigaded. Aggregating across all four with deterministic disagreement resolution captures the strengths of each while bounding the weakness of any one. The disagreement-audit table is the load-bearing piece — without it, source biases compound silently. This is the only story on the platform using all four sources, and it does so because no single source can keep up with the adversarial cadence on its own.

Workload property 5: stacked ensemble reflects the true cost-quality frontier. A pure-LightGBM model is fast and cheap but misses the long-tail of subtle-shill and AI-generated reviews; a pure-student model catches them but costs ~30× more per inference. The stacked ensemble routes ~75% of reviews through the fast path (LightGBM-confident) and only the ~25% uncertain band through the slow path (student). On the contracted volume this saves ~$1,800/month vs all-slow-path while preserving the precision contract. This is the same "tiered inference" pattern from the Cost-Optimization stories, applied to a binary classifier.

These five properties together explain the per-attack-pattern + per-language + 4-source + stacked + multi-cadence machinery. A naïve "DistilBERT, weekly retrain, F1 threshold" would silently regress on JP attack classes, over-block legitimate reviewers, and trail the adversary by ~30 days.


Real-World Validation

Industry analogues. Meta's review-spam classifier (per their 2024 Trust & Safety blog) uses a similar two-stage architecture (fast feature-based first-pass + slow text-classifier second-pass), with reported P@95R in the 0.91–0.94 range — comparable to this story's 0.93 target. Yelp's review-fraud team has published on their behavioral graph detector (analogous to this story's upvote_ring class), with cadence-separated heuristic + classifier defense matching this story's structure. Google's Safe Browsing team uses a four-source label aggregation (vendor + telemetry + heuristic + ML-distilled) with disagreement-audit tables; their published methodology informed this story's aggregate_labels rules. Amazon's internal Buyer Risk team uses P@95R rather than F1 on the analogous fake-review classifier — direct precedent.

Math validation — cost. LightGBM endpoint on ml.m5.2xlarge at $0.461/hr × 4 instances avg × 24h × 30d = $1,328/month. Student model endpoint on ml.g5.xlarge at $1.408/hr × 2 instances avg (firing on the 25% uncertain band of 80K reviews/day, ≈ 20K/day, ≈ 0.23 RPS sustained, sized for 4× headroom) × 24h × 30d = $2,028/month. Combined serving: $3,356/month. Per-inference: $3,356 / (80K × 30) = $0.0014 per review, or $1.40 per 1K — under the $4.40/1K contract by ~3×. Training cost: LightGBM ~$8/run + student ~$22/run = $30/run × 4 runs/month = $120/month, immaterial against serving.
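The same arithmetic as a runnable check (the hourly prices and instance counts are the figures quoted above):

# cost_check.py - re-derivation of the serving-cost figures
m5_2xl_hr, g5_xl_hr = 0.461, 1.408       # on-demand $/hr, ap-northeast-1
hours = 24 * 30
lgbm = m5_2xl_hr * 4 * hours              # ≈ $1,328/month
student = g5_xl_hr * 2 * hours            # ≈ $2,028/month
monthly_reviews = 80_000 * 30             # 2.4M
per_1k = (lgbm + student) / monthly_reviews * 1000
print(f"${lgbm + student:,.0f}/month -> ${per_1k:.2f} per 1K reviews")  # ≈ $1.40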

Math validation — latency. LightGBM single-prediction on m5.2xlarge measured at ~3 ms (TF-IDF transform 1.5 ms + tree inference 0.8 ms + behavioral assembly 0.5 ms + serialization 0.2 ms). Student model on g5.xlarge with FP16 batch-1 measured at ~110 ms (tokenization 4 ms + inference 95 ms + post 11 ms). Stacker LR ~0.05 ms. Add ~5 ms for SageMaker request overhead and ~3 ms for application gateway. Fast-path total ≈ 11 ms; slow-path total ≈ 121 ms. The 75/25-weighted mean is ≈ 39 ms, but the p50 itself lands on the fast path at ~11 ms, since 75% of requests never touch the student model; the measured p95 ≈ 195 ms is governed by the slow-path tail. Within the 35 ms p50 and 200 ms p95 contracts.
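A quick simulation of why the median sits on the fast path even though the weighted mean is ~39 ms (the spreads are illustrative; only the two path means come from the measurements above):

# latency_mix.py - p50 of a 75/25 bimodal mixture is not its weighted mean
import numpy as np

rng = np.random.default_rng(42)
fast = rng.normal(11, 2, 75_000)    # fast path, ~11 ms
slow = rng.normal(121, 15, 25_000)  # slow path, ~121 ms
mix = np.concatenate([fast, slow])
print(f"mean ≈ {mix.mean():.0f} ms, p50 ≈ {np.percentile(mix, 50):.0f} ms")
# mean ≈ 39 ms, but p50 ≈ 12 ms: the median request never touches the student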

Math validation — label volume. 7 attack classes × 150 examples/class minimum for JP, plus 7 × 150 for EN, = 2,100 minimum examples per retrain to satisfy data validation. Vendor produces ~5K/month = ~1,150/week (split EN/JP roughly 50/50, so ~575 per language). Programmatic produces ~50K/week (heavily skewed to common patterns; ~30K spam, ~20K legit). The adversarial catalog injects ~500/week new examples (red-team), retained forever in the regression set. Cumulative training set after 4 weeks: vendor 4,600 + programmatic 200K (sampled to 30K) + adversarial 2,000 + LLM-distilled on the uncertain band 8K = ~44K per retrain, more than satisfying the 2,100 floor and providing per-attack-class coverage.

Math validation — adversarial drift detection lag. The red team produces ~500 new patterns/week, injected into the daily-refresh eval set. A new pattern that appears in production is detected when it appears in the daily eval set (1-day delay) AND its recall drops below 0.85 (which depends on how evasive the pattern is). For a clearly novel pattern, detection lag is 1-2 days. For a subtle variation of an existing pattern, the eval-set recall metric may not separate it from the parent class until the red team flags it as a separate pattern (1-7 days). The 7-day worst case is the contract.


Cross-Story Interactions

| Edge | Direction | Contract | Conflict mode |
|---|---|---|---|
| US-MLE-07 → US-MLE-03 (ABSA) | provides spam-cleaned 30-day review corpus | Tuesday 09:30 JST delivery so ABSA Wednesday retrain reads clean window | If US-MLE-07 batch misses the 09:30 deadline, ABSA falls back to last week's cleaned corpus and re-runs on Friday. Documented in Runbooks/spam-batch-late.md. |
| US-MLE-07 → Trust & Safety Dashboard | provides per-attack-pattern metrics | per-pattern recall + FPR + appeal rate published every 5 min | If T&S dashboard is unavailable, metrics buffer in CloudWatch for 48 h. |
| US-MLE-07 → US-MLE-06 (Recommendation, indirect) | spam-cleaned reviews flow into ABSA aspect signal, which feeds rec | transitive; pinned via US-MLE-03's promotion-version stamp | Same as the US-MLE-03 → US-MLE-06 edge in that story. |
| US-MLE-07 ↔ US-MLE-01 (Intent Classifier) | shared adversarial test set | both stories' adversarial sets are cross-checked for prompt-injection examples that span intent and spam | When US-MLE-01 augments its adversarial set, those examples are reviewed for indirect_injection overlap and added here too (and vice versa). |
| US-MLE-07 ← Threat-Intel Feeds | external feeds publish spam patterns | feeds drop into the adversarial catalog; class owner triages | If a feed is compromised (poisoned patterns), source weight on that feed is dropped to 0 and the feed is quarantined. |
| US-MLE-07 ↔ GenAI Guardrails Team | indirect_injection class is shared | spam classifier catches at write-time; GenAI prompt-shield catches at retrieval-time. Defense in depth. | Coordinated quarterly red-team; shared payload-pattern catalog. Reference: Ground-Truth-Evolution/ML-Scenarios/04-spam-review-adversarial-evolution.md. |
| US-MLE-07 → Cost-Optimization US-04 (compute) | influences serving cost via slow-path % | slow-path % is itself a tunable; FinOps participates in band-tuning | A FinOps cost-driven band-narrowing must not violate the 0.5% FPR contract. Coordinated change request. |
| US-MLE-07 ← US-MLE-05 (Embedding Adapter) | indirect: student model's text encoder updates when shared mBERT base updates | when US-MLE-05 promotes a new embedding base, this story's student-model fine-tune is re-run | Drift hub catches this; coordinated retrain. |

Rollback & Experimentation

Shadow Mode Plan

  • Duration: 24 hours minimum, 48 hours if a major attack class was added in the last 7 days (extra time to validate adversarial coverage on real traffic patterns).
  • Sample size: ~80K reviews × 1 day = 80K comparisons; statistical power adequate for detecting 1% prediction-disagreement shift on the ~1.2% spam baseline.
  • Pass criteria: per-request prediction agreement with v47 between 92% and 99% (high floor because spam is rare; v48 should agree with v47 on most legit reviews and disagree only on the spam tail). Latency p99 ≤ 1.2× v47.
  • Slice criteria: agreement on each language slice within ±3 percentage points of the global rate. A model that agrees 96% globally but only 88% on JP fails shadow.
  • Adversarial criteria: no per-attack-pattern recall regression > 5% absolute on the daily-refresh eval set during shadow.

Canary Thresholds

  • Phase A (1% traffic, 7 days): live FPR ≤ 0.5% (measured via appeal rate); per-attack-pattern recall on inflight red-team injections ≥ baseline; auto-abort on the four canary-daemon conditions.
  • Phase B (5% traffic, 3 days): same as Phase A plus per-cohort suppression-rate regression ≤ 0.10 (catches per-cohort over-blocking).
  • Phase C (25% traffic, 7 days): same as Phase B plus cost metric within ±15% of baseline (slow-path % must not balloon).

Kill-Switch Flags

  • spam_classifier_promotion_enabled (default: false; SSM /manga-ml/spam/promotion_enabled) — when false, weekly trigger logs and exits.
  • spam_classifier_canary_pause (default: false) — halts traffic shift at current canary stage.
  • spam_classifier_slow_path_enabled (default: true) — when false, all reviews short-circuit through LightGBM-only (used as an emergency cost-cut during incident response; trades precision for cost).
  • global_ml_freeze (default: false) — overrides the above; applies to all 8 stories.

Quality Regression Criteria (Hard Rollback)

A canary that satisfies any one of these conditions is automatically rolled back:

  • FPR exceeds 0.7% for any 1-hour window (2σ above the 0.5% contract).
  • Appeal rate per 1K suppressions exceeds 80 for any 1-hour window.
  • Per-attack-pattern recall on any active pattern regresses by > 0.05 absolute for any 24-hour window.
  • p99 latency exceeds 500 ms for any 5-minute window.
  • Slow-path % exceeds 50% for any 1-hour window (would explode student-model cost and signal that LightGBM has lost calibration).
  • Customer-reported incident on a published review that appears traceable to a misclassification (manual override by SRE on-call).

The rollback is via SageMaker traffic shift to v47 (SLA: 60 seconds), reverting both LightGBM and student endpoints atomically; the gateway pins to the pair-version tag and refuses to serve a mismatched pair.
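A sketch of that traffic-shift rollback, assuming both production endpoints host the prior and candidate variants side by side (endpoint and variant names are illustrative):

# rollback_pair.py - illustrative <=60 s rollback of the coordinated pair
import boto3

sm = boto3.client("sagemaker", region_name="ap-northeast-1")


def rollback_pair(endpoints=("spam-lgbm-prod", "spam-student-prod"),
                  prior="pair-v47", candidate="pair-v48"):
    """Shift 100% of traffic on both endpoints back to the prior pair.
    A failed first shift raises before the second endpoint is touched,
    and the gateway's pair-tag check refuses to serve a half-shifted pair."""
    for endpoint in endpoints:
        sm.update_endpoint_weights_and_capacities(
            EndpointName=endpoint,
            DesiredWeightsAndCapacities=[
                {"VariantName": prior, "DesiredWeight": 1.0},
                {"VariantName": candidate, "DesiredWeight": 0.0},
            ],
        )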


Multi-Reviewer Validation Findings & Resolutions

S1 — Must Fix Before Production

ML Scientist lens: The stacker LR is fit on the same training-and-validation split as the L1/L2 base learners, which leaks predictions back into the stacker through the features. Resolution: stacker is fit on a held-out stacker-fold that neither L1 nor L2 saw during training; this is the standard out-of-fold-prediction stacking pattern. fit_stacker.py uses 5-fold OOF predictions on training data plus the unseen validation fold to fit the stacker.
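A condensed sketch of that OOF fit, assuming both base learners expose an sklearn-compatible predict_proba (the student is wrapped in an adapter for this purpose; parameter names are illustrative):

# fit_stacker.py (OOF sketch) - out-of-fold base-learner predictions so the
# stacker never sees scores produced on data the base models trained on
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict


def fit_stacker_oof(lgbm_clf, student_clf, X_tab, X_text, y, behavioral):
    # 5-fold OOF: each row's score comes from a model that did not train on it
    l1_oof = cross_val_predict(lgbm_clf, X_tab, y, cv=5, method="predict_proba")[:, 1]
    l2_oof = cross_val_predict(student_clf, X_text, y, cv=5, method="predict_proba")[:, 1]
    stacker_X = np.column_stack([l1_oof, l2_oof, behavioral])
    return LogisticRegression(max_iter=1000).fit(stacker_X, y)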

Trust & Safety Lead lens: The adversarial loss-weight of 3.0× was set heuristically; without ablation, it could be over- or under-tuning. Resolution: weekly retrain logs the loss-weight as a hyperparameter; quarterly ablation runs at {1.0, 2.0, 3.0, 5.0} weights and the calibration job picks the weight that maximizes per-attack-pattern recall while keeping P@95R ≥ 0.93.

Application Security / Privacy lens: Vendor-provided adversarial examples may inadvertently contain real PII from past attack instances. Resolution: adversarial catalog ingest pipeline runs PII redaction before storage; entries flagged pii_redacted=true. Quarterly audit verifies no raw PII has leaked into the catalog.

SRE lens: Coordinated promotion of two model artifacts increases blast radius if either artifact fails to load on canary instance startup. Resolution: pre-warm hooks load both models before traffic shift; failure to load either triggers immediate rollback to v47; documented in Runbooks/spam-coordinated-promotion.md.

S2 — Address Before Scale

Data Engineering lens: The 4-source disagreement-audit table grows ~10K rows/week and has no retention policy. Resolution: scheduled S3 lifecycle policy: 90 days hot, 12 months glacier with Runbooks/spam-audit-glacier-restore.md documented for the calibration job.

FinOps lens: At peak, slow-path % can spike to 35% during attack waves (LightGBM uncertain on novel patterns), which doubles student-model cost during precisely the moments when the system is most needed. Resolution: budget for attack-wave bursts (~$500/month additional cap); slow-path-% > 40% for 1h triggers a SEV-4 review (not auto-rollback because the precision is more important during attacks).

ML Scientist lens (S2): Offline-online correlation has been measured at 0.74 over the last 6 retrains, comfortably above the 0.6 trustworthy floor. The correlation is computed against live FPR (28d) as the online metric. As the adversarial mix evolves, recalibrate. Resolution: quarterly calibration job; results land in registry metadata.

Trust & Safety Lead lens (S2): Per-attack-pattern owner mapping is currently informal; a class without an owner can have its freshness SLA missed silently. Resolution: every attack class in the catalog manifest carries an owner field; the daily freshness check pages the owner if SLA is missed.

S3 — Future Work

Principal Architect lens: The two-model ensemble (LightGBM + student) is operationally heavier than a single model. As the student model continues to improve via distillation, a v2 backlog item is to evaluate whether a single calibrated student can replace the ensemble at lower latency. Tracked; revisit if student p95 falls below 50 ms via better quantization.

ML Scientist lens (S3): Online learning re-evaluation in 6 months. Triggers documented; explicit calendar check.

SRE lens (S3): Multi-region active-active. Currently ap-northeast-1 only; an EU region extension would require cross-region adversarial-catalog replication and per-region threat-intel feeds. Tracked separately under data-residency expansion roadmap.

Trust & Safety Lead lens (S3): Pivot away from style-based AI-generated detection as legitimate users adopt AI writing assistants. The current ai_generated_review class will become a noisier signal over the next 12-24 months; the architecture's graceful-demotion path (style features → behavioral features → identity features) is documented in Ground-Truth-Evolution/ML-Scenarios/04-spam-review-adversarial-evolution.md and tracked as a v3 backlog item.