
ML Scenario 06 — ABSA Aspect Emergence

TL;DR

The ABSA (aspect-based sentiment analysis) model behind the Review-Sentiment MCP extracts (aspect_span, polarity) pairs from each review across a fixed taxonomy: art, story, pacing, characters, translation, value, print_quality. Eighteen months in, the aspect schema is silently incomplete. New aspects users care about are appearing in reviews — vertical_scroll_ux (webtoons), panel_layout (manhwa), ai_art_authenticity, chapter_release_consistency, binding_durability (collectors' editions), creator_treatment (publisher controversies). The model finds none of them — its taxonomy doesn't include them. Reviews that center on these new aspects produce empty or wrong extractions, and the bot's "what readers say about X" summary misses what readers actually say. Aggregate aspect-extraction F1 is unchanged at 0.83 because old aspects still dominate volume; per-aspect F1 on the new aspects is undefined (the labels don't exist). The fix: treat the aspect schema as a living artifact, run an aspect-discovery loop that promotes recurring novel-aspect candidates into the taxonomy, version the eval and training data against the schema, and add a graceful "uncovered-aspect" path that surfaces the gap rather than hiding it.

Context & Trigger

  • Axis of change: Requirements (the product surface area expanded — webtoons, AI-art controversies, collector editions, social/creator-treatment context — and the original taxonomy didn't grow with it).
  • Subsystem affected: RAG-MCP-Integration/04-review-sentiment-mcp.md — the ABSA model whose aspect breakdown drives the bot's review-summary feature. The polarity classifier in scenario ML-02 is upstream; this scenario is downstream of polarity.
  • Trigger event: A content-ops audit notices that for several webtoon and manhwa titles, the bot's review summary reads "readers praised the art and story" while the reviews themselves overwhelmingly discuss vertical scrolling, panel layout, and translation quality. The aspect taxonomy was last updated 18 months ago.

The Old Ground Truth

Original setup:

  • Aspect taxonomy (7 classes) defined by content ops and ABSA modeling team in Y0.
  • 80K labeled review-aspect spans annotated against the taxonomy.
  • Model: sequence-tagger (BERT-CRF) producing aspect spans + polarity per span.
  • Eval: macro-F1 on aspect-span detection + per-class polarity accuracy.
  • Retrain cadence: annual.
  • Reasonable assumptions:
      • The aspect taxonomy is a stable list — readers care about the same things over time.
      • Macro-F1 on the held-out set tracks production performance.
      • New aspects, if they emerge, can be added at the next scheduled retrain.

What this gets wrong: aspects emerge organically as the product expands and as the cultural conversation around manga shifts; held-out F1 is computed against a frozen taxonomy and cannot detect missing-aspect failures by construction; "we'll add it next year" is an admission that the system runs blind on new aspects in the meantime.

The New Reality

  1. The aspect schema is alive. The catalog now includes manhwa (vertical scroll), manhua (a different art tradition), webtoons (chapter-by-chapter, color); each format brings genuinely new aspects that don't fit the old taxonomy.
  2. Some old aspects are deprecated. "Print quality" used to be central; in a digital-first audience, it applies to a shrinking minority. Aspect importance shifts.
  3. External events create new aspects. AI-generated art controversies created a new aspect, ai_art_authenticity, in 2024 that didn't exist before. Publisher controversies (mangaka treatment) created creator_treatment. These aren't predictable from pure usage patterns.
  4. The same word can mean different aspects in different contexts. "Pacing" in shounen reviews is plot momentum; "pacing" in webtoon reviews is chapter release cadence. Schema-blind extraction conflates them.
  5. Held-out F1 is structurally blind. A review where 70% of the content is about an out-of-taxonomy aspect (vertical scroll UX) carries an empty gold label ([]), so the model's empty extraction scores as correct rather than as the error it is.
  6. The user-facing summary is downstream-broken. "What readers say" is computed by aspect-aggregation; missing aspects → wrong summary even when the polarity layer is correct.

Why Naive Approaches Fail

  • "Add the new aspects to the taxonomy and retrain." Reactive only; identifies known-missing aspects, doesn't surface unknown-missing.
  • "Use a generic 'other' bucket." Helps detect that something exists but doesn't tell you what it is. The bot can't summarize "other = bad" usefully.
  • "Just use an LLM for aspect extraction." LLMs do better at open-vocabulary aspect extraction but are expensive at scale, unstable across versions (GenAI 04 again), and the unstructured output makes downstream aggregation messy.
  • "Trust user reports." Sparse, and users won't tell you "your taxonomy is incomplete."
  • "Cluster reviews and call clusters aspects." Gives candidates but no human-meaningful labels; aspects need to be operationally usable and surface-able.
  • "Wait for the annual schema review." That's how the gap got to 18 months in the first place.

Detection — How You Notice the Shift

Online signals.

  • Empty-extraction rate. Reviews where the model returns no aspect spans. If the rate climbs, the model is missing things.
  • Bot-summary mismatch rate. Content-ops samples summaries vs raw reviews; if the summary misses topics that dominate reviews, the taxonomy is incomplete.
  • Aspect-share over time. Per-aspect share of all extractions. If the dominant aspects haven't changed in 18 months while review volume grows on new content types, the taxonomy is frozen relative to the input.
  • User feedback on summaries. "The summary missed the main complaint." High-signal but slow.

Offline signals.

  • Open-coding pilot. A small (~ 100-review) human pilot where annotators tag aspects without a taxonomy constraint. The aspects they invent reveal the gap.
  • LLM-aided aspect discovery. An LLM extracts aspect candidates open-vocabulary; cluster the outputs; surface high-volume candidates to content ops for taxonomy review.
  • Per-format breakdown. What fraction of webtoon / manhwa / manhua reviews extract zero aspects? Compared to manga reviews. Format-specific gaps are the easiest to spot.

Distribution signals.

  • Token frequency in unmapped review portions. When the model extracts no spans, what tokens dominate the rest of the review? Recurring tokens (e.g., "scroll," "panel," "binding") are aspect candidates (a monitoring sketch follows this list).
  • Sentiment / polarity coverage gap. If overall review polarity is computable but few aspects are extracted, the model is missing the substance.
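
A minimal monitoring sketch combining two of these signals: per-format empty-extraction rate and the tokens that dominate reviews the model could not map. The record shape (format, text, extracted_spans) and the stop-word list are illustrative assumptions, not the production schema.

from collections import Counter, defaultdict
import re

STOPWORDS = {"the", "a", "and", "is", "it", "this", "of", "to", "in", "i", "was", "but"}  # illustrative

def detection_signals(reviews):
    """reviews: iterable of dicts with 'format', 'text', 'extracted_spans' (assumed shape)."""
    empty_by_format = defaultdict(lambda: [0, 0])   # format -> [empty_count, total_count]
    unmapped_tokens = Counter()

    for r in reviews:
        fmt = r["format"]
        empty_by_format[fmt][1] += 1
        if not r["extracted_spans"]:
            empty_by_format[fmt][0] += 1
            # Tokens from reviews the model mapped to nothing are aspect candidates.
            for tok in re.findall(r"[a-z']+", r["text"].lower()):
                if tok not in STOPWORDS:
                    unmapped_tokens[tok] += 1

    empty_rate = {f: empty / total for f, (empty, total) in empty_by_format.items()}
    return empty_rate, unmapped_tokens.most_common(20)

A webtoon empty-extraction rate several times the manga rate, paired with recurring tokens like "scroll" or "panel", is the signature of a missing aspect.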

Architecture / Implementation Deep Dive

flowchart TB
    subgraph Disco["Aspect-discovery loop"]
        OPEN["Open-coding pilots (quarterly)"]
        LLM["LLM aspect extraction<br/>(open-vocab, sampled)"]
        CLUST["Cluster + dedupe candidates"]
        OPS["Content-ops review<br/>+ label conventions"]
    end

    subgraph Schema["Versioned taxonomy"]
        TAX["taxonomy.yaml<br/>(versioned, signed off)"]
        DEPR["Deprecated aspects retained"]
        NEW["New aspects with effective_from"]
    end

    subgraph Labels["Schema-versioned labels"]
        LBL["Per-review labels keyed on<br/>(taxonomy_version, review_id)"]
        REGRESS["Regression set (old + new)"]
    end

    subgraph Train["Training"]
        TR["BERT-CRF tagger<br/>+ schema-aware decoder"]
        F1["Per-aspect F1<br/>(blocking)"]
        UNCOV["Uncovered-aspect rate<br/>(blocking)"]
    end

    subgraph Serve["Serving"]
        TAG["Aspect tagger"]
        UNK["Uncovered-aspect path<br/>('readers also discuss X')"]
        SUM["Bot summary"]
    end

    OPEN --> CLUST
    LLM --> CLUST
    CLUST --> OPS
    OPS --> TAX
    TAX --> NEW
    TAX --> DEPR
    TAX --> LBL
    LBL --> TR
    REGRESS --> TR
    TR --> F1
    TR --> UNCOV
    F1 -->|gate| TAG
    UNCOV -->|gate| TAG
    TAG --> UNK
    TAG --> SUM
    UNK --> SUM

    style TAX fill:#fde68a,stroke:#92400e,color:#111
    style NEW fill:#dbeafe,stroke:#1e40af,color:#111
    style UNCOV fill:#fee2e2,stroke:#991b1b,color:#111
    style UNK fill:#dcfce7,stroke:#166534,color:#111

1. Data layer — schema-versioned everything

The taxonomy is versioned:

taxonomy:
  version: 7
  effective_from: 2026-04-01
  aspects:
    - id: art
      label: "Art / illustration"
      example_phrases: ["the art is gorgeous", "panels are clean"]
      synonyms: ["illustration", "drawing"]
      effective_from: 2024-01-01
    - id: vertical_scroll_ux
      label: "Vertical scroll experience"
      formats: ["webtoon", "manhwa"]
      effective_from: 2026-04-01
      promoted_from_candidate: "scroll_quality_2026Q1"
    - id: print_quality
      effective_from: 2024-01-01
      effective_to: null   # still active but importance declining
    # ...

Labels are keyed on (taxonomy_version, review_id, span). When the taxonomy adds an aspect (v6 → v7), labels under v6 may be re-labeled or carried forward unchanged. The label for a webtoon review under v6 (no vertical_scroll_ux) is not wrong under v6; it's just under a smaller schema. This versioning prevents historical labels from being silently invalidated.
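
A sketch of what a schema-versioned label record could look like; the field names and the validity check are illustrative, not the actual label-store schema.

from dataclasses import dataclass

@dataclass(frozen=True)
class AspectLabel:
    taxonomy_version: int   # schema version the annotator worked under
    review_id: str
    span_start: int
    span_end: int
    aspect_id: str          # must exist in the taxonomy at taxonomy_version
    polarity: str           # "pos" | "neg" | "neu"

def is_valid(label, taxonomies):
    """A label is judged against the schema it was produced under, never a later one."""
    schema_aspects = taxonomies[label.taxonomy_version]   # assumed: version -> set of aspect ids
    return label.aspect_id in schema_aspects

Under this keying, a v6 webtoon review with no vertical_scroll_ux span stays valid as a v6 label; it simply never enters the v7 evaluation pool for that aspect.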

2. Pipeline layer — aspect discovery as a recurring process

A quarterly cycle:

  1. Sample 1000 reviews across formats / titles / regions.
  2. Open-coding pilot: 2 human annotators independently tag aspects without taxonomy constraint. Aggregated, deduplicated.
  3. LLM-aided discovery: an LLM with a focused prompt extracts aspect candidates from a larger sample (10K reviews). Output clustered.
  4. Candidate review: content ops + ABSA team review candidates. Promote to taxonomy if (a) coverage is consistent (recurring across many reviews); (b) operationally meaningful (the bot can summarize it); (c) distinct from existing aspects.
  5. Schema bump: taxonomy version increments. Re-label sample under new schema. Re-train.

The LLM is a candidate generator, not a labeler — humans decide what goes into the taxonomy. This avoids the "LLM hallucinates an aspect that doesn't really exist" failure.
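
One way the cluster-and-dedupe step might look: embed the LLM-extracted candidate phrases, group near-duplicates, and keep only clusters that recur across enough distinct reviews. The sentence-transformers model, the distance threshold, and the minimum-review cutoff are illustrative assumptions.

from collections import defaultdict
from sentence_transformers import SentenceTransformer   # assumed available in the discovery pipeline
from sklearn.cluster import AgglomerativeClustering

def cluster_candidates(candidates, min_reviews=25):
    """candidates: list of (phrase, review_id) pairs from the open-vocabulary LLM pass."""
    phrases = sorted({p for p, _ in candidates})
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(phrases, normalize_embeddings=True)
    clusterer = AgglomerativeClustering(
        n_clusters=None, distance_threshold=0.35, metric="cosine", linkage="average"
    )
    cluster_ids = clusterer.fit_predict(embeddings)

    phrase_to_cluster = dict(zip(phrases, cluster_ids))
    reviews_per_cluster = defaultdict(set)
    for phrase, review_id in candidates:
        reviews_per_cluster[phrase_to_cluster[phrase]].add(review_id)

    # Only clusters recurring across many distinct reviews go to content ops for promotion review.
    return [
        {
            "phrases": [p for p in phrases if phrase_to_cluster[p] == cid],
            "review_count": len(review_ids),
        }
        for cid, review_ids in reviews_per_cluster.items()
        if len(review_ids) >= min_reviews
    ]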

3. Serving layer — uncovered-aspect path

A small but critical addition: the model must surface that something is missing, not silently drop it.

def extract_aspects_with_coverage(review):
    spans = tagger.predict(review)
    # Rough proxy for how much of the review the current taxonomy explains:
    # characters inside extracted spans over total review length.
    covered_chars = sum(len(s.text) for s in spans)
    coverage_ratio = covered_chars / max(len(review.text), 1)
    if coverage_ratio < 0.3:
        # Model didn't cover much of the review — likely a new-aspect review
        candidates = lightweight_candidate_extractor(review)
        return spans, UncoveredHint(candidate_phrases=candidates)
    return spans, None

The bot's summary then handles UncoveredHint:

"Most readers praise the story and characters. Some discussed [vertical scrolling, panel layout] — these are topics our analysis is still learning to summarize."

This surfaces the gap to both the user and the team. The "still learning" copy is honest; pretending the system fully understands when it doesn't is a confidence trap.
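
A sketch of how the summary layer might consume UncoveredHint; aggregate_aspects, render_summary, and MIN_MENTIONS are placeholders, and the copy follows the example above.

from collections import Counter

MIN_MENTIONS = 5   # placeholder: minimum distinct mentions before an uncovered phrase is surfaced

def build_summary(reviews):
    tagged = [extract_aspects_with_coverage(r) for r in reviews]
    summary = render_summary(aggregate_aspects([spans for spans, _ in tagged]))   # placeholder helpers

    # Collect candidate phrases from reviews the tagger largely failed to cover.
    uncovered = Counter()
    for _, hint in tagged:
        if hint is not None:
            uncovered.update(hint.candidate_phrases)

    frequent = [phrase for phrase, count in uncovered.most_common(3) if count >= MIN_MENTIONS]
    if frequent:
        summary += (" Some readers also discussed " + ", ".join(frequent)
                    + ". These are topics our analysis is still learning to summarize.")
    return summary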

4. Governance — taxonomy as a product artifact

The aspect taxonomy is a product artifact, not a model artifact. Three governance pieces:

  • Cross-functional ownership. Content ops + ABSA modeling + product. Changes go through a small review board, not a single team's decision.
  • Audit log of taxonomy changes. Every version bump records what was added/deprecated, why, with examples. Useful for content moderation, regulatory questions, and historical model interpretation.
  • Publisher / vendor coordination. Some aspect changes (e.g., adding creator_treatment) intersect with editorial policy. Coordinate with the legal / brand-safety teams before promotion.

Trade-offs & Alternatives Considered

Approach | New-aspect coverage | Cost | Stability | Verdict
Annual taxonomy review | Slow | Low | High | Original — falls behind
Quarterly review only | Faster | Medium | High | Better but still reactive
Aspect-discovery loop + versioned schema + uncovered-aspect path | Yes | Medium-High | Medium | Chosen
LLM open-vocab extraction in production | Highest | Very high | Variable | Useful for discovery, brittle as primary serving
"Other" catch-all bucket | Detects-only | Low | High | Doesn't help summaries
Per-format separate models | Yes (per format) | High | High | Operational duplication; reserve for major formats

The discovery loop is the lift; the rest is structure that supports it.

Production Pitfalls

  1. Discovery-loop fatigue. Quarterly open-coding requires consistent annotator and ops time. Without owner accountability, the loop slips and the schema falls behind. Treat the loop as a calendar item, not a project.
  2. LLM candidate noise. LLMs hallucinate aspect candidates that aren't really in the data. Always have humans validate; never auto-promote.
  3. Schema bump invalidates historical labels. When vertical_scroll_ux enters the taxonomy at v7, historical webtoon reviews labeled under v6 don't have the aspect tagged. Re-label a sample of historical reviews under the new schema for the regression set; don't pretend old labels covered the new aspect.
  4. Cross-aspect confusion. "Pacing" means plot pace in shounen and chapter-release pace in webtoons. Add format-conditioning features so the model learns context-specific meaning. Failing to do this is a source of real F1 regression on the older aspect.
  5. Deprecated aspects stay tagged. When print_quality deprecates, don't drop it — keep tagging it (smaller share of reviews mention it). The signal is the trend, not the binary presence.
  6. Coverage-ratio threshold is brittle. The 0.3 coverage threshold for the uncovered-aspect path is tunable. Calibrate it against held-out data (a calibration sketch follows this list); a too-strict threshold misses the gap, a too-lax one fires on routine reviews.
  7. User-facing copy on uncovered aspects matters. "We don't know what these readers said" sounds dumb. "Our analysis is still learning to summarize this topic" is honest and shows ongoing work.
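
A sketch of that calibration, assuming a small human-audited sample where each review carries a has_uncovered_aspect flag and the coverage_ratio computed by the serving function above.

def pick_coverage_threshold(audited, thresholds=(0.2, 0.25, 0.3, 0.35, 0.4, 0.5)):
    """audited: list of (coverage_ratio, has_uncovered_aspect) pairs from a human audit sample."""
    best = None
    for t in thresholds:
        tp = sum(1 for ratio, flagged in audited if ratio < t and flagged)
        fp = sum(1 for ratio, flagged in audited if ratio < t and not flagged)
        fn = sum(1 for ratio, flagged in audited if ratio >= t and flagged)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        if best is None or f1 > best[1]:
            best = (t, f1, precision, recall)
    return best   # threshold with the best F1 on the audit sample

Re-run the calibration whenever the tagger or the taxonomy changes; the right threshold drifts with both.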

Interview Q&A Drill

Opening question

Your ABSA model has been live for 18 months on a fixed 7-aspect taxonomy. Aggregate aspect-detection F1 is unchanged at 0.83. Content ops audits the bot's review summaries on webtoon titles and finds the summary is missing what readers actually discuss — vertical scrolling, panel layout. What's the architectural fix?

Model answer.

The taxonomy is frozen, the eval is built around the frozen taxonomy, and the world has moved. F1 of 0.83 is on aspects that were in the taxonomy 18 months ago; new aspects (vertical scroll, panel layout, AI-art controversies) generate empty extractions because the model can't tag what isn't in its label space. Held-out F1 cannot detect this by construction.

The fix has four pieces.

(1) Aspect taxonomy as a versioned, living artifact. taxonomy.yaml with version, effective_from, effective_to per aspect. Cross-functional ownership (content ops + ABSA modeling + product). Audit log of changes.

(2) Quarterly aspect-discovery loop. Open-coding pilots with humans tagging unconstrained on a 1K-review sample. LLM-aided open-vocabulary extraction on a 10K-review sample, clustered. Content ops + ABSA review the candidates and promote (or reject) into the taxonomy.

(3) Schema-versioned labels and re-training. Labels keyed on (taxonomy_version, review_id). Regression set tracks performance on both old and new aspects. Per-aspect F1 as the gate, not just aggregate.

(4) Uncovered-aspect path at serve time. When the model covers less than ~30% of a review's text with extractions, surface a hint that "readers also discussed [candidate phrases]" — honest about the gap rather than silently dropping signal.

The conceptual move: the aspect schema is a product surface that evolves with the catalog and the cultural context. Treating it as a fixed list is the original failure. The architecture treats schema evolution as a routine process, not a one-off project.

Follow-up grill 1

Quarterly aspect discovery sounds expensive. Can't you just run an LLM on every review and skip the static model entirely?

You can, and at small scale that's defensible. But three problems at production scale.

(1) Cost. Reviews-per-day × LLM tokens × API rates is a non-trivial monthly bill. The fine-tuned BERT-CRF model is orders of magnitude cheaper at serve time.

(2) Stability. LLM versions change (GenAI 04). The same review run today and next month could produce different aspect extractions. For a feature that drives downstream summaries shown to users, that instability is unacceptable.

(3) Aggregation. The summary feature aggregates aspects across thousands of reviews. Free-text aspect labels are hard to aggregate ("vertical scrolling" vs "scroll feel" vs "scroll mechanic"). A canonical aspect taxonomy makes aggregation tractable.

So the LLM serves as candidate generator in the discovery loop (cheap at sample sizes of 10K, run quarterly) but not as the primary serving model. The static model handles production volume on a stable taxonomy. The taxonomy evolves under human supervision.

Follow-up grill 2

When you bump the taxonomy from v6 to v7, your historical labels under v6 don't have the new aspect tagged. How do you avoid F1 looking artificially bad on the new aspect?

Don't compute v7-aspect F1 against v6 labels. Specifically:

(1) Per-aspect-version F1. Each aspect has an effective_from date. F1 on vertical_scroll_ux is computed only on reviews labeled under v7+. Historical v6 labels don't enter that metric.

(2) Re-label a sample under the new schema. Take a random sample of historical reviews (say 2K) and re-label them under v7. This sample becomes the test set for the new aspect's F1.

(3) Holistic regression. The regression set ensures the new model doesn't hurt old aspects' F1. The new aspect's F1 starts from zero and grows; the old aspects' F1 must hold.

The honest framing: the model couldn't have been good at vertical_scroll_ux before v7 because the aspect didn't exist. Reporting "0% F1 on vertical_scroll_ux pre-v7" is meaningless; "0.78 F1 on v7+" is the metric.
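
A sketch of the evaluation filter this implies: only reviews labeled under a schema that contained the aspect enter that aspect's F1 pool. The taxonomy helper and label fields are assumptions consistent with the label keying described earlier.

def eval_pool_for_aspect(labeled_reviews, aspect_id, taxonomy):
    """Return only reviews whose labels were produced under a schema that knew this aspect."""
    introduced_at = taxonomy.version_introducing(aspect_id)   # assumed helper on the taxonomy object
    return [r for r in labeled_reviews if r.taxonomy_version >= introduced_at]

# Per-aspect F1 for vertical_scroll_ux is then computed over eval_pool_for_aspect(...),
# so pre-v7 labels never drag the metric down.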

Follow-up grill 3

Your "uncovered-aspect path" surfaces phrases the model didn't tag. What stops it from surfacing nonsense (typos, slang, off-topic ramblings)?

Two filters before surfacing, plus a conservative serving rule.

(1) Frequency threshold. A candidate phrase must appear in N reviews of the same title (or class of titles) to surface. One-off ramblings don't pass; a dozen reviews mentioning "the binding falls apart" do. This is a corpus-level filter applied at summarization time.

(2) Lightweight semantic filter. The candidate phrase must look like an aspect (noun phrase or aspect-style phrasing), not random text. A small classifier trained to distinguish aspect-like phrases from other content.

(3) Conservative serving. The uncovered-aspect message is shown only when (a) coverage ratio is low and (b) the candidate filter found at least 2 plausible candidates. If neither, the system falls back to the standard summary without the hint.

The deeper commitment: the uncovered-aspect path is for honest gaps, not for surfacing noise. False-positive surfacing ("readers also discussed [random typo]") is worse than silence; calibrate to err on the side of silence.
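
A sketch of the two filters under illustrative thresholds; the regex-based aspect-likeness check stands in for the small trained classifier described above.

import re
from collections import Counter

def surfaceable_candidates(hints, min_reviews=12, max_tokens=4):
    """hints: (review_id, candidate_phrase) pairs emitted by the uncovered-aspect path."""
    reviews_per_phrase = Counter()
    for review_id, phrase in set(hints):   # dedupe so one review counts a phrase once
        reviews_per_phrase[phrase.lower().strip()] += 1

    def looks_like_aspect(phrase):
        # Stand-in for the aspect-likeness classifier: short, alphabetic, not a full sentence.
        tokens = phrase.split()
        return 1 <= len(tokens) <= max_tokens and all(re.fullmatch(r"[a-z'-]+", t) for t in tokens)

    return [p for p, n in reviews_per_phrase.most_common()
            if n >= min_reviews and looks_like_aspect(p)]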

Follow-up grill 4

Aspect labels are expensive to produce. The training set is 80K, and re-labeling a substantial fraction at every taxonomy bump is a budget conversation. How do you keep re-labeling cost bounded?

Three knobs.

(1) Differential re-labeling. When a new aspect is added, only re-label reviews where the new aspect is likely present. Use simple keyword heuristics (e.g., "scroll" tokens for vertical_scroll_ux) to identify candidates, label only those. Reviews irrelevant to the new aspect keep their old labels.

(2) Active learning on the new aspect. Start with a seed of human-labeled examples (~ 500), train a quick model on the new aspect, use it to surface candidates for labeling, validate. Iterate. Cuts labeling effort by 3–5× compared to random sampling.

(3) Stable-aspect re-labeling is rare. Most aspects don't change between schema versions. Re-labeling effort concentrates on the delta (new aspects, deprecated aspects with edge cases, aspects whose meaning shifted), not the entire corpus.

In practice, a taxonomy bump might require re-labeling 5–15K reviews (~ 5–20% of the training set). Spread over a quarter, this is a sustainable cadence with a small annotation team.
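
A sketch of the differential re-labeling selector; the seed keywords per new aspect are illustrative and would be chosen by the annotation lead when the aspect is promoted.

NEW_ASPECT_SEEDS = {
    # Illustrative seeds only; real seeds come from the discovery-loop candidate phrases.
    "vertical_scroll_ux": ["scroll", "scrolling", "swipe"],
    "binding_durability": ["binding", "spine", "fell apart"],
}

def relabel_queue(reviews, new_aspect_ids):
    """Only reviews likely to mention a new aspect go back to annotators; the rest keep old labels."""
    queue = []
    for r in reviews:
        text = r["text"].lower()
        hits = [a for a in new_aspect_ids if any(k in text for k in NEW_ASPECT_SEEDS.get(a, []))]
        if hits:
            queue.append((r["review_id"], hits))
    return queue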

Architect-level escalation 1

The company adds support for fan-fiction, a new content type with reviews that center on aspects the manga taxonomy doesn't even gesture toward (canon-faithfulness, OC quality, ship satisfaction). Do you extend the existing taxonomy or build a separate one?

Both — federated taxonomy.

(1) Shared core aspects that span all content types: art (where applicable), story, pacing, characters. These keep cross-content-type aggregation possible.

(2) Per-content-type extensions. Manga has print_quality (declining); manhwa has vertical_scroll_ux; fan-fiction has canon_faithfulness, oc_quality, ship_satisfaction. Each content type's aspect set extends the core.

(3) Routing at serving time. The content type is known at retrieval. Route to the appropriate aspect-tagger model with the appropriate taxonomy.

(4) Cross-type evaluation is per-shared-aspect. art is comparable across types; per-type extensions are not. Aggregate metrics segment by type.

Federation matters because (a) lumping fan-fiction into the manga taxonomy mis-fits both, (b) building a totally separate stack duplicates effort and breaks core-aspect comparability, (c) the schema growth would be unsustainable. Federation gives both type-specific richness and cross-type aggregation.
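
A sketch of the routing step under a federated taxonomy; the content types, aspect sets, and registry shape are illustrative.

CORE_ASPECTS = {"art", "story", "pacing", "characters"}

TYPE_EXTENSIONS = {
    "manga":       {"print_quality", "translation", "value"},
    "webtoon":     {"vertical_scroll_ux", "chapter_release_consistency"},
    "manhwa":      {"vertical_scroll_ux", "panel_layout"},
    "fan_fiction": {"canon_faithfulness", "oc_quality", "ship_satisfaction"},
}

def aspects_for(content_type):
    """Core aspects keep cross-type aggregation possible; extensions add type-specific richness."""
    return CORE_ASPECTS | TYPE_EXTENSIONS.get(content_type, set())

def tagger_for(content_type, registry):
    """registry: content_type -> aspect-tagger model (assumed model registry); manga is the fallback."""
    return registry.get(content_type, registry["manga"])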

Architect-level escalation 2

A regulator says "your bot's summaries make claims about user opinion. Show that the aspects you summarize accurately represent the population of reviews." How do you respond?

The chain of evidence:

(1) Taxonomy versioning — at any historical time, the aspect taxonomy in effect is reproducible from taxonomy.yaml@version. The taxonomy has explicit definitions, examples, and effective dates.

(2) Per-review aspect labels — each review's extracted aspects are stored with (taxonomy_version, model_version, review_id, spans, polarities). The chain from raw review to summary aspect is auditable.

(3) Aggregation methodology — the summary's "readers say X" claim is a function of aggregated extractions. The methodology (e.g., "an aspect is reported in the summary if mentioned in ≥ 5% of reviews with consistent polarity") is documented.

(4) Coverage disclosure — when the uncovered-aspect path fires, the summary explicitly discloses topics the analysis didn't cover. Honest framing avoids the regulator's "you claimed completeness" challenge.

(5) Per-aspect F1 + sample human audit — the model's aspect-level performance is documented. For each summary, sampling-based human audit confirms or rejects the aspect claim. Audit results are logged.

The harder regulator question: "are there reviews you mis-classified that affected the summary?" The audit-driven answer: yes, sample-based estimation says X% of summaries contain at least one mis-attributed aspect. The architectural commitment is measured fallibility — we don't claim 100% correctness, we report estimated error and disclose uncertainty in the user-facing copy.

The architectural pattern: regulator-readiness is an audit-trail discipline, not a perfection claim. Taxonomy versioned, labels traceable, methodology documented, errors quantified.

Red-flag answers

  • "Add the new aspects and retrain." (Reactive only.)
  • "Use an LLM for aspect extraction in production." (Cost + stability.)
  • "Add an 'other' bucket." (Doesn't help summaries.)
  • "Trust user reports for missing aspects." (Sparse, slow.)
  • "F1 is fine; ship." (F1 is measured against a frozen schema; structurally blind.)

Strong-answer indicators

  • Recognizes the taxonomy as a living artifact, not a constant.
  • Has a discovery loop with humans validating LLM-generated candidates.
  • Schema-versioned labels and per-aspect-version F1.
  • Uncovered-aspect path that surfaces gaps honestly.
  • Federated taxonomy for new content types.
  • Audit-trail thinking for regulatory readiness.