# Scenario 04: Japanese Content Fluency Evaluation

**Parent Skill:** 01-fm-output-quality-assessment.md
**Focus:** Evaluating fluency when MangaAssist handles Japanese manga titles, character names, and bilingual content.
## Questions

### Easy
- What does fluency mean for a chatbot? How is fluency different from factual accuracy and relevance?
- Why is fluency especially challenging for MangaAssist? Give two examples of fluency issues specific to a Japanese manga chatbot.
- How would you score fluency on a 1-5 scale? Provide concrete MangaAssist examples for each level (a rubric-as-code sketch follows this list).
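One way to make the 1-5 scale concrete is to encode it as data, so human annotators and automated judges read the same anchors. A minimal Python sketch; the level descriptions and anchor responses are illustrative, not an official MangaAssist rubric.

```python
# A 1-5 fluency rubric encoded as data. Descriptions and anchor examples
# are illustrative, not an official MangaAssist rubric.
FLUENCY_RUBRIC = {
    1: ("Incoherent or garbled; meaning is lost.",
        "Titan attack volume when 34 release is shipping?"),
    2: ("Understandable but with grammar errors or broken romanization.",
        "Shingeki no Kyojin volume 34 are release on store next week."),
    3: ("Grammatical but stiff, repetitive, or awkwardly literal.",
        "The manga Attack on Titan Vol. 34 is available. "
        "The manga Attack on Titan Vol. 34 costs $10.99."),
    4: ("Natural and clear, with minor awkwardness such as inconsistent title forms.",
        "Attack on Titan (Shingeki no Kyojin) Vol. 34 is in stock for $10.99!"),
    5: ("Fully natural; consistent naming and a tone that fits the query.",
        "Good news: Attack on Titan Vol. 34 is in stock for $10.99. "
        "Want me to add it to your cart?"),
}

def rubric_card(score: int) -> str:
    """Render the text an annotator sees for a given level."""
    description, anchor = FLUENCY_RUBRIC[score]
    return f"{score} - {description}\n    e.g. {anchor}"

if __name__ == "__main__":
    for level in sorted(FLUENCY_RUBRIC):
        print(rubric_card(level))
```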
### Medium
- Japanese title romanization consistency: MangaAssist encounters titles in kanji, romaji, and English. "進撃の巨人" = "Shingeki no Kyojin" = "Attack on Titan." How do you evaluate fluency when the chatbot must handle all three forms? (A normalization sketch follows this list.)
- Code-switching fluency: A user asks in English but includes Japanese terms ("I'm looking for a good isekai manga with a strong MC"). How do you evaluate whether the chatbot's response handles the code-switching naturally?
- Automated fluency scoring: Design an automated fluency evaluator for MangaAssist. What metrics would you use (perplexity, grammar check, readability)? What are the limitations? (A composite-score sketch follows this list.)
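For the romanization-consistency question, one cheap evaluation signal is whether a single response keeps switching between forms of the same title. A minimal sketch, assuming a hypothetical alias table mapping surface forms to canonical catalog IDs:

```python
# Detect inconsistent title usage across kanji / romaji / English forms.
# TITLE_ALIASES is a hypothetical fragment of a catalog alias table.
TITLE_ALIASES = {
    "進撃の巨人": "AOT",
    "shingeki no kyojin": "AOT",
    "attack on titan": "AOT",
}

def title_mentions(response: str) -> dict[str, set[str]]:
    """Map canonical title ID -> surface forms actually used in the response."""
    lowered = response.lower()  # kanji is unaffected by lower()
    found: dict[str, set[str]] = {}
    for surface, canonical in TITLE_ALIASES.items():
        if surface in lowered:
            found.setdefault(canonical, set()).add(surface)
    return found

def mixes_forms(response: str) -> bool:
    """Flag a response that names the same title three or more different ways.
    One alternate form in parentheses is normal; three forms reads as sloppy."""
    return any(len(forms) >= 3 for forms in title_mentions(response).values())

print(mixes_forms(
    "進撃の巨人 (Attack on Titan) is great; Shingeki no Kyojin Vol. 1 ships today."
))  # True
```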
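For the automated-scoring question, a composite evaluator might blend a language-model perplexity signal with a readability signal (a grammar checker such as language_tool_python could be a third input). A sketch using GPT-2 via transformers plus textstat; the weights and scaling are illustrative, and the limitations in the docstring are the part interviewers usually want to hear:

```python
# Composite automated fluency score from two off-the-shelf signals:
# GPT-2 perplexity (transformers) and Flesch reading ease (textstat).
# Weights and scaling are illustrative, not tuned values.
import math
import textstat
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

def fluency_score(text: str) -> float:
    """Blend signals into a rough 0-1 score. Known limitations: perplexity
    spikes on legitimate Japanese proper nouns, and readability metrics
    penalize necessary detail, so this should route responses to human
    review rather than replace it."""
    ppl_signal = max(0.0, 1.0 - math.log(perplexity(text)) / 10.0)
    ease = textstat.flesch_reading_ease(text)
    ease_signal = min(max(ease, 0.0), 100.0) / 100.0
    return 0.6 * ppl_signal + 0.4 * ease_signal

print(f"{fluency_score('Attack on Titan Vol. 34 is in stock for $10.99.'):.2f}")
```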
### Hard
- Fluency vs. accuracy conflict: The most fluent response to "When does Volume 41 come out?" is "Volume 41 will be released on March 15!" — but that date is fabricated. A less fluent but truthful response is "I don't currently have release date information for Volume 41. I recommend checking the publisher's website." How do you balance fluency scoring against accuracy guardrails? (A gating sketch follows this list.)
- Honorifics and cultural fluency: Manga fans expect certain cultural conventions: "-san," "-sensei" for authors, "mangaka" instead of "comic artist." How do you evaluate cultural fluency without hard-coding rules for every convention? (A judge-prompt sketch follows this list.)
- Response length fluency: The chatbot gives a 3-paragraph answer to "What's the price?" The answer is factually accurate, but the length itself is a fluency issue. Design a fluency evaluator that considers response length appropriateness relative to query complexity. (A length-band sketch follows this list.)
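One common answer to the fluency-vs-accuracy question is that accuracy should gate fluency rather than trade off against it. A minimal sketch, assuming an upstream claim-verification step (not shown) that emits a boolean:

```python
# Accuracy as a gate on fluency, not a tradeoff. The fact-check signal is
# assumed to come from an upstream claim-verification step; here it is
# just a boolean input.
def overall_quality(fluency: float, accuracy_passed: bool) -> float:
    """Fluency (0-1) only counts if the response passed accuracy checks.
    A fabricated release date scores near zero no matter how fluent."""
    if not accuracy_passed:
        return min(fluency, 0.1)  # hard cap: fluent fabrication is still a failure
    return fluency

# The fluent-but-fabricated answer loses to the honest one:
print(overall_quality(0.95, accuracy_passed=False))  # 0.1
print(overall_quality(0.70, accuracy_passed=True))   # 0.7
```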
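For cultural fluency, an LLM-as-judge rubric avoids hard-coding every honorific and community term. A sketch; the prompt wording is illustrative, and `call_llm` is a placeholder for whatever model client is in use:

```python
# Cultural-fluency grading via an LLM judge primed with a rubric, instead of
# hard-coded honorific rules. The rubric wording is illustrative, and
# `call_llm` is a placeholder for whatever model client is in use.
from typing import Callable

JUDGE_PROMPT = """\
You are grading a manga store chatbot's reply for cultural fluency.
Rate 1-5 on whether the reply uses conventions a manga fan expects:
honorifics where natural (e.g., "-sensei" for authors), community terms
("mangaka", "tankobon") used correctly, and no awkward over-translation.
Do NOT reward sprinkling Japanese terms where they do not belong.

User query: {query}
Chatbot reply: {reply}

Answer with a single integer 1-5 and one sentence of justification."""

def judge_cultural_fluency(query: str, reply: str,
                           call_llm: Callable[[str], str]) -> str:
    return call_llm(JUDGE_PROMPT.format(query=query, reply=reply))
```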
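For length appropriateness, one design is expected word-count bands keyed to query complexity. A sketch with a deliberately crude complexity proxy; real bands would be fit from human-rated examples per intent:

```python
# Length-appropriateness scoring: expected word-count bands keyed to a crude
# query-complexity proxy. The bands and the classifier are illustrative.
LENGTH_BANDS = {
    "simple": (1, 40),     # "What's the price?" -> one short sentence
    "moderate": (20, 120),
    "complex": (60, 300),  # multi-part recommendation requests
}

def classify_query(query: str) -> str:
    """Word count as a stand-in for complexity; a learned classifier or the
    router's intent label would replace this in a real evaluator."""
    words = len(query.split())
    if words <= 6:
        return "simple"
    if words <= 15:
        return "moderate"
    return "complex"

def length_appropriateness(query: str, response: str) -> float:
    """1.0 inside the band, decaying linearly toward 0.0 outside it."""
    lo, hi = LENGTH_BANDS[classify_query(query)]
    n = len(response.split())
    if lo <= n <= hi:
        return 1.0
    overshoot = (lo - n) / lo if n < lo else (n - hi) / hi
    return max(0.0, 1.0 - overshoot)

print(length_appropriateness("What's the price?", "It's $10.99."))  # 1.0
print(length_appropriateness("What's the price?", "word " * 100))   # 0.0
```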
### Very Hard
- Multi-lingual fluency degradation: MangaAssist supports English queries about Japanese content. You notice that responses mentioning 5+ Japanese proper nouns (character names, places, techniques) have lower fluency scores. The LLM is token-splitting Japanese names into subwords, degrading generation quality. How do you diagnose and fix this at the evaluation level? (A tokenizer-fertility sketch follows this list.)
- Fluency evaluation calibration across annotators: Three annotators score the same response: 4, 3, 5. How do you calibrate fluency evaluation across human annotators? Design an inter-annotator agreement protocol and a rubric that reduces subjectivity for manga chatbot responses. (An agreement sketch follows this list.)
- Dynamic fluency standards: Fluency expectations differ by intent: `chitchat` should be warm and conversational, `order_tracking` concise and structured, `recommendation` enthusiastic but informative. Design an intent-aware fluency evaluation framework that applies different rubrics per intent. (An intent-dispatch sketch follows this list.)
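For the tokenization question, the diagnosis step is measurable: count the subword "fertility" (tokens per proper noun) and check whether high-fertility responses are the ones dragging fluency down. A sketch using a Hugging Face tokenizer; "gpt2" stands in for whatever model MangaAssist actually uses:

```python
# Diagnose whether Japanese proper nouns are being shredded into subwords.
# High fertility (tokens per name) correlating with low fluency scores
# supports the token-splitting hypothesis. Model name is illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def fertility(name: str) -> int:
    """Number of subword tokens the model needs for one proper noun."""
    return len(tokenizer.tokenize(name))

for name in ["Eren", "Levi", "Hange Zoe", "進撃の巨人", "Shingeki no Kyojin"]:
    print(f"{name!r}: {fertility(name)} tokens")

# Evaluation-level fixes (not shown): bucket the eval set by proper-noun
# fertility so regressions stay visible, and report fluency separately for
# high-fertility responses instead of averaging the problem away.
```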
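For the calibration question, agreement on ordinal 1-5 labels is commonly quantified with quadratic-weighted Cohen's kappa, computed pairwise across annotators. A sketch with made-up scores; persistently low kappa usually means the rubric needs tighter anchors, not more annotators:

```python
# Quantify annotator agreement on 1-5 fluency labels with pairwise
# quadratic-weighted Cohen's kappa (scikit-learn). The scores are made up.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

annotations = {  # annotator -> scores for the same six responses
    "A": [4, 3, 5, 2, 4, 4],
    "B": [3, 3, 5, 2, 3, 4],
    "C": [5, 4, 5, 3, 4, 5],
}

for (n1, s1), (n2, s2) in combinations(annotations.items(), 2):
    kappa = cohen_kappa_score(s1, s2, weights="quadratic")
    print(f"{n1} vs {n2}: kappa = {kappa:.2f}")
```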
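For the intent-aware framework, one design is rubric dispatch: each intent carries its own tone expectation, length expectation, and weighting. A sketch; the intent labels come from the scenario, while the per-intent expectations and weights are illustrative:

```python
# Intent-aware fluency evaluation as rubric dispatch. Intent labels come
# from the scenario; per-intent expectations and weights are illustrative.
from dataclasses import dataclass

@dataclass
class IntentRubric:
    tone: str               # what "good tone" means for this intent
    max_words: int          # structural length expectation
    weight_tone: float
    weight_concision: float

RUBRICS = {
    "chitchat": IntentRubric("warm, conversational", 80, 0.7, 0.3),
    "order_tracking": IntentRubric("concise, structured", 50, 0.3, 0.7),
    "recommendation": IntentRubric("enthusiastic, informative", 200, 0.5, 0.5),
}

def evaluate(intent: str, tone_score: float, response: str) -> float:
    """tone_score (0-1) would come from an LLM judge primed with rubric.tone;
    concision is scored against the intent's own length expectation."""
    rubric = RUBRICS[intent]
    concision = min(1.0, rubric.max_words / max(1, len(response.split())))
    return rubric.weight_tone * tone_score + rubric.weight_concision * concision

print(evaluate("order_tracking", 0.9, "Your order #1234 ships Friday."))
```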