Question 26
Domain 3: Applications of Foundation Models
An education company is building a chatbot whose target audience is teenagers. The company is training a custom large language model (LLM). The company wants the chatbot to speak in the target audience's language style by using creative spelling and shortened words. Which metric will assess the LLM's performance?
Correct answer: D
Explanation
BLEU measures how closely generated text matches reference text by comparing n-gram overlap, so it can assess whether the chatbot’s output follows the target style and wording patterns. Since the model is being trained to imitate teenage language with “creative spelling and shortened words,” BLEU is the metric used to compare that generated language against expected examples.
Why each option is right or wrong
A. F1 score
F1 score is the harmonic mean of precision and recall and is used mainly for classification-style tasks, not stylistic text generation.
B. BERTScore
BERTScore measures semantic similarity by comparing contextual token embeddings, so it rewards matching meaning rather than direct matching of slang-like wording patterns.
C. Recall-Oriented Understudy for Gisting Evaluation (ROUGE)
ROUGE is commonly used for summarization by checking recall of reference content, not conversational style imitation.
D. Bilingual Evaluation Understudy (BLEU) score
BLEU is a widely used n-gram overlap metric that compares generated text against one or more reference outputs, so it directly measures how closely the chatbot’s wording matches the expected teen-style examples. In practice, it computes clipped (modified) n-gram precision, typically for 1- to 4-grams, combines those precisions as a geometric mean, and multiplies the result by a brevity penalty for outputs shorter than the reference. That makes it suitable when the goal is to reproduce a particular phrasing pattern such as shortened words and nonstandard spellings.
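The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes whitespace tokenization, a single reference, and no smoothing, and the function names (`ngrams`, `bleu`) and the example sentences are hypothetical, chosen only for this sketch.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference, without smoothing (assumption)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # a zero precision drives the geometric mean to zero
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical teen-style reference and three candidate outputs.
reference = "omg u r gonna luv this bc its so ez"
print(bleu("omg u r gonna luv this bc its so ez", reference))  # same wording
print(bleu("omg u r gonna luv this", reference))  # same style, but too short
print(bleu("oh my god you are going to love this because it is so easy", reference))
```

In this sketch, the standard-English paraphrase shares no two-word sequence with the reference, so its score falls to zero even though the meaning is the same. That is the behavior the explanation relies on: BLEU rewards reproducing the reference wording itself, including its spellings and abbreviations.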