Question 3
Domain 2: Evaluation, Tuning, and Quality Optimization
What metrics are appropriate for each task?
Correct answer: A
Explanation
These metrics match each task's goal. Summarization is judged by overlap with a reference and by content preservation, so ROUGE, faithfulness, and compression ratio fit. QA is scored with exact match, F1, and answer relevance because these measure whether the answer matches the reference and addresses the question. Creative writing relies on diversity, coherence, and user preference because quality is more subjective and style-driven.
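To make the two reference-based QA scores concrete, here is a minimal sketch of exact match and token-level F1. It assumes a SQuAD-style normalization (lowercasing, dropping punctuation and English articles); the function names `normalize`, `exact_match`, and `token_f1` are illustrative and do not come from any particular library. Answer relevance is left out because it usually needs a model or human judge rather than a string comparison.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Assumption: two empty answers count as a match, one empty as a miss.
        return float(pred == ref)
    # Counter intersection clips each token to its count in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower", "eiffel tower."))
    print(token_f1("in Paris France", "Paris"))
```

Exact match is strict and rewards only fully correct answers, while F1 gives partial credit when the prediction contains the reference answer plus extra tokens, which is why the two are usually reported together.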
Why each option is right or wrong
A. Summarization: ROUGE, faithfulness, compression ratio. QA: Exact match, F1, answer relevance. Creative writing: Diversity, coherence, user preference.
The metric set aligns with the evaluation objective of each task: summarization is typically assessed with ROUGE for n-gram overlap, plus faithfulness to check factual consistency and compression ratio to measure how much the source was condensed. For QA, exact match and F1 are standard reference-based scores, and answer relevance checks whether the response actually addresses the question; for creative writing, diversity, coherence, and user preference are appropriate because there is usually no single gold answer and quality is judged more subjectively.
B. Use accuracy for all tasks.
C. Use BLEU for all tasks.
D. Use user satisfaction for all tasks.
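The task-specific metrics named in option A can be sketched in a few lines, which also shows why no single score covers all three tasks. This is a simplified illustration under stated assumptions: whitespace tokenization, ROUGE-1 reduced to unigram recall (real ROUGE implementations add stemming and report precision and F-measure too), compression ratio defined as source length over summary length (some sources use the inverse), and diversity measured as distinct-n. The function names are hypothetical. Faithfulness, coherence, and user preference are omitted because they generally require a model or human judgment.

```python
from collections import Counter


def rouge1_recall(summary: str, reference: str) -> float:
    """Share of reference unigrams that also appear in the summary (clipped counts)."""
    summary_counts = Counter(summary.lower().split())
    reference_counts = Counter(reference.lower().split())
    total = sum(reference_counts.values())
    if total == 0:
        return 0.0
    overlap = sum((summary_counts & reference_counts).values())
    return overlap / total


def compression_ratio(source: str, summary: str) -> float:
    """Source length divided by summary length, in whitespace tokens."""
    summary_len = len(summary.split())
    if summary_len == 0:
        raise ValueError("summary is empty")
    return len(source.split()) / summary_len


def distinct_n(texts: list[str], n: int = 2) -> float:
    """Unique n-grams divided by total n-grams across all generated texts."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


if __name__ == "__main__":
    print(rouge1_recall("the cat sat", "the cat sat on the mat"))
    print(compression_ratio("a b c d e f", "a b"))
    print(distinct_n(["a b a b"], n=2))
```

Each function answers a different question (content overlap, degree of condensation, variety of output), so applying any one of them to every task would measure the wrong thing for at least two of the three.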