Question 20
Domain 2: Core Machine Learning, AI, and Transformer Foundations
Which metric is most appropriate for evaluating text summarization tasks?
Correct answer: B
Explanation
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) was designed for summarization: it measures n-gram overlap between a generated summary and one or more reference summaries, with an emphasis on recall. Summaries are judged by how much of the key content in the human-written references they capture, which makes ROUGE the standard automatic metric for the task.
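To make the recall-oriented n-gram overlap concrete, here is a minimal sketch of ROUGE-N recall in plain Python. It assumes lowercased whitespace tokenization and a single reference; the function names and the example sentences are illustrative only, and real evaluations normally rely on an established ROUGE package with stemming and multi-reference support.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by the number of reference n-grams.

    Assumption: simple lowercased whitespace tokenization, one reference.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is matched at most as often as it appears
    # in the candidate (clipping).
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


# Hypothetical example sentences, not taken from any dataset.
reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"

rouge_1 = rouge_n_recall(candidate, reference, n=1)  # 5 of 6 unigrams matched
rouge_2 = rouge_n_recall(candidate, reference, n=2)  # 3 of 5 bigrams matched
print(f"ROUGE-1 recall: {rouge_1:.3f}, ROUGE-2 recall: {rouge_2:.3f}")
```

Dividing by the reference n-gram count is what makes the score recall-oriented: it asks how much of the reference the summary covers, not how much of the summary is correct.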
Why each option is right or wrong
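A. BLEU score
BLEU is also an n-gram overlap metric, but it is precision-oriented, applies a brevity penalty, and was designed for machine translation, so it is a weaker fit for judging how much reference content a summary covers.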
B. ROUGE score
ROUGE is the standard automatic evaluation metric for summarization because it compares a system-generated summary against one or more human reference summaries using n-gram overlap, longest common subsequence, and skip-bigram variants. In practice, the most commonly reported forms are ROUGE-1 (unigram overlap), ROUGE-2 (bigram overlap), and ROUGE-L (longest common subsequence). They score how much of the reference content the summary recovers, which is the main thing summarization is judged on.
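C. Perplexity
Perplexity measures how well a language model predicts a sequence of tokens. It is an intrinsic measure of the model and does not compare a generated summary against a reference.
D. F1-score
The F1-score is the harmonic mean of precision and recall and is typically used for classification tasks. ROUGE results can be reported as an F-measure, but F1 by itself does not specify how to compare a summary with a reference.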
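The longest common subsequence variant mentioned under option B (ROUGE-L) can be sketched as follows. This is a simplified sentence-level illustration that assumes lowercased whitespace tokenization and a balanced F1; the function names and example sentences are hypothetical, and official ROUGE implementations differ in details such as F-measure weighting and summary-level aggregation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Return (recall, precision, f1) based on the LCS of the two texts.

    Assumption: balanced F1 rather than the weighted F-measure used by
    some ROUGE implementations.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(cand, ref)
    recall = lcs / len(ref)
    precision = lcs / len(cand)
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return recall, precision, f1


# Hypothetical example: a short candidate that covers half the reference.
recall, precision, f1 = rouge_l("the cat sat", "the cat sat on the mat")
print(f"ROUGE-L recall={recall:.3f} precision={precision:.3f} f1={f1:.3f}")
```

In this example the candidate is fully contained in the reference, so precision is perfect while recall is only one half. That gap is why recall-oriented scoring matters for summaries that omit content.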