NCA-GENL Practice Q21

A. The encoder typically has more layers than the decoder to better process and encode input representations before passing them to generation

B. The encoder exclusively uses self-attention while the decoder relies solely on cross-attention to encoder outputs for all token interactions

C. The decoder uses masked self-attention to prevent attending to future tokens during training, while the encoder uses bidirectional attention

In the standard transformer architecture introduced by Vaswani et al. (2017), the encoder’s self-attention is unmasked, so each source token can attend to every other token in the source sequence. By contrast, the decoder’s self-attention is causally masked: each position is prevented from attending to later positions. This preserves the autoregressive property, so training on a full target sequence is consistent with generating it one token at a time.

D. The decoder uses larger embedding dimensions and wider feed-forward layers than the encoder to improve generation quality and fluency
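
The contrast described for option C (bidirectional attention in the encoder, masked self-attention in the decoder) can be illustrated with a small sketch. This is a toy example in plain Python, not an implementation from any particular library: the helper names `softmax` and `attention_weights` and the uniform 4-token score matrix are assumptions made purely for illustration.

```python
import math

def softmax(row):
    # Numerically stable softmax; entries set to -inf end up with weight 0.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, causal):
    """Convert raw scores[i][j] (query i, key j) into attention weights.

    causal=False: every position may attend to every position (encoder-style).
    causal=True:  position i may attend only to positions j <= i (decoder-style).
    """
    n = len(scores)
    weights = []
    for i in range(n):
        row = [
            scores[i][j] if (not causal or j <= i) else float("-inf")
            for j in range(n)
        ]
        weights.append(softmax(row))
    return weights

# Hypothetical 4-token sequence with identical raw scores everywhere,
# so any difference in the weights comes from the mask alone.
scores = [[0.0] * 4 for _ in range(4)]

encoder_w = attention_weights(scores, causal=False)
decoder_w = attention_weights(scores, causal=True)

print("encoder-style (bidirectional):")
for row in encoder_w:
    print(row)

print("decoder-style (causal mask):")
for row in decoder_w:
    print(row)
```

With the causal flag set, every weight above the diagonal is zero, so the first position attends only to itself while the last position attends to the whole prefix. Without the flag, every row spreads its weight over all four positions, which matches the unmasked encoder behaviour described in the explanation.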

Question 21

Explanation

Why each option is right or wrong