Question 21
Domain 2: Core Machine Learning, AI, and Transformer Foundations
What is the key difference between the encoder and decoder components in an encoder-decoder transformer architecture?
Correct answer: C
Explanation
In an encoder-decoder transformer, the encoder processes the input with bidirectional attention, so each token can attend to all others in the source sequence. The decoder uses masked (causal) self-attention, which prevents each position from attending to future tokens during training, so the model learns to predict each token only from the tokens that precede it, as it must when generating autoregressively.
Why each option is right or wrong
A. The encoder typically has more layers than the decoder to better process and encode input representations before passing them to generation
B. The encoder exclusively uses self-attention while the decoder relies solely on cross-attention to encoder outputs for all token interactions
C. The decoder uses masked self-attention to prevent attending to future tokens during training, while the encoder uses bidirectional attention
Under the standard transformer architecture introduced by Vaswani et al. (2017), the encoder’s self-attention is unmasked, so each source token can attend to every other source token in the sequence. By contrast, the decoder’s self-attention is causal/masked: positions are prevented from attending to later positions, which is what enforces autoregressive training and generation one token at a time.
D. The decoder uses larger embedding dimensions and wider feed-forward layers than the encoder to improve generation quality and fluency
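The contrast described under option C, unmasked encoder attention versus causal decoder attention, can be illustrated with a small sketch. This is a toy example in plain Python and not code from any particular library: the function name attention_weights, the causal flag, and the all-zero scores are assumptions chosen for illustration. It applies a row-wise softmax to a matrix of attention scores, optionally masking out future positions first.

```python
import math

def attention_weights(scores, causal):
    """Row-wise softmax over raw attention scores (hypothetical helper).

    scores[i][j] is the score of query position i against key position j.
    If causal is True, position i may only attend to positions j <= i.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Replace scores for future positions (j > i) with -inf so that
        # the softmax assigns them exactly zero weight.
        row = [
            scores[i][j] if (not causal or j <= i) else float("-inf")
            for j in range(n)
        ]
        m = max(row)  # finite, because position i can always attend to itself
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy scores for a 4-token sequence. All zeros, so the weights are
# uniform over whichever positions the mask allows.
scores = [[0.0] * 4 for _ in range(4)]

encoder_weights = attention_weights(scores, causal=False)  # bidirectional
decoder_weights = attention_weights(scores, causal=True)   # masked

print("encoder-style (unmasked):")
for row in encoder_weights:
    print([round(w, 2) for w in row])

print("decoder-style (causal mask):")
for row in decoder_weights:
    print([round(w, 2) for w in row])
```

In the unmasked case every row spreads weight across all four positions. In the causal case the weights above the diagonal are zero, so the first position attends only to itself and the last position attends to the whole sequence.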