Question 22
Domain 2: Core Machine Learning, AI, and Transformer Foundations
Why do decoder-only architectures like GPT use causal (masked) self-attention?
Correct answer: D
Explanation
Decoder-only models use causal (masked) self-attention so that each token can attend only to itself and earlier tokens. This enforces the autoregressive rule that the prediction made at position t depends only on positions up to t, never on later ones, which prevents information leakage during training.
Why each option is right or wrong
A. To compress the KV cache memory footprint during inference by quantizing stored key-value pairs to lower precision formats like INT8 or FP8
B. To reduce the computational complexity of the self-attention mechanism by approximately half, since only the lower triangular portion of the attention matrix needs to be calculated
C. To enable bidirectional attention patterns where each token can attend to both preceding and following positions, improving contextual understanding of the full sequence
D. To prevent the model from attending to future tokens during training, ensuring each position only depends on previous positions
In decoder-only transformers, the self-attention mask ensures that position *t* cannot assign attention weight to any token at positions *> t*. In practice, the entries above the diagonal of the attention score matrix are set to negative infinity before the softmax, so they receive zero weight and future tokens are excluded from the context. This is required by the autoregressive training objective, in which the model learns to predict the next token from prior tokens only; without the mask, the network could read target information from later positions and the training loss would be artificially low.
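The masking step can be illustrated with a minimal sketch in plain Python. The function name `causal_attention_weights` and the example scores are hypothetical, chosen only for illustration. Real models apply the same idea to batched tensors inside each attention head, and the usual scaling by the square root of the key dimension is omitted here for brevity.

```python
import math

def causal_attention_weights(scores):
    """Mask a square matrix of raw attention scores causally, then softmax each row.

    scores[t][s] is the raw score of query position t against key position s
    (illustrative values, not taken from any real model).
    """
    n = len(scores)
    weights = []
    for t in range(n):
        # Entries with s > t (above the diagonal) become -inf before the softmax.
        masked = [scores[t][s] if s <= t else -math.inf for s in range(n)]
        # The diagonal entry (s == t) is never masked, so row_max is finite.
        row_max = max(masked)
        # exp(-inf) evaluates to 0.0, so masked positions get zero weight.
        exps = [math.exp(x - row_max) for x in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw scores for a 3-token sequence.
scores = [
    [0.5, 2.0, 1.0],
    [1.0, 1.0, 3.0],
    [0.2, 0.4, 0.6],
]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

In the resulting matrix, every entry above the diagonal is exactly zero and each row still sums to one, so position 0 attends only to itself even though its unmasked scores favored position 1.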