NCA-GENL Practice Q22

A. To compress the KV cache memory footprint during inference by quantizing stored key-value pairs to lower precision formats like INT8 or FP8

B. To reduce the computational complexity of the self-attention mechanism by approximately half, since only the lower triangular portion of the attention matrix needs to be calculated

C. To enable bidirectional attention patterns where each token can attend to both preceding and following positions, improving contextual understanding of the full sequence

D. To prevent the model from attending to future tokens during training, ensuring each position only depends on previous positions

The correct answer is D. In decoder-only transformers, a causal mask is applied to self-attention so that position *t* cannot assign attention weight to any position *> t*. Concretely, the entries of the attention score matrix above the main diagonal (the strictly upper-triangular part) are set to −∞ before the softmax, so future positions receive zero weight and contribute nothing to the context vector. The autoregressive training objective requires this: the model learns to predict the next token from prior tokens only. Without the mask, the network could read target information from later positions, and the training loss would be artificially low.
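As a minimal sketch of the masking step described above, the following pure-Python snippet applies a causal mask to a toy score matrix and then takes a row-wise softmax. The function name `causal_attention_weights` and the score values are illustrative assumptions, not part of any library; real implementations operate on batched tensors and typically fuse the mask into the attention kernel.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix with a causal mask.

    scores[t][s] is the raw attention logit from query position t to key
    position s. Entries with s > t (future positions) are replaced by -inf
    before the softmax, so they end up with exactly zero weight.
    """
    n = len(scores)
    weights = []
    for t in range(n):
        # Mask out future positions for this query row.
        masked = [scores[t][s] if s <= t else -math.inf for s in range(n)]
        # The diagonal entry is never masked, so the row maximum is finite.
        row_max = max(masked)
        exps = [math.exp(x - row_max) for x in masked]  # exp(-inf) == 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical 3x3 score matrix; the values are arbitrary.
scores = [
    [0.5, 2.0, 1.0],
    [1.0, 1.0, 3.0],
    [0.0, 0.0, 0.0],
]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```

Note that the large logits at future positions (2.0 in the first row, 3.0 in the second) have no effect on the output: each row's weights sum to 1 over positions up to and including *t* only.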
