NCA-GENL Practice Q11

A. Better memory efficiency

B. Parallel processing capability

The defining architectural change is that the Transformer removes recurrence and applies self-attention over the full sequence, so no position has to wait for a hidden state carried over from the previous time step. In the original Transformer paper (Vaswani et al., 2017, "Attention Is All You Need"), this lets all positions be computed simultaneously during training, whereas an RNN is inherently sequential across time steps. The advantage is weaker at inference: autoregressive decoding still generates output tokens one at a time, although the positions of the input sequence can still be processed in parallel.
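The contrast can be illustrated with a minimal sketch in plain Python. It is not the implementation from the paper: the function names `rnn_forward` and `attend` are hypothetical, the RNN is reduced to scalar weights, and the attention uses a single head with identity query/key/value projections, purely to show the dependency structure.

```python
import math

def rnn_forward(xs, w_x, w_h):
    """Toy scalar RNN: each hidden state needs the previous one."""
    h = 0.0
    states = []
    for x in xs:  # steps must run in order; step t depends on step t-1
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def attend(i, xs):
    """Toy self-attention output for position i (identity projections).

    It reads only the input vectors xs, never another position's output,
    so every position could be computed at the same time.
    """
    d = len(xs[0])
    q = xs[i]
    # Scaled dot-product scores of position i against every position.
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors (here the inputs themselves).
    return [sum(w * v[j] for w, v in zip(weights, xs)) for j in range(d)]

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Attention outputs do not depend on evaluation order.
in_order = [attend(i, xs) for i in range(len(xs))]
reversed_order = [attend(i, xs) for i in reversed(range(len(xs)))][::-1]
order_independent = in_order == reversed_order

# The RNN has no such freedom: its loop cannot be reordered.
hidden = rnn_forward([1.0, 2.0, 3.0], w_x=0.5, w_h=0.5)
```

Because `attend` touches only the inputs, a real implementation batches all positions into a few matrix multiplications, which is what makes training parallel across the sequence. The loop in `rnn_forward` cannot be batched the same way, since each iteration consumes the result of the one before it.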

C. Smaller model size

D. Faster inference only

Question 11

Explanation

Why each option is right or wrong