Question 17
Domain 2: Core Machine Learning, AI, and Transformer Foundations
What is the primary role of the feedforward layers in a transformer architecture?
Correct answer: D
Explanation
Feedforward layers in a transformer act on each token separately, applying the same position-wise transformation, with shared weights, to every token representation. The feedforward network that follows attention refines each token’s features independently; it does not mix information across positions.
Why each option is right or wrong
A. To normalize the output of the attention layers by applying batch normalization after each self-attention computation step
B. To add positional information to the token embeddings by encoding absolute sequence positions into learned vector representations
C. To compute attention scores between tokens by calculating the scaled dot-product of query and key matrices across all positions
D. To apply position-wise transformations independently to each token representation
In the standard transformer block described by Vaswani et al. (2017), the position-wise feed-forward network is applied separately to each token representation after attention, using the same two linear layers with a ReLU nonlinearity between them at every position. Its purpose is to transform the features within each token vector independently; cross-token interaction is handled by the self-attention sublayer, not by the feedforward sublayer.
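
The position-wise behavior described above can be checked with a small sketch. It assumes the form FFN(x) = max(0, x W1 + b1) W2 + b2 from Vaswani et al. (2017); the helper names (linear, ffn_token, ffn_sequence), the toy sizes, the zero biases, and the random weights are illustrative assumptions, not part of any particular library, and residual connections and layer normalization are left out.

```python
import random

def linear(vec, weights, bias):
    """Compute vec @ weights + bias for a single token vector.

    weights is a list of rows with shape (d_in, d_out).
    """
    return [
        sum(v * w for v, w in zip(vec, column)) + b
        for column, b in zip(zip(*weights), bias)
    ]

def ffn_token(vec, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for one token."""
    hidden = [max(0.0, h) for h in linear(vec, w1, b1)]
    return linear(hidden, w2, b2)

def ffn_sequence(tokens, w1, b1, w2, b2):
    """Apply the same FFN weights to every position, one token at a time."""
    return [ffn_token(tok, w1, b1, w2, b2) for tok in tokens]

# Toy sizes and random weights, chosen only for illustration.
rng = random.Random(0)
seq_len, d_model, d_ff = 4, 8, 16

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

w1, b1 = rand_matrix(d_model, d_ff), [0.0] * d_ff
w2, b2 = rand_matrix(d_ff, d_model), [0.0] * d_model
tokens = rand_matrix(seq_len, d_model)

outputs = ffn_sequence(tokens, w1, b1, w2, b2)

# Perturb only the first token: the other positions' outputs stay identical.
perturbed = [[x + 1.0 for x in tokens[0]]] + tokens[1:]
perturbed_outputs = ffn_sequence(perturbed, w1, b1, w2, b2)
others_unchanged = perturbed_outputs[1:] == outputs[1:]

# Reversing the token order simply reverses the outputs.
reversed_outputs = ffn_sequence(tokens[::-1], w1, b1, w2, b2)
order_equivariant = reversed_outputs == outputs[::-1]

print(others_unchanged, order_equivariant)
```

Both checks should hold because each output row depends only on its own input row. A self-attention sublayer would generally fail the first check, since every position attends to the others.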