NCA-GENL Practice Q17

A. To normalize the output of the attention layers by applying batch normalization after each self-attention computation step

B. To add positional information to the token embeddings by encoding absolute sequence positions into learned vector representations

C. To compute attention scores between tokens by calculating the scaled dot-product of query and key matrices across all positions

D. To apply position-wise transformations independently to each token representation

Correct answer: D. In the standard transformer block described by Vaswani et al. (2017), the position-wise feed-forward network is applied separately to each token representation after the attention sublayer, using the same two linear layers and nonlinearity at every position. Its purpose is to transform the features within each token vector independently. Cross-token interaction is handled by the self-attention sublayer, not by the feed-forward sublayer.
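The independence described above can be checked with a small sketch. It assumes ReLU as the nonlinearity, toy sizes (d_model=4, d_ff=8), random weights, and plain Python lists to avoid dependencies. The helper names (linear, ffn_token, ffn_sequence, init_params) are hypothetical and chosen for illustration only; this is not a reference implementation.

```python
import random


def linear(x, W, b):
    # Affine map for one token: x has d_in entries, W is d_in x d_out, b has d_out.
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]


def relu(v):
    return [max(0.0, a) for a in v]


def ffn_token(x, params):
    # Two linear layers with a nonlinearity in between, applied to ONE token.
    W1, b1, W2, b2 = params
    return linear(relu(linear(x, W1, b1)), W2, b2)


def ffn_sequence(X, params):
    # The same parameters are reused at every position; tokens never interact.
    return [ffn_token(x, params) for x in X]


def init_params(d_model, d_ff, seed=0):
    # Random toy weights (assumption: illustrative values, zero biases).
    rng = random.Random(seed)
    W1 = [[rng.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
    b1 = [0.0] * d_ff
    W2 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
    b2 = [0.0] * d_model
    return W1, b1, W2, b2


params = init_params(d_model=4, d_ff=8)

# Three token vectors, as they might look after the attention sublayer.
X = [[0.1, -0.2, 0.3, 0.4],
     [1.0, 0.0, -1.0, 0.5],
     [0.0, 0.0, 0.0, 0.0]]
Y = ffn_sequence(X, params)

# Replace only the middle token and run the network again.
X_mod = [X[0], [9.0, 9.0, 9.0, 9.0], X[2]]
Y_mod = ffn_sequence(X_mod, params)

# Outputs at positions 0 and 2 are untouched by the change at position 1.
print(Y_mod[0] == Y[0], Y_mod[2] == Y[2])
```

Because each output depends only on the token at the same position, reordering the input sequence simply reorders the outputs. A self-attention sublayer would not pass the first check, since every output there depends on all positions.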

Question 17

Explanation

Why each option is right or wrong