AIP-210 Practice Q17

A. By memorizing labeled examples from a training set

B. By interacting with an environment and maximizing cumulative reward signals

Reinforcement learning is defined by an agent–environment loop: the agent selects an action, observes the resulting state and reward, and updates its policy to maximize expected cumulative reward over time. The objective is not immediate accuracy on labeled examples but the long-run return, the (often discounted) sum of rewards across steps, which is the standard optimization criterion in RL formulations.

C. By computing gradients of cross-entropy loss on labeled classes

D. By clustering states based on similarity without using rewards
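
The agent-environment loop behind option B can be made concrete with a small sketch. The example below is illustrative only and is not part of the exam material: the five-state chain environment, the `step` and `train` helpers, and all hyperparameters (`alpha`, `gamma`, `epsilon`) are assumptions chosen for brevity, and tabular Q-learning is just one of several algorithms that fit the description. It shows an agent that sees no labels, only rewards, and still learns to move toward the rewarding state.

```python
import random


def step(state, action, n_states=5):
    """Hypothetical chain environment (an assumption for illustration).

    Action 1 moves right, action 0 moves left. Reaching the last state
    gives reward 1.0 and ends the episode; every other step gives 0.0.
    """
    if action == 1:
        next_state = min(state + 1, n_states - 1)
    else:
        next_state = max(state - 1, 0)
    done = next_state == n_states - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=200, n_states=5, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[state][action]
    returns = []  # undiscounted sum of rewards per episode

    for _ in range(episodes):
        state, total = 0, 0.0
        for _ in range(100):  # cap on steps per episode
            if rng.random() < epsilon:
                action = rng.randrange(2)  # explore
            else:
                # exploit: greedy action, ties broken at random
                best = max(q[state])
                action = rng.choice([a for a in (0, 1) if q[state][a] == best])

            next_state, reward, done = step(state, action, n_states)

            # Move the estimate toward reward plus discounted future value.
            future = 0.0 if done else gamma * max(q[next_state])
            q[state][action] += alpha * (reward + future - q[state][action])

            total += reward
            state = next_state
            if done:
                break
        returns.append(total)
    return q, returns


if __name__ == "__main__":
    q_table, episode_returns = train()
    print("Q-values per state:", q_table)
    print("Mean return, last 50 episodes:", sum(episode_returns[-50:]) / 50)
```

Nothing in the loop resembles options A, C, or D: there is no labeled training set, no cross-entropy loss, and no clustering of states. The only learning signal is the reward returned by `step`.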
