Question 13
Domain 4: Model Deployment
A team needs to serve predictions with low latency from a deployed machine learning model. Which deployment approach best fits this requirement?
Correct answer: B
Explanation
Realtime inference is used when a model must return predictions through an endpoint with low latency. To query the model, a client sends requests to the model endpoint. (Source: official.txt)
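To make "sending requests to the model endpoint" concrete, here is a minimal sketch in Python using only the standard library. The endpoint URL, the bearer token, and the {"inputs": [...]} payload shape are assumptions for illustration; the source does not name a serving platform, and each platform defines its own URL scheme, authentication, and request format.

```python
import json
import urllib.request


def build_prediction_request(endpoint_url, token, records):
    """Build a POST request that carries the input records as JSON.

    The {"inputs": ...} payload shape and bearer-token header are
    hypothetical; check the serving platform's documentation.
    """
    body = json.dumps({"inputs": records}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def query_endpoint(request, timeout_seconds=2.0):
    """Send the request and return the decoded JSON response.

    A short timeout reflects the low-latency expectation: a caller
    would rather fail fast than wait on a slow prediction.
    """
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return json.loads(response.read().decode("utf-8"))
```

A caller would build one request per prediction (or per small group of records) and pass it to query_endpoint, which returns the response as soon as the endpoint answers. That request-and-response pattern is what separates realtime inference from approaches that deliver results later.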
Why each option is right or wrong
A. Store predictions in advance and retrieve them later without calling an endpoint
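This approach skips the endpoint entirely, whereas realtime inference, as defined in the source material, serves predictions with low latency through an endpoint that is queried on request.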
B. Deploy the model to a realtime inference endpoint and query that endpoint for predictions
The source material defines this objective as deploying a model for realtime inference and querying it for predictions, with a focus on low-latency serving through an endpoint. Because the team needs low-latency predictions, a model endpoint queried directly for predictions is the deployment method that matches the stated requirement.
C. Export the model only for offline analysis and avoid exposing any query interface
Without a query interface the model cannot be called for predictions at all; low-latency serving requires a model endpoint that can be queried.
D. Use a deployment method designed only for delayed prediction results in batch workflows
The objective addresses realtime inference, not delayed batch-style prediction workflows.