Question 28

Domain 4

You downloaded a TensorFlow language model pre-trained on a proprietary dataset by another company, and you tuned the model with Vertex AI Training by replacing the last layer with a custom dense layer. The model achieves the expected offline accuracy; however, it exceeds the required online prediction latency by 20 ms. You want to optimize the model to reduce latency while minimizing the offline performance drop before deploying the model to production. What should you do?

A. Apply post-training quantization on the tuned model and serve the quantized model.
B. Use quantization-aware training to tune the pre-trained model on your dataset and serve the quantized model.
C. Use pruning to tune the pre-trained model on your dataset, and serve the pruned model after stripping it of training variables.
D. Use clustering to tune the pre-trained model on your dataset, and serve the clustered model after stripping it of training variables.

Explanation
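
As background for the options, here is a minimal sketch of the arithmetic behind post-training quantization: weights that are already trained are mapped from floats to 8-bit integers, with no further training. This is a toy illustration in plain Python, not TensorFlow's implementation. The function names `quantize_int8` and `dequantize` and the sample weights are hypothetical, and a real workflow would normally go through a converter tool rather than hand-written code.

```python
# Toy sketch of affine (asymmetric) int8 quantization of already-trained weights.
# Hypothetical helper names; this is NOT the TensorFlow or TensorFlow Lite code path.

def quantize_int8(weights):
    """Map float weights to int8 using a scale and zero point derived from their range."""
    lo = min(min(weights), 0.0)  # the range must include 0.0 so zero stays exact
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 if all weights are zero
    zero_point = round(-128 - lo / scale)
    quantized = [
        max(-128, min(127, round(w / scale) + zero_point))  # clamp to the int8 range
        for w in weights
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from the int8 representation."""
    return [(q - zero_point) * scale for q in quantized]


# Made-up weights standing in for one trained layer.
weights = [-0.82, -0.1, 0.0, 0.37, 1.24]
quantized, scale, zero_point = quantize_int8(weights)
restored = dequantize(quantized, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, zero_point, max_error)
```

The point of the sketch is that each weight shrinks from 32 bits to 8 bits and the only cost is a small rounding error bounded by the scale, which is why this family of techniques tends to trade a little accuracy for lower latency and model size.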

Why each option is right or wrong