Question 28
Domain 4
You downloaded a TensorFlow language model pre-trained on a proprietary dataset by another company, and you tuned the model with Vertex AI Training by replacing the last layer with a custom dense layer. The model achieves the expected offline accuracy; however, it exceeds the required online prediction latency by 20 ms. You want to optimize the model to reduce latency while minimizing the offline performance drop before deploying the model to production. What should you do?
Correct answer: A
Explanation
Post-training quantization converts a trained model's weights, and optionally its activations, to lower precision. This shrinks the model and reduces computation, which lowers inference latency. It is applied after training and needs no further tuning, so it suits a model that already meets its offline accuracy target: the accuracy loss is usually small, and online serving gets faster.
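To make the idea concrete, here is a minimal sketch of what "converting weights to lower precision" means, using symmetric per-tensor INT8 quantization in plain Python. It is an illustration only, not TensorFlow's actual implementation; the function names (quantize_int8, dequantize) and the sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale factor for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in quantized]


# Hypothetical weights from a trained layer (illustrative values only).
weights = [0.82, -1.27, 0.05, 0.40, -0.63]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage cost: 4 bytes per float32 weight versus 1 byte per int8 weight.
float32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights)

print("int8 codes:", q)
print("max rounding error:", max_error)
print("bytes:", float32_bytes, "->", int8_bytes)
```

The weights take a quarter of the space, and each one moves by at most half a quantization step. That small, bounded error is why the offline accuracy drop tends to be modest. No training data is involved at any point, which matters here because the original pre-training dataset is proprietary.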
Why each option is right or wrong
A. Apply post-training quantization on the tuned model and serve the quantized model.
Post-training quantization is a TensorFlow capability (available through the TensorFlow Lite converter and the TensorFlow Model Optimization Toolkit) rather than a Vertex AI feature. It is applied to an already-trained model and converts floating-point weights, and optionally activations, to lower precision (for example float16 or INT8), reducing compute and model size, which directly targets serving latency. Because the model already meets the offline accuracy target, this is the least disruptive option: it requires no retraining, and the accuracy drop is usually small. Options B, C, and D (quantization-aware training, pruning, and clustering) each require another round of training on the tuned model, which adds work and can shift the offline accuracy that has already been validated.
B. Use quantization-aware training to tune the pre-trained model on your dataset and serve the quantized model.
C. Use pruning to tune the pre-trained model on your dataset, and serve the pruned model after stripping it of training variables.
D. Use clustering to tune the pre-trained model on your dataset, and serve the clustered model after stripping it of training variables.