I'm thinking of deploying a TensorFlow model using Vertex AI in GCP. Since I'm going to use automatic scaling, I expect the cost to be directly related to the number of queries per second (QPS). I also know that the machine type (with GPU, TPU, etc.) will affect the cost.
- Do you have any estimation about the cost versus the number of queries per second?
- How does the type of virtual machine change this cost?
The model is for object detection.
CodePudding user response:
Autoscaling depends on CPU and GPU utilization, which directly correlates with QPS, as you said. To estimate cost as a function of QPS, you can deploy your custom prediction container directly to a Compute Engine instance, then benchmark it by making prediction calls at increasing rates until the VM hits about 90 percent CPU utilization (watch GPU utilization as well if one is attached). Repeat this for several machine types to determine the QPS each one sustains per dollar per hour. You can re-run these experiments while also measuring latency, to find the best cost per QPS that still meets your latency targets for your specific prediction container. For more information about choosing the ideal machine for your workload, refer to this documentation.
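A minimal load-testing sketch of that benchmarking loop is below. The endpoint URL, request payload, and concurrency levels are placeholders you would replace with your own; it assumes a prediction container that accepts JSON POSTs (as TF Serving-style containers typically do), and that you watch CPU/GPU utilization on the VM separately (e.g. with `top` or `nvidia-smi`).

```python
import concurrent.futures
import statistics
import time

import requests

# Hypothetical address of your custom prediction container on the
# Compute Engine VM; replace with your instance's IP and route.
ENDPOINT = "http://10.0.0.2:8080/predict"
# Placeholder body; replace with a representative request for your
# object-detection model (e.g. a base64-encoded test image).
SAMPLE_PAYLOAD = {"instances": [{"b64": "<base64-encoded image>"}]}

def one_call(_):
    """Send a single prediction request and return its latency in seconds."""
    start = time.perf_counter()
    resp = requests.post(ENDPOINT, json=SAMPLE_PAYLOAD, timeout=30)
    resp.raise_for_status()
    return time.perf_counter() - start

def benchmark(concurrency, total_requests=200):
    """Fire total_requests at the endpoint with a fixed concurrency level."""
    wall_start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(one_call, range(total_requests)))
    wall = time.perf_counter() - wall_start
    qps = total_requests / wall
    p50 = statistics.median(latencies) * 1000
    p95 = statistics.quantiles(latencies, n=20)[18] * 1000  # ~95th percentile
    print(f"concurrency={concurrency}: {qps:.1f} QPS, "
          f"p50={p50:.0f} ms, p95={p95:.0f} ms")

if __name__ == "__main__":
    # Ramp concurrency up while watching CPU/GPU utilization on the VM;
    # the QPS at ~90% utilization is that machine type's sustainable rate.
    for c in (1, 2, 4, 8, 16, 32):
        benchmark(c)
```

The QPS you record at the utilization ceiling for each machine type is the number to carry into the cost comparison below.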
For your second question: as per the Vertex AI pricing documentation for model deployment, cost is charged per node hour. A node hour is the time a virtual machine spends running your prediction job or sitting in a ready state waiting to handle prediction or explanation requests. Each machine type has a specific price per node hour that depends on its number of cores and amount of memory, so a VM with more resources costs more per node hour and vice versa. To choose an ideal VM for your deployment, follow the benchmarking steps above, which will help you find a good trade-off between cost and performance.
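Once you have per-node-hour prices and benchmarked QPS figures, the comparison is simple arithmetic. The sketch below uses made-up placeholder numbers, not real Vertex AI rates; substitute the per-node-hour prices from the pricing page and the QPS you measured yourself.

```python
# Compare machine types by cost per million predictions. All numbers here
# are illustrative placeholders, NOT actual Vertex AI prices or benchmarks.
machine_types = {
    # name:              (USD per node hour, benchmarked QPS per node)
    "n1-standard-4":     (0.22, 15.0),    # placeholder figures
    "n1-standard-8":     (0.44, 28.0),    # placeholder figures
    "n1-standard-4+T4":  (0.62, 120.0),   # placeholder figures
}

for name, (price_per_hour, qps) in machine_types.items():
    predictions_per_hour = qps * 3600
    cost_per_million = price_per_hour / predictions_per_hour * 1_000_000
    print(f"{name}: ${cost_per_million:.2f} per million predictions "
          f"at a sustained {qps:.0f} QPS")
```

Note that with a GPU-attached machine, a higher price per node hour can still yield a lower cost per prediction if the QPS gain is large enough, which is often the case for object-detection models.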