I am using App Engine to serve a bunch of sklearn models. These models are around 100 MB in size, and there are around 25 of them.
Downloading them can take up to 15 s at times, even though they are in the designated App Engine bucket, and this often dominates request times.
I currently use a FIFO cache layer wrapped around the GCS storage client, but cache hits aren't great because the different models are used in quite an interleaved way and App Engine memory is limited.
Memcache seems too small for this, and /tmp is also stored in RAM.
Is there a better solution for caching such files?
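For reference, a minimal sketch of the setup described above (bucket name, object paths and the cache size are hypothetical, and the real code may differ):

```python
# sklearn models live in the App Engine GCS bucket and are downloaded on
# demand, with a small FIFO cache held in instance memory.
import pickle
from collections import OrderedDict

from google.cloud import storage

BUCKET_NAME = "my-project.appspot.com"  # assumed default App Engine bucket
MAX_CACHED_MODELS = 3                   # limited by App Engine instance RAM

_client = storage.Client()
_cache = OrderedDict()  # model name -> unpickled sklearn estimator (FIFO)

def get_model(name: str):
    """Return a model, downloading it from GCS on a cache miss."""
    if name in _cache:
        return _cache[name]
    blob = _client.bucket(BUCKET_NAME).blob(f"models/{name}.pkl")
    model = pickle.loads(blob.download_as_bytes())  # ~100 MB, can take seconds
    _cache[name] = model
    if len(_cache) > MAX_CACHED_MODELS:
        _cache.popitem(last=False)  # evict the oldest entry (FIFO)
    return model
```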
CodePudding user response:
You can imagine several solutions to your issue.
- You can embed your models in your deployment. That way, the models are already there with the service. When a new model version is released, you deploy a new version of your App Engine service (a loading sketch follows this list).
- The problem with the previous solution is the deployment frequency: when one of the models is updated, you need to repackage and redeploy your App Engine service. The solution is microservices: you can have one model per App Engine service and therefore only deploy the one that has been updated. If you want a single entry point, you can have a 26th App Engine service which is your entry point and routes requests to the correct model service (a routing sketch also follows this list).
- You can also do the same thing with Cloud Run, where you manage the container packaging yourself and control the details if you need anything special. You also have more flexibility in the number of CPUs and the amount of memory.
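A minimal sketch of the "embed the models in the deployment" option, assuming the pickled models ship in a `models/` directory next to the application code (the path and cache size are hypothetical). Loading from the local filesystem replaces the slow GCS download, while a small cache still bounds instance RAM:

```python
import pickle
from functools import lru_cache
from pathlib import Path

MODEL_DIR = Path(__file__).parent / "models"  # models bundled with the app

@lru_cache(maxsize=3)  # keep only a few ~100 MB models in instance memory
def get_model(name: str):
    """Load a bundled model from local disk instead of downloading from GCS."""
    return pickle.loads((MODEL_DIR / f"{name}.pkl").read_bytes())
```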
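And a sketch of the entry-point service that routes requests to one App Engine service per model. The service names and URL scheme are illustrative; `SERVICE-dot-PROJECT.appspot.com` is the standard App Engine service routing pattern, but your domain may differ:

```python
import requests
from flask import Flask, request, jsonify

app = Flask(__name__)
PROJECT = "my-project"  # assumed GCP project id

@app.route("/predict/<model_name>", methods=["POST"])
def predict(model_name: str):
    # Forward the request to the dedicated service for this model,
    # e.g. the "model-churn" service when model_name == "churn".
    url = f"https://model-{model_name}-dot-{PROJECT}.appspot.com/predict"
    resp = requests.post(url, json=request.get_json())
    return jsonify(resp.json()), resp.status_code
```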
Last point: after solving the download issue, you could still face a cold-start issue, i.e. the time your server takes to start and load the model into memory (at the first request, when the instance starts). Cloud Run offers a min-instances feature to keep a certain number of instances warm and therefore eliminate cold starts.
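A minimal sketch of this on Cloud Run: load the model at module import time so it happens once per instance, then keep a warm instance around with `gcloud run deploy ... --min-instances=1` (the model path and Flask usage are illustrative):

```python
import pickle
from pathlib import Path

from flask import Flask, request, jsonify

# Loaded once at instance startup, not per request; a warm instance keeps
# this in memory so the first user request does not pay the load time.
MODEL = pickle.loads(Path("model.pkl").read_bytes())

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    return jsonify(prediction=MODEL.predict(features).tolist())
```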