I need to measure neural network inference times for a project, and I want to present my results in line with the standard practices used in academic papers for measuring inference time.
So far I have figured out that the GPU should be warmed up with a few inferences before timing, and that I should use the timing facilities provided by torch (rather than Python's time.time()).
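For context, here is roughly the timing setup I have so far. It's a minimal sketch: the torchvision resnet18 and the dummy input are just placeholders for my real model and data.

    import torch
    import torchvision.models as models

    # Placeholder model and input -- my real project uses a different network.
    device = torch.device("cuda")
    model = models.resnet18().to(device).eval()
    dummy_input = torch.randn(1, 3, 224, 224, device=device)

    # Warm up the GPU so the first (slower) calls don't skew the timing.
    with torch.no_grad():
        for _ in range(10):
            _ = model(dummy_input)

    # Time with CUDA events instead of time.time(), since GPU work is asynchronous.
    starter = torch.cuda.Event(enable_timing=True)
    ender = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        starter.record()
        _ = model(dummy_input)
        ender.record()
    torch.cuda.synchronize()                 # wait for the GPU to finish before reading the timer
    elapsed_ms = starter.elapsed_time(ender) # milliseconds
    print(f"single inference: {elapsed_ms:.3f} ms")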
My questions are as follows:
- Is it standard to time with a batch size of 1, or with the best batch size for that hardware?
- Should I time only the neural network inference itself, or also the transfer of data to the GPU and the data transformations that precede inference?
- How many iterations would be reasonable to time to get a good average inference time?
Any advice would be greatly appreciated. Thank you.
CodePudding user response:
If you're concerned with inference time, batch size is something you should optimize for in the first place. Not all operations in a network are affected the same way by a change in batch size (you could see no change thanks to parallelization, or a linear change if all kernels are already busy, for instance). If you need to compare models, I'd optimize the batch size per model. If you don't want to do that, I'd use the train-time batch size. It's unlikely that in production you'd run with a batch size of 1, unless larger batches don't fit in memory.
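As a rough illustration (again with a placeholder torchvision model and dummy data, and batch sizes picked arbitrarily), you can measure per-sample latency at a few batch sizes and see how much parallelization buys you on your hardware:

    import torch
    import torchvision.models as models

    device = torch.device("cuda")
    model = models.resnet18().to(device).eval()
    starter = torch.cuda.Event(enable_timing=True)
    ender = torch.cuda.Event(enable_timing=True)

    for batch_size in (1, 8, 32):  # e.g. 1 vs. your train-time batch size
        x = torch.randn(batch_size, 3, 224, 224, device=device)
        with torch.no_grad():
            for _ in range(10):    # warm-up at this batch size
                _ = model(x)
            starter.record()
            for _ in range(50):    # timed repetitions
                _ = model(x)
            ender.record()
        torch.cuda.synchronize()
        per_batch_ms = starter.elapsed_time(ender) / 50
        print(f"batch={batch_size}: {per_batch_ms:.2f} ms/batch, "
              f"{per_batch_ms / batch_size:.2f} ms/sample")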
You should time both. If you're comparing models, data loading and transforms should not affect your decision, but in a production environment they will matter a lot, so report both numbers: in some settings, scaling up the data loading may be easier than scaling up the model, or vice versa.
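A sketch of reporting both numbers could look like the following. The resize/to-tensor transform and the blank PIL image are just stand-ins for whatever preprocessing and inputs you actually use; note that for the end-to-end measurement (which includes CPU work) a wall-clock timer with explicit synchronization is used, while the model-only measurement uses CUDA events.

    import time
    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    device = torch.device("cuda")
    model = models.resnet18().to(device).eval()
    transform = T.Compose([T.Resize((224, 224)), T.ToTensor()])
    image = Image.new("RGB", (640, 480))  # stand-in for a real input image

    # Warm-up so first-call overhead doesn't pollute either measurement.
    with torch.no_grad():
        for _ in range(10):
            _ = model(torch.randn(1, 3, 224, 224, device=device))

    # End-to-end: preprocessing + host-to-device copy + forward pass.
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    x = transform(image).unsqueeze(0).to(device)
    with torch.no_grad():
        _ = model(x)
    torch.cuda.synchronize()
    end_to_end_ms = (time.perf_counter() - t0) * 1000

    # Model-only: input already on the GPU, timed with CUDA events.
    starter = torch.cuda.Event(enable_timing=True)
    ender = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        starter.record()
        _ = model(x)
        ender.record()
    torch.cuda.synchronize()
    model_only_ms = starter.elapsed_time(ender)

    print(f"end-to-end: {end_to_end_ms:.2f} ms, model-only: {model_only_ms:.2f} ms")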
I would say around 100. It's just a rule of thumb, but you want your numbers to be statistically meaningful. You can also report the standard deviation (std) in addition to the average, or even plot the distribution (percentiles, histograms, etc.).
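For example, collecting per-iteration timings and summarizing them might look like this (placeholder model and input again; the number of repetitions and the reported percentiles are arbitrary choices):

    import numpy as np
    import torch
    import torchvision.models as models

    device = torch.device("cuda")
    model = models.resnet18().to(device).eval()
    x = torch.randn(1, 3, 224, 224, device=device)
    starter = torch.cuda.Event(enable_timing=True)
    ender = torch.cuda.Event(enable_timing=True)

    timings = []
    with torch.no_grad():
        for _ in range(10):            # warm-up
            _ = model(x)
        for _ in range(100):           # timed repetitions
            starter.record()
            _ = model(x)
            ender.record()
            torch.cuda.synchronize()   # wait so elapsed_time() is valid
            timings.append(starter.elapsed_time(ender))

    timings = np.array(timings)
    print(f"mean {timings.mean():.2f} ms, std {timings.std():.2f} ms, "
          f"p50 {np.percentile(timings, 50):.2f} ms, "
          f"p95 {np.percentile(timings, 95):.2f} ms")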
You can also compare the loss in model performance against the gain in inference time when using half-precision float types for your data and model weights.
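A minimal sketch of that comparison (placeholder model and input; the accuracy side of the trade-off would have to be re-measured on your own validation set):

    import torch
    import torchvision.models as models

    device = torch.device("cuda")
    starter = torch.cuda.Event(enable_timing=True)
    ender = torch.cuda.Event(enable_timing=True)

    def mean_latency_ms(model, x, reps=100):
        """Average forward-pass latency in milliseconds."""
        with torch.no_grad():
            for _ in range(10):        # warm-up
                _ = model(x)
            starter.record()
            for _ in range(reps):
                _ = model(x)
            ender.record()
        torch.cuda.synchronize()
        return starter.elapsed_time(ender) / reps

    model_fp32 = models.resnet18().to(device).eval()
    x_fp32 = torch.randn(1, 3, 224, 224, device=device)

    model_fp16 = models.resnet18().to(device).eval().half()
    x_fp16 = x_fp32.half()

    print(f"fp32: {mean_latency_ms(model_fp32, x_fp32):.2f} ms")
    print(f"fp16: {mean_latency_ms(model_fp16, x_fp16):.2f} ms")
    # Then re-evaluate accuracy on your validation set to quantify the performance loss.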