I'm using onnxruntime in Node.js to execute ONNX-converted models on the CPU backend. I run model inference in parallel using Promise.allSettled:
var promises = sequences.map(seq => self.inference(self.session, self.tokenizer, seq));
results = (await Promise.allSettled(promises)).filter(p => p.status === "fulfilled").map(p => p.value);
running this class instance method inference, which calls the static helper Util.performance.now:
ONNX.prototype.inference = async function (session, tokenizer, text) {
  const default_labels = this._options.model.default_labels;
  const labels = this._options.model.labels;
  const debug = this._options.debug;
  try {
    const encoded_ids = await tokenizer.tokenize(text);
    if (encoded_ids.length === 0) {
      return [0.0, default_labels];
    }
    const model_input = ONNX.create_model_input(encoded_ids);
    const start = Util.performance.now();
    const output = await session.run(model_input, ['output_0']);
    const duration = Util.performance.now(start).toFixed(1);
    const sequence_length = model_input['input_ids'].size;
    if (debug) console.log("latency = " + duration + "ms, sequence_length=" + sequence_length);
    // logits -> sigmoid -> integer percentages
    const probs = output['output_0'].data.map(ONNX.sigmoid).map(t => Math.floor(t * 100));
    const result = [];
    for (var i = 0; i < labels.length; i++) {
      const t = [labels[i], probs[i]];
      result[i] = t;
    }
    result.sort(ONNX.sortResult);
    // keep only the top 6 [label, probability] pairs
    const result_list = [];
    for (i = 0; i < 6; i++) {
      result_list[i] = result[i];
    }
    return [parseFloat(duration), result_list];
  } catch (e) {
    return [0.0, default_labels];
  }
}//inference
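(The helpers ONNX.sigmoid and ONNX.sortResult are not shown above; a minimal sketch of what they plausibly look like, an element-wise sigmoid and a sort by descending probability, not necessarily the actual implementations:)

// Hypothetical sketches of the helpers referenced in inference.
ONNX.sigmoid = function (x) {
  return 1 / (1 + Math.exp(-x));
};
// Sort [label, probability] pairs by descending probability.
ONNX.sortResult = function (a, b) {
  return b[1] - a[1];
};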
the timing is wrong: the durations look as if they were summed up. The performance object looks like
Util = {
  performance: {
    now: function (start) {
      if (!start) {
        return process.hrtime();
      }
      var end = process.hrtime(start);
      // convert [seconds, nanoseconds] to milliseconds
      return Math.round((end[0] * 1000) + (end[1] / 1000000));
    }
  }
}
and it is used in the usual way:
// this runs in parallel
const start = Util.performance.now();
// computation
const duration = Util.performance.now(start).toFixed(1);
Now, within the performance function the start and end variables are locally scoped, so what happens when using Promise.allSettled? I would expect the timing to be correct because of the local scope.
CodePudding user response:
The timing mechanics are correct, but when session.run is called, it initiates some asynchronous (non-JavaScript) work while immediately returning a pending promise. This allows other executions of inference to call session.run as well, leading to a state where the asynchronous API is handling several such requests concurrently, so the same slices of time are counted in multiple execution contexts of inference. These requests may even complete at moments that are quite close together. When that happens, the executions of inference resume one after the other with the code that ends the timing (setting their duration variables). It is clear that these durations can and probably will overlap each other.
To visualise it for 3 executions of inference, you could have this:
start---------------------------------end  duration
  start---------------------------------end  duration
    start----------------------------------end  duration
time ----->
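To see the same effect outside of onnxruntime, here is a minimal sketch (the 500 ms timer is just a stand-in for session.run) where three concurrent calls each report roughly the full batch time:

// Simulate an async, non-JavaScript workload (stand-in for session.run).
const fakeRun = () => new Promise(resolve => setTimeout(resolve, 500));

async function timedCall(id) {
  const start = process.hrtime.bigint();
  await fakeRun();                          // all three awaits overlap in time
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`call ${id}: ${ms.toFixed(1)} ms`);
}

(async () => {
  // Started almost simultaneously: each call measures ~500 ms,
  // even though the whole batch also takes only ~500 ms.
  await Promise.allSettled([timedCall(1), timedCall(2), timedCall(3)]);
})();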
If you don't want this parallelism, you should not make all the calls to inference at almost the same time, but wait with each next execution until the previous one has resolved:
for (let seq of sequences) {
  let value = await self.inference(self.session, self.tokenizer, seq);
  // ...
}
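For example, to build the same results array as the Promise.allSettled version, a sketch using the names from the question (inference already catches its own errors, so no extra rejection handling is needed here):

// Sequential variant: one inference at a time, same shape of results.
const results = [];
for (const seq of sequences) {
  results.push(await self.inference(self.session, self.tokenizer, seq));
}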
This way the executions of the asynchronous part of session.run will not overlap, and will not suffer a performance penalty from that concurrency. I would expect timings to be more like this:
start---------------end duration
                       start-------------end duration
                                            start------------end duration
time ----->
Now timeslices will not be counted more than once, even though the total duration of the whole process will likely be longer.
Note that the behaviour has nothing to do with Promise.allSettled; you'll get those outputs anyway, even if you remove the Promise.allSettled call from your program.
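If you want to keep the parallel calls and still get a meaningful number, one option (a sketch, not something the question's code already does) is to time the whole batch around Promise.allSettled instead of each individual call:

// Measure the wall-clock time of the whole parallel batch.
const batchStart = process.hrtime.bigint();
const settled = await Promise.allSettled(
  sequences.map(seq => self.inference(self.session, self.tokenizer, seq))
);
const batchMs = Number(process.hrtime.bigint() - batchStart) / 1e6;
console.log(`batch of ${sequences.length} sequences took ${batchMs.toFixed(1)} ms`);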