Edit: sorry, I will update with the latest code later; assume tmpRaw is already generated. In any case it runs out of memory in the same way. I suspect this is unfeasible and there is nothing to be done about it, since any compression would have to happen inside the tf-idf algorithm itself, because the whole dataset ends up referenced there anyway.
By "a lot of small arrays" I mean I'm not storing one array of arrays but truly independent references, because I'm feeding them to an ML algorithm (tf-idf).
My data is of this sort:
[[1115, 585], [69], [5584, 59, 99], ...]
In practice I don't have the full data in memory, only a generator that yields one sub-array at a time.
I haven't done the complexity and memory calculations, but I'm not far from achieving the goal on a local machine (I get through about 1/3 of the data before running out of memory).
I tried the following:
[Int32Array.from([1115, 585]), ...]
but it didn't help. Maybe typed arrays are not the best choice when you have many small arrays, presumably because each one carries its own object and ArrayBuffer overhead.
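One thing I haven't tried would be to pack everything into a single flat Int32Array plus a separate offsets array, instead of one typed array per document. A hypothetical sketch (it assumes the total counts are known, e.g. from a first pass over the generator, and it wouldn't plug directly into the tf-idf implementation):

    // Hypothetical: one flat buffer for all values, plus an offsets index,
    // so there is a single ArrayBuffer instead of one per small array.
    function packDocs(docsIterable, totalValues, totalDocs) {
      const flat = new Int32Array(totalValues)       // all ids back to back
      const offsets = new Uint32Array(totalDocs + 1) // doc i spans offsets[i]..offsets[i+1]
      let pos = 0, d = 0
      for (const doc of docsIterable) {
        offsets[d++] = pos
        for (const v of doc) flat[pos++] = v
      }
      offsets[d] = pos
      return { flat, offsets }
    }
    // Document i can then be read back as a view, without copying:
    // const docI = flat.subarray(offsets[i], offsets[i + 1])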
Processing time doesn't matter a lot, as the algorithm runs pretty fast for now.
There might be a lot of repeated values, so I could build a dictionary, but unfortunately the new ids could collide with the real values, for example:
1115 -> 1, 585 -> 2
but 1 and 2 might already be used as real values fed to the tf-idf algorithm.
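To illustrate the kind of dictionary I mean (hypothetical sketch; the remapping only stays collision-free if every value fed to tf-idf goes through the same Map, otherwise the clash above happens):

    // Hypothetical sketch: a compacting dictionary; safe only if *every* value
    // fed to the tf-idf algorithm is passed through compactId, so the compact
    // ids form their own id space and cannot clash with raw values.
    const remap = new Map()
    let nextId = 1
    function compactId(value) {
      let id = remap.get(value)
      if (id === undefined) {
        id = nextId++
        remap.set(value, id)
      }
      return id
    }
    // e.g. [1115, 585, 1] -> [1, 2, 3]: the raw value 1 gets its own new id (3),
    // so it never collides with the compact id 1 assigned to 1115.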
Finally I'm running on Node.
Edit
let co = 0 // assumed: running document id counter (declared outside this snippet in my real code)
for (let index = 0; index < tmpRaw.length; index++) {
  const nex = tmpRaw[index]
  // nex is a tokenized document, e.g. ["hello", "world"]
  try {
    co++
    // indexStrGetter turns each word into its integer id using a prior dictionary,
    // so "hello world" becomes [848, 8993]; falsy ids (unknown words) are filtered out
    let doc = { id: co, body: nex.map(indexStrGetter).filter(Boolean) }
    // or the following, which behaves nearly the same (memory gets exhausted either way):
    // doc = { id: co, body: Int32Array.from(nex.map(indexStrGetter)).filter(num => num > 0) }
    if (!doc.body.length) {
      co--
      continue
    }
    // bm is this BM tf-idf implementation, modified a little:
    // https://gist.github.com/alixaxel/4858240e0ca20802af43
    bm.addDocument(doc)
  } catch (error) {
    console.log(error)
    console.log(nex)
  }
}
CodePudding user response:
If at any given moment you need to iterate over the outer array, I think the best way to improve memory usage is to generate each sub-array inside the loop, so that when each iteration ends the variable goes out of scope, can be garbage-collected, and a lot of memory is saved.
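Something along these lines (just a sketch; docGenerator, indexStrGetter and bm stand in for your own code):

    // Sketch: consume the generator directly instead of materializing tmpRaw,
    // so each sub-array can be garbage-collected once its iteration ends.
    let co = 0
    for (const nex of docGenerator()) {            // one tokenized document at a time
      const body = nex.map(indexStrGetter).filter(Boolean)
      if (!body.length) continue
      bm.addDocument({ id: ++co, body })
      // nex and body go out of scope here; they can only be reclaimed if
      // bm.addDocument does not keep a reference to them
    }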
CodePudding user response:
Having thought about it, there is not much to be done to reclaim memory, since the data simply needs that minimum amount.
All I can do is build sub-models (tf-idf) from chunks of the data, at the cost of a bit of learning accuracy, roughly as sketched below.
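Roughly what I have in mind (hypothetical sketch; BM25 stands for the modified gist implementation, and saveModel for whatever persistence I end up using):

    // Hypothetical sketch: train one sub-model per chunk so only one chunk's
    // worth of documents is held in memory at a time.
    const chunkSize = 50000      // arbitrary; whatever fits comfortably in memory
    let bm = new BM25()
    let co = 0
    for (const nex of docGenerator()) {
      const body = nex.map(indexStrGetter).filter(Boolean)
      if (!body.length) continue
      bm.addDocument({ id: ++co, body })
      if (co % chunkSize === 0) {
        saveModel(bm)            // persist the finished sub-model, e.g. to disk
        bm = new BM25()          // start a fresh model for the next chunk
      }
    }
    saveModel(bm)
    // Queries then run against each sub-model and the scores get merged,
    // which is where the loss of learning accuracy comes from.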
If Node really does store identical strings efficiently within a single process, as shown here: https://levelup.gitconnected.com/bytefish-vs-new-string-bytefish-what-is-the-difference-a795f6a7a08b
then the last optimization would be to stop converting words to numbers as I've been doing.