Edit: sorry, I will update with the latest code later; assume tmpRaw is already generated. In any case it runs out of memory in the same way. I suspect this is unfeasible and there is nothing to be done about it, since any compression would have to happen inside the tf-idf algorithm itself, because the whole dataset ends up referenced there anyway.
By "a lot of small arrays" I mean I'm not storing one array of arrays but truly independent references, because I'm feeding them to an ML algorithm (tf-idf).
My data is of this sort:
[[1115, 585], [69], [5584, 59, 99], ...]
In practice I don't have the full data in memory, only a generator that yields one sub-array at a time.
I haven't done the complexity and memory calculations, but I'm not far from achieving the goal on a local machine (I get through about 1/3 of the data before running out of memory).
I tried the following:
[Int32Array.from([1115, 585]), ...]
but it didn't help. Maybe typed arrays are not the best choice when you have many small arrays, presumably because each one carries its own object and ArrayBuffer overhead.
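One thing I haven't tried would be to pack everything into a single flat Int32Array plus a separate offsets array, instead of one typed array per document. A hypothetical sketch (it assumes the total counts are known, e.g. from a first pass over the generator, and it wouldn't plug directly into the tf-idf implementation):

    // Hypothetical: one flat buffer for all values, plus an offsets index,
    // so there is a single ArrayBuffer instead of one per small array.
    function packDocs(docsIterable, totalValues, totalDocs) {
      const flat = new Int32Array(totalValues)       // all ids back to back
      const offsets = new Uint32Array(totalDocs + 1) // doc i spans offsets[i]..offsets[i+1]
      let pos = 0, d = 0
      for (const doc of docsIterable) {
        offsets[d++] = pos
        for (const v of doc) flat[pos++] = v
      }
      offsets[d] = pos
      return { flat, offsets }
    }
    // Document i can then be read back as a view, without copying:
    // const docI = flat.subarray(offsets[i], offsets[i + 1])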
Processing time doesn't matter a lot, as the algorithm runs pretty fast for now.
There might be a lot of repeated values, so I could build a dictionary, but unfortunately the new ids could collide with the real values, for example:
1115 -> 1, 585 -> 2
but 1 and 2 might already be used as real values fed to the tf-idf algorithm.
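To illustrate the kind of dictionary I mean (hypothetical sketch; the remapping only stays collision-free if every value fed to tf-idf goes through the same Map, otherwise the clash above happens):

    // Hypothetical sketch: a compacting dictionary; safe only if *every* value
    // fed to the tf-idf algorithm is passed through compactId, so the compact
    // ids form their own id space and cannot clash with raw values.
    const remap = new Map()
    let nextId = 1
    function compactId(value) {
      let id = remap.get(value)
      if (id === undefined) {
        id = nextId++
        remap.set(value, id)
      }
      return id
    }
    // e.g. [1115, 585, 1] -> [1, 2, 3]: the raw value 1 gets its own new id (3),
    // so it never collides with the compact id 1 assigned to 1115.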
Finally I'm running on Node.
Edit
let co = 0 // assumed: running document id counter (declared outside this snippet in my real code)
for (let index = 0; index < tmpRaw.length; index++) {
  const nex = tmpRaw[index]
  // nex is a tokenized document, e.g. ["hello", "world"]
  try {
    co++
    // indexStrGetter turns each word into its integer id using a prior dictionary,
    // so "hello world" becomes [848, 8993]; falsy ids (unknown words) are filtered out
    let doc = { id: co, body: nex.map(indexStrGetter).filter(Boolean) }
    // or the following, which behaves nearly the same (memory gets exhausted either way):
    // doc = { id: co, body: Int32Array.from(nex.map(indexStrGetter)).filter(num => num > 0) }
    if (!doc.body.length) {
      co--
      continue
    }
    // bm is this BM tf-idf implementation, modified a little:
    // https://gist.github.com/alixaxel/4858240e0ca20802af43
    bm.addDocument(doc)
  } catch (error) {
    console.log(error)
    console.log(nex)
  }
}
CodePudding user response:
If at any given moment you need to iterate over the outer array, I think the best way to improve memory usage is to generate each sub-array inside the loop, so that when each iteration ends the variable goes out of scope, can be garbage-collected, and a lot of memory is saved.
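Something along these lines (just a sketch; docGenerator, indexStrGetter and bm stand in for your own code):

    // Sketch: consume the generator directly instead of materializing tmpRaw,
    // so each sub-array can be garbage-collected once its iteration ends.
    let co = 0
    for (const nex of docGenerator()) {            // one tokenized document at a time
      const body = nex.map(indexStrGetter).filter(Boolean)
      if (!body.length) continue
      bm.addDocument({ id: ++co, body })
      // nex and body go out of scope here; they can only be reclaimed if
      // bm.addDocument does not keep a reference to them
    }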
CodePudding user response:
Having thought about it, there is not much to be done to reclaim memory, since the data simply needs that minimum amount.
All I can do is build sub-models (tf-idf) from chunks of the data, at the cost of a bit of learning accuracy, roughly as sketched below.
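Roughly what I have in mind (hypothetical sketch; BM25 stands for the modified gist implementation, and saveModel for whatever persistence I end up using):

    // Hypothetical sketch: train one sub-model per chunk so only one chunk's
    // worth of documents is held in memory at a time.
    const chunkSize = 50000      // arbitrary; whatever fits comfortably in memory
    let bm = new BM25()
    let co = 0
    for (const nex of docGenerator()) {
      const body = nex.map(indexStrGetter).filter(Boolean)
      if (!body.length) continue
      bm.addDocument({ id: ++co, body })
      if (co % chunkSize === 0) {
        saveModel(bm)            // persist the finished sub-model, e.g. to disk
        bm = new BM25()          // start a fresh model for the next chunk
      }
    }
    saveModel(bm)
    // Queries then run against each sub-model and the scores get merged,
    // which is where the loss of learning accuracy comes from.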
If Node really does store identical strings efficiently within a single process, as shown here: https://levelup.gitconnected.com/bytefish-vs-new-string-bytefish-what-is-the-difference-a795f6a7a08b
then the last optimization would be to stop converting words to numbers as I've been doing.