I have a dataset of over 1 million rows. Each row has 40 word tokens, and a neural network makes a classification based on these tokens. The vocabulary is 20,000 unique words. I set the size (dimension) of the vectors in gensim Word2Vec to 150 and saved these vectors for each data point in a JSON file. The JSON file is really huge: 250 GB. I cannot load this file into memory all at once as my RAM is only 128 GB. I am trying to see if I can reduce the physical size of these vectors by choosing the right dimension. I went through some of the suggestions on this site, such as "Relation between Word2Vec vector size and total number of words scanned?", but the vector size there is only said to be 100-300 and to depend on the problem.
- How can I determine the right size (dimension) of the vector for each token based on the given problem?
- I use the skip-gram model to train Word2Vec. Should I use CBOW to optimize the size?
- Is JSON the right format to store/retrieve these vectors, or are there other, more efficient ways?
CodePudding user response:
Are your word embeddings for each unique token static or context-dependent?
If they are static (always the same for the exact same word), it would make sense to get all unique tokens once and assign each token an integer ID (mappings id2string and string2id). There are also Gensim functions to do that. Then get the vector of length 150 for each unique string token and put it in something like a dictionary (a mapping id2vector or string2vector).
Then you just have to save the IDs for each row (assuming it is fine for you to discard all out-of-vocabulary tokens) as well as the mapping dict. So instead of saving ~40m strings (the texts) and ~6b floats (40m tokens * 150 vector size), you only have to save ~20k strings, ~3m floats (20k * 150) and ~40m ints, which should take drastically less space.
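For example, a minimal sketch of that idea (assuming gensim 4.x, an already-trained model in a variable w2v_model, and your tokenized rows in a list of lists called rows; those names are just placeholders):

```python
import numpy as np

kv = w2v_model.wv                    # just the word-vectors of the trained model

string2id = kv.key_to_index          # token -> integer ID (dict of ~20k entries)
id2vector = kv.vectors               # float32 array of shape (20000, 150)

# Encode each row as up to 40 small integers instead of 40 * 150 floats,
# silently dropping out-of-vocabulary tokens.
encoded_rows = [
    np.array([string2id[t] for t in row if t in string2id], dtype=np.int32)
    for row in rows
]

# Rebuilding the vectors for one row later is a cheap lookup:
row_vectors = id2vector[encoded_rows[0]]   # shape (~40, 150)
```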
As for the dimension of the vector: I would expect the performance to be better with larger vectors (e.g. 300 is probably better than 100, if you have enough training data to train such detailed embeddings), so if you are looking for performance instead of optimizing efficiency I'd try to keep the vector size not too small.
CodePudding user response:
Each dimension of a dense vector is typically a 32-bit float.
So, storing 20,000 token-vectors of 150 dimensions each will take at least 20000 vectors * 150 floats * 4 bytes/float = 12MB for the raw vector weights, plus some overhead for remembering which token associates with which line.
Let's say you were somehow changing each of your rows into a single summary vector of the same size. (Perhaps, by averaging together each of the ~40 token vectors into a single vector – a simple baseline approach, though there are many limitations of that approach, & other techniques that might be used.) In that case, storing the 1 million vectors will necessarily take about 1000000 vectors * 150 floats * 4 bytes/float = 600MB for the raw vector weights, plus some overhead to remember which row associates with which vector.
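As a rough illustration of that averaging baseline (a sketch only, reusing the encoded_rows and kv names from the earlier sketch, which are assumptions rather than anything in your code):

```python
import numpy as np

# kv.vectors has shape (20000, 150); encoded_rows holds int32 token IDs per row.
row_embeddings = np.zeros((len(encoded_rows), kv.vector_size), dtype=np.float32)
for i, ids in enumerate(encoded_rows):
    if len(ids):
        # average the ~40 token vectors into one 150-dim summary vector
        row_embeddings[i] = kv.vectors[ids].mean(axis=0)

print(row_embeddings.nbytes / 1e6)   # ~600 MB for 1M rows * 150 dims * 4 bytes
```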
That neither of these is anywhere close to 250GB implies you're making some other choices expanding things significantly. JSON is a poor choice for compactly representing dense floating-point numerical data, but even that is unlikely to explain the full expansion.
Your description that you "saved these vectors for each data point in a json file" doesn't make clear exactly which vectors are being saved, or with what JSON conventions.
Perhaps you're storing the 40 separate vectors for each row? That'd give a raw baseline weights-only size of 1000000 rows * 40 tokens/row * 150 floats * 4 bytes/float = 24GB. It is plausible that inefficient JSON expands the stored size by something like 10x, so I guess you're doing something like this.
But in addition to the inefficiency of JSON, there's not really any reason to expand the representations this way: the 40 tokens per row (from a vocabulary of 20k) already give enough info to reconstitute any other per-row info that's solely a function of the tokens & the 20k word-vectors.
For example: if the word 'apple' is already in your dataset, and appears many thousands of times, there's no reason to re-write the 150 dimensions of 'apple' many thousands of times. The word 'apple' alone is enough to call back those 150 dimensions, whenever you want them, from the much smaller (12MB) set of 20k token-vectors that's easy to keep in RAM.
So mainly: ditch JSON, and don't (unnecessarily) expand each row into 40 * 150 dimensions.
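For instance, one compact on-disk layout (just a sketch, again assuming the hypothetical encoded_rows and kv from above) is a pair of plain NumPy binary files, which can also be memory-mapped later:

```python
import numpy as np

# Pad each row to exactly 40 IDs so everything fits one rectangular array;
# -1 marks padding / dropped out-of-vocabulary positions (an arbitrary convention).
id_matrix = np.full((len(encoded_rows), 40), -1, dtype=np.int32)
for i, ids in enumerate(encoded_rows):
    id_matrix[i, :len(ids)] = ids[:40]

np.save("row_token_ids.npy", id_matrix)    # ~160 MB (1M rows * 40 ints * 4 bytes)
np.save("token_vectors.npy", kv.vectors)   # ~12 MB (20k * 150 floats * 4 bytes)

# Later: memory-map the row IDs so nothing huge is loaded up-front.
ids_on_disk = np.load("row_token_ids.npy", mmap_mode="r")
vectors = np.load("token_vectors.npy")
first_row_ids = ids_on_disk[0]
first_row_vectors = vectors[first_row_ids[first_row_ids >= 0]]   # (~40, 150)
```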
To your specific questions:
- The optimal size will vary based on lots of things, including your data & other goals. The only way to rationally choose is to figure out some way to score the trained vectors on your true end goals: some repeatable way of comparing multiple alternate choices of vector_size. Then you run it every plausible way & pick the best. (Barring that, you take a random stab based on some precedent work that seems roughly similar in data/goals, and hope that works OK until you have the chance to compare it against other choices.) A sketch of such a comparison loop follows this list.
- The choice of skip-gram or CBOW won't affect the size of the model at all. It might affect end-result quality & training times a bit, but the only way to choose is to try both & see which works better for your goals & constraints.
- JSON is an awful choice for storing dense binary data. Representing numbers as plain text involves expansion. The JSON formatting characters add extra overhead, repeated on every row, that's redundant if every row is the exact same 'shape' of raw data. And typical later vector operations in RAM usually work best on the same sort of maximally-compact raw in-memory representation that would also be best on disk. (In fact, the best on-disk representation will often be data that exactly matches the format in memory, so that data can be "memory-mapped" from disk to RAM in a quick direct operation that minimizes format-wrangling & even defers accesses until reads are needed.)
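Here's a rough sketch of that kind of vector_size comparison, assuming corpus is an iterable of your tokenized rows and score_on_my_task() is a hypothetical stand-in for however you evaluate your downstream classifier (neither is a gensim-provided name):

```python
from gensim.models import Word2Vec

candidate_sizes = [50, 100, 150, 200, 300]
results = {}

for size in candidate_sizes:
    model = Word2Vec(
        sentences=corpus,   # your 1M tokenized rows
        vector_size=size,
        sg=1,               # 1 = skip-gram, 0 = CBOW; try both if curious
        workers=4,
        epochs=5,
    )
    # e.g. cross-validated accuracy of your downstream classifier on these vectors
    results[size] = score_on_my_task(model.wv)

best_size = max(results, key=results.get)
print(results, best_size)
```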
Gensim will efficiently save its models via their built-in .save() method, into one (or more often several) related files on disk. If your Gensim Word2Vec model is in the variable w2v_model, you can just save the whole model with w2v_model.save(YOUR_FILENAME), and later reload it with Word2Vec.load(YOUR_FILENAME).
But if after training you only need the (20k) word-vectors, you can just save the w2v_model.wv property (just the vectors) with w2v_model.wv.save(YOUR_FILENAME). Then you can reload them as an instance of KeyedVectors: KeyedVectors.load(YOUR_FILENAME).
(Note in all cases: the save may be spread over multiple files, which if ever copied/moved elsewhere, should be kept together - even though you only ever specify the 'root' file of each set in save/load operations.)
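Putting those calls together (a minimal sketch; the filenames are arbitrary and w2v_model is your already-trained model):

```python
from gensim.models import Word2Vec, KeyedVectors

# Save & reload the full model (only needed if you might train it further).
w2v_model.save("w2v_full.model")
w2v_model = Word2Vec.load("w2v_full.model")

# Or keep just the finished word-vectors, which is smaller & usually enough.
w2v_model.wv.save("w2v_vectors.kv")
kv = KeyedVectors.load("w2v_vectors.kv")

print(kv["apple"].shape)   # (150,), assuming 'apple' is in the 20k vocabulary
```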
How & whether you'd want to store any vectorization of your 1 million rows would depend on other things not yet specified, including the character of the data, and the kinds of classification applied later. I doubt that you want to turn your rows into 40 * 150 = 6000 dimensions; that'd be counter to some of the usual intended benefits of a Word2Vec-based analysis, where the word 'apple' has much the same significance no matter where it appears in a textual list-of-words.
You'd have to say more about your data, & classification goals, to get a better recommendation here.
If you haven't already tried a 'bag-of-words' style representation of your rows (no word2vec), where every row is represented by a 20000-dimension sparse count vector, and run a classifier on that, I'd recommend that first, as a baseline.
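For example, a quick baseline along those lines with scikit-learn (a sketch; texts and labels are placeholders for your 1M rows, joined back into strings, and their class labels):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# texts: list of 1M strings (the 40 tokens of each row joined by spaces)
# labels: the corresponding class labels
vectorizer = CountVectorizer(max_features=20000)   # sparse bag-of-words counts
X = vectorizer.fit_transform(texts)                # sparse matrix, (1000000, 20000)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, labels, cv=3)     # quick cross-validated baseline
print(scores.mean())
```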