I am trying to continue training a fastText model with Gensim, using my own corpus of text.
I've followed along with the documentation here: https://radimrehurek.com/gensim/models/fasttext.html
And I have written the following code:
First, create a small corpus:
corpus = [
"The brown dog jumps over the kangaroo",
"I want to ride my bicycle to Mount Everest",
"What a lovely day it is",
"When I Wagagamagga, everybody stops to listen"
]
corpus = [sentence.split() for sentence in corpus]
And then load a testing model:
from gensim.models.fasttext import load_facebook_model
from gensim.test.utils import datapath
model = load_facebook_model(datapath("crime-and-punishment.bin"))
Then I do a check to see if the model knows my weird new word in the corpus:
'Wagagamagga' in model.wv.key_to_index
Which returns False.
Then I try to continue the training:
model.build_vocab(corpus, update=True)
model.train(corpus, total_examples=len(corpus), epochs=model.epochs)
The model should know about my weird new word now, but this still returns False when I expect it to return True:
'Wagagamagga' in model.wv.key_to_index
What have I missed?
CodePudding user response:
Models generally have a min_count value of at least 5, meaning words with fewer occurrences are ignored. Discarding the rarest words typically improves model quality, because both:
- such rare words have too few usage examples to get a good vector themselves; and further…
- by pushing surrounding words outside each other's windows, and by spending training cycles & internal-weight updates on a vector that still won't be good, they make other words' vectors worse
With larger training data, increasing the min_count even higher makes sense.
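You can check the threshold your loaded model is carrying (min_count is the standard Gensim attribute; its exact value depends on how the .bin was originally trained):
# the survival threshold inherited from the original training;
# words appearing fewer times than this are silently dropped
print(model.min_count)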
So, your problem is likely that a single occurrence of that word is insufficient for it to become a tracked word. The best fix would be a larger, varied corpus with multiple contrasting usage examples of the word, at least as many as the model.min_count value.
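As a minimal sketch of the mechanism only (assuming the model object from your question), you could repeat the sentence until the word's count in the update corpus reaches the threshold; a real fix would supply that many genuinely varied, contrasting usages instead of copies:
# hypothetical toy update-corpus: copies of one sentence, just to
# demonstrate the counting; real training data should be varied
extra = ["When I Wagagamagga, everybody stops to listen".split()] * model.min_count
model.build_vocab(extra, update=True)
model.train(extra, total_examples=len(extra), epochs=model.epochs)
print('Wagagamagga' in model.wv.key_to_index)  # True once count >= min_count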
Separately: note that it is always better to train a model with all data at the same time.
Incremental updates will run, but they introduce thorny issues of relative balance between older & newer training sessions. To the extent a new session includes only a subset of the words and representative word-usages, the words that are included can be nudged by the new training out of comparable alignment with words only seen in earlier sessions.
So if trying incremental updates, make sure your quality-evaluations are strong enough to detect whether the model is actually improving, or getting worse, on your real goals.
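For example, one crude sanity check (assuming 'crime' is a word tracked by your model; substitute any important in-vocabulary word) is to compare a probe word's nearest neighbours before and after an incremental session:
# neighbours of a probe word before the incremental session
before = {w for w, _ in model.wv.most_similar('crime', topn=10)}
# ... run build_vocab(update=True) and train() on the new data here ...
after = {w for w, _ in model.wv.most_similar('crime', topn=10)}
# heavy churn among established neighbours hints that older words have
# drifted out of alignment with the newly-trained ones
print(len(before & after), "of 10 neighbours unchanged")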