I am trying to train a text classifier with FastText. It has a bunch of options, along with a facility to train from the command line. One of those options is wordNgrams.
In my particular dataset, I discovered that many irrelevant queries are being classified with high confidence because they share similar tokens. So my plan was to ignore the unigram tokens and start from bigrams. Right now I get everything from 1-grams up to 5-grams by setting wordNgrams = 5, but my plan is to go from 2-grams to 5-grams only. It seems that FastText doesn't support this. Is there any way to achieve it? This is needed to minimize these false positives.
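For reference, this is roughly what my current setup looks like (a sketch using the fasttext Python package; the training-file name is a placeholder, and the command-line equivalent would use the -wordNgrams 5 flag):
```
import fasttext

# Stock option: wordNgrams=5 gives 1-grams through 5-grams; there is no
# built-in flag to start from 2-grams.
model = fasttext.train_supervised(input="train.txt", wordNgrams=5)
```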
CodePudding user response:
As far as I can tell, even though Facebook's fasttext lets users set a range for character n-grams (subword information) using -minn & -maxn, it only offers a single -wordNgrams parameter, setting the maximum length of word-multigrams.
However, it is also the case that the -supervised mode combines all the given tokens in an order-oblivious way. Thus, you could in your own preprocessing create whatever mix of n-grams (or other token-represented features) you'd like, then pass those to fasttext (which it would consider as all unigrams). As long as you apply the same preprocessing in training as in later classification, the effect should be the same.
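Here is a minimal sketch of that preprocessing idea, assuming the fasttext Python package and its __label__ prefix convention; the file names, the underscore joiner, and the toy examples are illustrative placeholders:
```
import fasttext

def ngram_tokens(text, n_min=2, n_max=5):
    """Turn a query into 2-gram .. 5-gram tokens, joined by underscores."""
    words = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            grams.append("_".join(words[i:i + n]))
    return grams

def to_fasttext_line(label, text):
    # fasttext supervised format: "__label__<label> token token ..."
    return "__label__" + label + " " + " ".join(ngram_tokens(text))

# Write a preprocessed training file; the n-grams are now baked into the tokens,
# so wordNgrams can stay at its default of 1.
with open("train_ngrams.txt", "w", encoding="utf-8") as f:
    for label, query in [("relevant", "reset my account password"),
                         ("irrelevant", "weather forecast for tomorrow")]:
        f.write(to_fasttext_line(label, query) + "\n")

model = fasttext.train_supervised(input="train_ngrams.txt", wordNgrams=1)

# Apply the SAME preprocessing at classification time.
print(model.predict(" ".join(ngram_tokens("reset my account password"))))
```
Note that because only 2-grams and longer are emitted, a single-token query produces no features at all, so you may want to decide separately how to handle those.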
(You could even use the sklearn CountVectorizer's preprocessing.)
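For example (an illustrative sketch; the underscore-joining is just one way to turn sklearn's space-separated n-grams into single fasttext tokens):
```
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns a callable that yields word n-grams in the given range.
analyzer = CountVectorizer(ngram_range=(2, 5)).build_analyzer()
tokens = ["_".join(g.split()) for g in analyzer("reset my account password")]
print(tokens)  # e.g. ['reset_my', 'my_account', ..., 'reset_my_account_password']
```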
Keep in mind the warning from ~Erwan in a comment, though: adding so many distinct features increases the risk of overfitting, which could show up as your stated problem, "many irrelevant queries are being classified with high confidence because they share similar tokens". (The model, made large by the inclusion of so many n-grams, has memorized idiosyncratic minutiae from the training data, which then leads it astray when it applies non-generalizable inferences to out-of-training data.)