I am trying to train a text classifier with FastText. It has a bunch of options, along with a facility to train from the command line. One of those options is wordNgrams.
In my particular dataset, I discovered that many irrelevant queries are being classified with high confidence because they share similar tokens. So my plan was to ignore the unigram tokens and start from bigrams. Right now I get everything from 1-grams up to 5-grams by setting wordNgrams = 5, but my plan is to go from 2-grams to 5-grams only. It seems that FastText doesn't support this. Is there any way to achieve it? This is needed to minimize these false positives.
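For reference, this is roughly what my current setup looks like (a sketch using the fasttext Python package; the training-file name is a placeholder, and the command-line equivalent would use the -wordNgrams 5 flag):
```
import fasttext

# Stock option: wordNgrams=5 gives 1-grams through 5-grams; there is no
# built-in flag to start from 2-grams.
model = fasttext.train_supervised(input="train.txt", wordNgrams=5)
```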
CodePudding user response:
As far as I can tell, even though Facebook's fasttext lets users set a range for character n-grams (subword information) using -minn & -maxn, it only offers a single -wordNgrams parameter, setting the maximum length of word-multigrams.
However, it is also the case that the -supervised mode combines all the given tokens in an order-oblivious way. Thus, you could in your own preprocessing create whatever mix of n-grams (or other token-represented features) you'd like, then pass those to fasttext (which it would consider as all unigrams). As long as you apply the same preprocessing in training as in later classification, the effect should be the same.
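Here is a minimal sketch of that preprocessing idea, assuming the fasttext Python package and its __label__ prefix convention; the file names, the underscore joiner, and the toy examples are illustrative placeholders:
```
import fasttext

def ngram_tokens(text, n_min=2, n_max=5):
    """Turn a query into 2-gram .. 5-gram tokens, joined by underscores."""
    words = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            grams.append("_".join(words[i:i + n]))
    return grams

def to_fasttext_line(label, text):
    # fasttext supervised format: "__label__<label> token token ..."
    return "__label__" + label + " " + " ".join(ngram_tokens(text))

# Write a preprocessed training file; the n-grams are now baked into the tokens,
# so wordNgrams can stay at its default of 1.
with open("train_ngrams.txt", "w", encoding="utf-8") as f:
    for label, query in [("relevant", "reset my account password"),
                         ("irrelevant", "weather forecast for tomorrow")]:
        f.write(to_fasttext_line(label, query) + "\n")

model = fasttext.train_supervised(input="train_ngrams.txt", wordNgrams=1)

# Apply the SAME preprocessing at classification time.
print(model.predict(" ".join(ngram_tokens("reset my account password"))))
```
Note that because only 2-grams and longer are emitted, a single-token query produces no features at all, so you may want to decide separately how to handle those.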
(You could even use the sklearn CountVectorizer's preprocessing.)
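For example (an illustrative sketch; the underscore-joining is just one way to turn sklearn's space-separated n-grams into single fasttext tokens):
```
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns a callable that yields word n-grams in the given range.
analyzer = CountVectorizer(ngram_range=(2, 5)).build_analyzer()
tokens = ["_".join(g.split()) for g in analyzer("reset my account password")]
print(tokens)  # e.g. ['reset_my', 'my_account', ..., 'reset_my_account_password']
```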
Keep in mind the warning from ~Erwan in a comment, though: adding so many distinct features increases the risk of overfitting, which could show up as your stated problem, "many irrelevant queries are being classified with high confidence because they share similar tokens". (The model, made large by the inclusion of so many n-grams, has memorized idiosyncratic minutiae from the training data, which then leads it astray when it applies non-generalizable inferences to out-of-training data.)