Training fasttext word embedding on your own corpus


I want to train fastText on my own corpus. However, I have a small question before continuing: do I need each sentence as a separate item in the corpus, or can I have many sentences as one item?

For example, I have this DataFrame:

 text                                               |     summary
 ------------------------------------------------------------------
 this is sentence one this is sentence two continue | one two other
 other similar sentences some other                 | word word sent

Basically, the text column is an article, so it contains many sentences. Because of the preprocessing, I no longer have full stops. So the question is: can I do something like this directly, or do I need to split each sentence?

from sklearn.feature_extraction.text import TfidfVectorizer

docs = df['text']
vectorizer = TfidfVectorizer()
vectorizer.fit_transform(docs)

From the tutorials I read, I need a list of words for each sentence, but what if I have a list of words from a whole article? What is the difference? Is this the right way of training fastText on my own corpus?

Thank you!

CodePudding user response:

FastText requires plain text as its training data - not anything that has been pre-vectorized, for example by TfidfVectorizer. (If that step is part of your FastText process, it's misplaced.)

The Gensim FastText support requires the training corpus as a Python iterable, where each item is a list of string word-tokens.
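As a minimal sketch (assuming Gensim 4.x, that df['text'] holds one article per row, and using gensim's simple_preprocess purely as a stand-in for whatever tokenization you already do), the corpus and the training call can look like this:

from gensim.models import FastText
from gensim.utils import simple_preprocess

# Each article becomes one list of string tokens; whole articles are fine,
# there is no need to split them into individual sentences first.
corpus = [simple_preprocess(article) for article in df['text']]

# Train FastText directly on the token lists (Gensim 4.x parameter names).
model = FastText(
    sentences=corpus,   # iterable where each item is a list of word-tokens
    vector_size=100,    # dimensionality of the word vectors
    window=5,
    min_count=2,
    epochs=10,
)

# Look up a learned vector; FastText can also build vectors for
# out-of-vocabulary words from character n-grams.
vec = model.wv['sentence']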

Each list of tokens is typically some cohesive text in which the neighboring words were actually used together in natural language. It might be a sentence, a paragraph, a post, an article/chapter, or whatever. Gensim's only limitation is that each text shouldn't be more than 10,000 tokens long. (If your texts are longer than that, they should be split into separate parts of 10,000 tokens or fewer, as in the sketch below. But don't worry too much about the loss of word associations around the split points - in training sets large enough for an algorithm like FastText, any such loss of context is negligible.)
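If some of your articles do run past that limit, a plain helper like the hypothetical split_long_text below is enough; the 10,000 figure mirrors the per-text limit just mentioned, and the tokenizer is again only a placeholder for your own preprocessing:

from gensim.utils import simple_preprocess

def split_long_text(tokens, max_len=10_000):
    """Yield consecutive chunks of at most max_len tokens."""
    for start in range(0, len(tokens), max_len):
        yield tokens[start:start + max_len]

# Short articles pass through as a single chunk; long ones are fragmented.
corpus = [
    chunk
    for article in df['text']
    for chunk in split_long_text(simple_preprocess(article))
]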
