How to catch any words in TfidfVectorizer by token_pattern

Time:11-20

I'd like TfidfVectorizer to capture every token separated by a single space, including tokens such as "0", "a", "x", and "0?0". I wrote the code below for this purpose.

However, it does not seem to work as expected.

vectorizer = TfidfVectorizer(smooth_idf=False, token_pattern=r"[^ ]+")

CodePudding user response:

You may be looking for word boundaries:

\b\S+\b

Explanation:

  • \b matches a word boundary, i.e. a position between a word character and a non-word character (or the start or end of the string). The first \b anchors the match at the start of a word, such as right after a space or a newline.
  • \S+ matches one or more non-whitespace characters (the word you are looking for)
  • The second \b anchors the end of the match at a word boundary

Usage:

For the string "Greetings from Spain", it would match Greetings, from, and Spain.
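
As a rough check, the two patterns can be compared with the standard re module alone. This is only a sketch: the sample string is made up for illustration, and it assumes the vectorizer's default tokenizer behaves like re.findall with the compiled token_pattern.

```python
import re

# Hypothetical sample text: short tokens, a token with punctuation inside,
# and a token that ends in punctuation.
text = "0 a x 0?0 c++"

# Pattern from the question: runs of characters that are not a space.
space_tokens = re.findall(r"[^ ]+", text)

# Pattern from the answer: non-whitespace runs that start and end
# on a word boundary.
boundary_tokens = re.findall(r"\b\S+\b", text)

print(space_tokens)     # ['0', 'a', 'x', '0?0', 'c++']
print(boundary_tokens)  # ['0', 'a', 'x', '0?0', 'c']
```

Both patterns keep "0", "a", "x", and "0?0", but the word-boundary version trims trailing punctuation ("c++" becomes "c") and would skip a token made only of punctuation. If every space-separated token must survive unchanged, the [^ ]+ form is probably the safer choice. Note also that TfidfVectorizer lowercases its input by default (lowercase=True), which is independent of token_pattern.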
