How to match any word in TfidfVectorizer with token_pattern

I'd like TfidfVectorizer to treat every space-separated word as a token, including tokens such as "0", "a", "x", "0?0" and so on. I wrote the code below for this purpose.

However, this code doesn't seem to work as intended.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(smooth_idf=False, token_pattern=r"[^ ]+")

CodePudding user response：

You may be looking for word boundaries:

\b\S+\b

Explanation:

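\b matches a word boundary, that is, the position between a word character (letter, digit or underscore) and a non-word character or the edge of the string. The first \b anchors the match at the start of a word, for example right after a space, other whitespace or a newline.
\S+ matches one or more non-whitespace characters (the word you are looking for).
The second \b anchors the end of the match at a word boundary, so the token also has to end in a word character.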
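
To see how the three parts interact, here is a minimal sketch using Python's built-in re module. The sample strings and variable names are only illustrative and not part of the question.

```python
import re

# One or more non-whitespace characters, anchored at a word
# boundary on both sides.
pattern = re.compile(r"\b\S+\b")

# Tokens from the question: each starts and ends with a word
# character, so each one is matched whole, including "0?0".
short_tokens = pattern.findall("0 a x 0?0")
print(short_tokens)  # ['0', 'a', 'x', '0?0']

# Leading or trailing punctuation falls outside the boundaries:
# the comma is trimmed and the lone "?" is not matched at all.
punctuated = pattern.findall("Hello, world ?")
print(punctuated)  # ['Hello', 'world']
```

Note that a token made up only of punctuation has no word boundary next to it, so this pattern skips it. If such tokens matter to you, a plain \S+ without the \b anchors would keep them.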

Usage:

For the string "Greetings from Spain", it would match Greetings, from and Spain.
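
Plugged into the vectorizer from the question, this could look roughly like the sketch below. It assumes scikit-learn is installed. The two sample documents are made up for illustration, and build_tokenizer() is used only to inspect how the pattern splits a string.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same settings as in the question, with the word-boundary pattern.
vectorizer = TfidfVectorizer(smooth_idf=False, token_pattern=r"\b\S+\b")

# build_tokenizer() returns the function the vectorizer uses to split
# a string into tokens (before lowercasing is applied).
tokenize = vectorizer.build_tokenizer()
spain_tokens = tokenize("Greetings from Spain")
short_tokens = tokenize("0 a x 0?0")
print(spain_tokens)  # ['Greetings', 'from', 'Spain']
print(short_tokens)  # ['0', 'a', 'x', '0?0']

# Fitting on illustrative documents keeps the one-character tokens
# that the default token_pattern would drop.
docs = ["0 a x 0?0", "Greetings from Spain"]
matrix = vectorizer.fit_transform(docs)
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)  # ['0', '0?0', 'a', 'from', 'greetings', 'spain', 'x']
```

The vocabulary is lowercased because TfidfVectorizer uses lowercase=True by default; pass lowercase=False if case should be preserved.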