Remove underscores between words so they don't appear in n-grams in R

Before running a topic model, I generate n-grams so that two- and three-word phrases can appear in the topic model afterward.

toks_data_ngrams <- tokens_ngrams(toks_data, n = 2:3)

After this, however, my topic model includes many tokens like a_b, apple_banana, and happy_hand.

How can I ignore those words with underscores? I don't want them to be included in my topic model. Is there an extra argument for tokens_ngrams so that the n-grams don't contain an underscore between words? (I've already removed punctuation and symbols during pre-processing.)

Thanks in advance for your input!

CodePudding user response：

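Assuming toks_data_ngrams is a quanteda tokens object, you can drop every token that contains an underscore with tokens_remove(). Note that with the default concatenator every n-gram contains an underscore, so this removes all of them:

toks_data_ngrams <- tokens_remove(toks_data_ngrams, pattern = "*_*", valuetype = "glob")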
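
As a small reproducible sketch of this kind of filtering, assuming the quanteda package and a made-up two-document corpus (the texts and the n = 1:3 setting are illustrative assumptions, not taken from the question), tokens_remove() with a glob pattern drops every token that contains an underscore:

```r
library(quanteda)

# Toy corpus, invented for illustration only
txt <- c(d1 = "apple banana cherry", d2 = "happy hand")
toks_data <- tokens(txt, remove_punct = TRUE, remove_symbols = TRUE)

# n = 1:3 keeps the single words alongside the bigrams and trigrams
toks_data_ngrams <- tokens_ngrams(toks_data, n = 1:3)

# Remove every token that contains the default "_" concatenator
toks_clean <- tokens_remove(toks_data_ngrams, pattern = "*_*", valuetype = "glob")

print(as.list(toks_clean))
```

Because every n-gram built with the default concatenator contains an underscore, only the single words survive here. With n = 2:3, as in the question, nothing would be left.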

In the future, please include a reproducible example in your questions.

CodePudding user response：

tokens_ngrams has a concatenator argument, which is set to "_" by default. You can specify any string you want, a space for example:

toks_data_ngrams <- tokens_ngrams(toks_data, n = 2:3, concatenator = " ")
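
To see the effect on a toy input, here is a minimal sketch, assuming the quanteda package; the example sentence is made up. It builds the same bigrams and trigrams with a space as the concatenator and checks that no token type contains an underscore:

```r
library(quanteda)

# Toy input, invented for illustration only
toks_data <- tokens("apple banana cherry")

# Join the words of each n-gram with a space instead of the default "_"
toks_space <- tokens_ngrams(toks_data, n = 2:3, concatenator = " ")

print(as.list(toks_space))

# Check that none of the resulting token types contains an underscore
has_underscore <- any(grepl("_", types(toks_space), fixed = TRUE))
print(has_underscore)
```

The n-grams still enter the topic model as multi-word features; only the character that joins them changes.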