Before running a topic model, I generate n-grams so that two- and three-word phrases can appear in the model afterward.
toks_data_ngrams <- tokens_ngrams(toks_data, n=2:3)
After this, however, my topic model includes many terms like a_b, apple_banana, happy_hand.
How can I ignore those terms with underscores? I don't want them to be included in my topic model. Is there any extra code for n-grams so that they don't produce terms with an underscore in between? (I've already removed punctuation and symbols during pre-processing.)
Thanks so much in advance for all your input!
CodePudding user response:
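You can exclude them with tokens_remove(), which drops every token matching a pattern:
toks_data_ngrams <- tokens_remove(toks_data_ngrams, pattern = "*_*")
Note that with n = 2:3 every token is an n-gram joined by an underscore, so this removes all of them; it is only useful if single words are also present (for example, n = 1:3).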
In the future, always include a reproducible example in your questions.
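As a rough illustration of pattern-based removal, here is a minimal sketch. The sentence is invented toy input, and n = 1:3 is an assumption made so that single words survive the filter; it is not taken from the question.

```r
library(quanteda)

# Toy input, invented for illustration only.
toks <- tokens("apple banana cherry")

# n = 1:3 keeps the single words alongside the 2- and 3-grams.
toks_ngrams <- tokens_ngrams(toks, n = 1:3)

# Drop every token containing an underscore.
# "*_*" is a glob pattern, which is the default valuetype.
toks_clean <- tokens_remove(toks_ngrams, pattern = "*_*")

print(toks_clean)
```

Only the single words should remain, since every 2- and 3-gram contains the underscore concatenator.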
CodePudding user response:
tokens_ngrams has a concatenator option. By default this is set to _. You can specify anything you want, a space for example:
tokens_ngrams(toks_data, n= 2:3, concatenator = " ")
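To see the effect of the concatenator argument, a small self-contained sketch may help. The sentence below is invented toy input, not the asker's data.

```r
library(quanteda)

# Toy input, invented for illustration only.
toks <- tokens("apple banana cherry")

# Default concatenator: words in each n-gram are joined with "_".
ngrams_default <- tokens_ngrams(toks, n = 2:3)

# Custom concatenator: words in each n-gram are joined with a space.
ngrams_space <- tokens_ngrams(toks, n = 2:3, concatenator = " ")

print(ngrams_default)
print(ngrams_space)
```

The n-grams themselves are unchanged; only the joining character differs, so the phrases still reach the topic model as single features.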