I have the following data frame:
report <- data.frame(Text = c("unit 1 crosses the street",
"driver 2 was speeding and saw driver# 1",
"year 2019 was the year before the pandemic",
"hey saw hei hei in the wood",
"hello: my kityy! you are the best"), id = 1:5)
report
Text id
1 unit 1 crosses the street 1
2 driver 2 was speeding and saw driver# 1 2
3 year 2019 was the year before the pandemic 3
4 hey saw hei hei in the wood 4
5 hello: my kityy! you are the best 5
From previous coding help, we can remove stop words using the following code:
report$Text <- gsub(paste0('\\b', tm::stopwords("english"), '\\b',
                           collapse = '|'), '', report$Text)
report
Text id
1 unit 1 crosses street 1
2 driver 2 speeding saw driver# 1 2
3 year 2019 year pandemic 3
4 hey saw hei hei wood 4
5 hello: kityy! best 5
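(The paste0(..., collapse = '|') call builds one word-bounded alternation of all the English stop words, roughly "\\bi\\b|\\bme\\b|\\bmy\\b|...", which gsub() then removes in a single pass. It leaves the surrounding spaces behind, which is why the noise removal below also needs to deal with extra whitespace.)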
I want to remove words shorter than a certain character length (for example, words with fewer than 4 characters such as hei and hey). I also need to remove manual stop words (for example, saw and kityy) and common noise (whitespace, numbers, and punctuation) before tokenization. The final outcome would be:
Text id
1 unit crosses street 1
2 driver speeding driver 2
3 year year pandemic 3
4 wood 4
5 hello best 5
A similar question regarding noise and manual stop words is posted here.
CodePudding user response:
Building on the previous code, if we start by removing the words whose nchar is less than or equal to 3 (with gsubfn() from the gsubfn package), then strip the digits and punctuation, and finally drop the manual and English stop words, it should work:
library(gsubfn)
# drop words of <= 3 characters, strip digits/punctuation, remove the
# manual + English stop words, then squeeze and trim the leftover whitespace
trimws(gsub("\\s+", " ",
    gsub(paste0("\\b(", paste(union(c("saw", "kityy"),
            tm::stopwords("english")), collapse = "|"), ")\\b"), "",
        gsub("[[:punct:]0-9]+", "",
            gsubfn("\\w+", function(x) if(nchar(x) > 3) x else '', report$Text)))))
-output
[1] "unit crosses street" "driver speeding driver"
[3] "year year pandemic" "wood" "hello best"