Remove Words with less than Certain Character Lengths plus Noise Reduction before Tokenization

Time:04-23

I have the following data frame

report <- data.frame(Text = c("unit 1 crosses the street", 
       "driver 2 was speeding and saw driver# 1", 
        "year 2019 was the year before the pandemic",
        "hey saw       hei hei in        the    wood",
        "hello: my kityy! you are the best"), id = 1:5)
report 
                                         Text id
1                   unit 1 crosses the street  1
2     driver 2 was speeding and saw driver# 1  2
3  year 2019 was the year before the pandemic  3
4 hey saw       hei hei in        the    wood  4
5           hello: my kityy! you are the best  5

Based on an earlier answer, we can remove stop words with the following code.

report$Text <- gsub(paste0('\\b',tm::stopwords("english"), '\\b', 
                          collapse = '|'), '', report$Text)
report
                                    Text id
1                 unit 1 crosses  street  1
2      driver 2  speeding  saw driver# 1  2
3            year 2019   year   pandemic  3
4 hey saw       hei hei             wood  4
5                 hello:  kityy!    best  5

I want to remove words shorter than a given length (for example, words with fewer than 4 characters, such as hei and hey). I also need to remove custom stop words (for example, saw and kityy) and common noise (extra whitespace, numbers, and punctuation) before tokenization. The expected result is:

                                    Text id
1                   unit crosses  street  1
2                driver speeding  driver  2
3                     year year pandemic  3
4                                   wood  4
5                             hello best  5

A similar question regarding noise and manual stop words is posted here.

CodePudding user response:

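Starting from the previous code, we can first drop the words with 3 or fewer characters (with gsubfn), then strip punctuation and digits, remove the stop words (the custom ones plus the English list from tm), and finally collapse the leftover whitespace.

library(gsubfn)

# innermost call runs first: short words -> punctuation/digits -> stop words -> whitespace
trimws(gsub("\\s+", " ", gsub(paste0("\\b(", paste(union(c("saw", "kityy"), 
   tm::stopwords("english")), collapse="|"), ")\\b"), "", 
     gsub("[[:punct:]0-9]+", "", gsubfn("\\w+", function(x) 
     if(nchar(x) > 3) x else '', report$Text)))))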

-output

[1] "unit crosses street"    "driver speeding driver" 
[3] "year year pandemic"     "wood"                   "hello best"       
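
For comparison, the same steps can be sketched in base R by splitting each string into tokens and filtering them, which avoids the nested regex calls. This is only a sketch under a few assumptions: clean_text is a hypothetical helper (not part of gsubfn or tm), the short stop_words vector is a stand-in for union(c("saw", "kityy"), tm::stopwords("english")), and punctuation and digits are replaced by a space rather than deleted.

```r
# Sample data from the question
report <- data.frame(
  Text = c("unit 1 crosses the street",
           "driver 2 was speeding and saw driver# 1",
           "year 2019 was the year before the pandemic",
           "hey saw       hei hei in        the    wood",
           "hello: my kityy! you are the best"),
  id = 1:5,
  stringsAsFactors = FALSE
)

# Hypothetical helper: drops punctuation/digits, short words and stop words
clean_text <- function(x, stop_words, min_chars = 4) {
  # replace punctuation and digits with a space (assumption: not deleted)
  x <- gsub("[[:punct:][:digit:]]+", " ", x)
  # split on runs of whitespace; trimws() avoids a leading empty token
  tokens <- strsplit(trimws(x), "\\s+")
  vapply(tokens, function(tok) {
    keep <- nchar(tok) >= min_chars & !(tok %in% stop_words)
    paste(tok[keep], collapse = " ")
  }, character(1))
}

# Stand-in stop list (assumption); swap in the tm list for real use
stop_words <- c("saw", "kityy", "before")

report$Text <- clean_text(report$Text, stop_words)
report$Text
```

Because every token is compared against the stop list exactly, no word-boundary regex is needed, and changing min_chars adjusts the length threshold in one place.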