How to remove underscores from a text in Quanteda Tokens in R


EDIT: See the EDIT below.

I'm trying to convert a corpus object to tokens using R and quanteda. Using the options in tokens(), I cannot seem to remove the underscores in some words/characters. When I try using stri_replace_all_regex(), the tokens that contained underscores disappear completely.

The following code gives the output below.

Code:

library(quanteda)

# english_stopwords: a character vector of stopwords defined elsewhere
dirty_corpus <- corpus(textdata)

toks <- dirty_corpus %>%
  # note: stri_replace_* returns a plain character vector, which tokens() accepts
  stringi::stri_replace_all_regex("\'[a-z]*", "") %>%
  tokens(what = "word", remove_punct = TRUE, preserve_tags = FALSE, remove_numbers = TRUE,
         remove_separators = TRUE, remove_url = TRUE, split_hyphens = TRUE,
         remove_symbols = TRUE, split_tags = TRUE, verbose = TRUE) %>%
  tokens_remove(pattern = phrase(english_stopwords), valuetype = 'fixed') %>%
  tokens_wordstem() %>%
  tokens_tolower()

Output:

text6 : [1] "ys" "s_" "_t" "_s" "sw" "lnk" "smn" "pstd" "dwn" "blw" [11] "srri"

I want the following output:

text6 : [1] "ys" "s" "t" "s" "sw" "lnk" "smn" "pstd" "dwn" "blw" [11] "srri"

When I chain:

stringi::stri_replace_all_regex("_", "") %>%

resulting in the code:

dirty_corpus <- corpus(textdata)

toks <- dirty_corpus %>%
  stringi::stri_replace_all_regex("\'[a-z]*", "") %>%
  stringi::stri_replace_all_regex("_", "") %>%
  tokens(what = "word", remove_punct = TRUE, preserve_tags = FALSE, remove_numbers = TRUE, remove_separators = TRUE,
         remove_url = TRUE, split_hyphens = TRUE, remove_symbols = TRUE, split_tags = TRUE, verbose = TRUE) %>%
  tokens_remove(pattern = phrase(english_stopwords), valuetype = 'fixed') %>%
  tokens_wordstem() %>%
  tokens_tolower()

The output becomes the following:

text6 : [1] "ys" "sw" "lnk" "smn" "pstd" "dwn" "blw" "srri"

This makes the tokens that previously contained an underscore disappear entirely.

How can I obtain the result I intend?

EDIT In hindsight, everything was performing exactly as intended! Since I didn't write all of the code myself, I did not realize that the tokens being removed were in the stopwords list, hence they were being removed! Stupid!
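
A quick way to confirm this is to check which de-underscored token types sit in the stopword list. A minimal sketch, assuming english_stopwords resembles the NLTK English list (which, unlike the default snowball list, contains bare contraction fragments such as "s" and "t"):

require(quanteda)

# assumption: a stand-in for the question's english_stopwords
english_stopwords <- stopwords("en", source = "nltk")

toks <- tokens("ys s_ _t _s sw")
cleaned <- stringi::stri_replace_all_fixed(types(toks), "_", "")

# "s" and "t" show up here, which is why those tokens vanished
intersect(cleaned, english_stopwords)
#> [1] "s" "t"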

CodePudding user response:

You can use types() and tokens_replace() to modify your tokens using stringi functions.

require(quanteda)
txt <- "_t ys s_ _t _s sw lnk _s"
toks <- tokens(txt)
print(toks)
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "_t"  "ys"  "s_"  "_t"  "_s"  "sw"  "lnk" "_s"
toks_rep <- tokens_replace(toks, 
                           pattern = types(toks), 
                           replacement = stringi::stri_replace_all_fixed(types(toks), "_", ""),
                           valuetype = "fixed")
print(toks_rep)
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "t"   "ys"  "s"   "t"   "s"   "sw"  "lnk" "s"

