Remove stop words from multi-lingual text

I am running textual and sentiment analysis on multi-lingual text files from the healthcare sector, and I want to remove stopwords from all the languages at once. I don't want to write the name of every language in the code to remove the stopwords. Is there any way I can do it fast?

Here is my code. There are 596 files in total:

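library(tm)

# Collect every .txt file under the working directory (596 files in my case)
files <- list.files(path = getwd(), pattern = "\\.txt$", all.files = FALSE,
                    full.names = TRUE, recursive = TRUE)
txt <- list()
for (i in seq_along(files)) {
  try(
    {
      txt[[i]] <- readLines(files[i], warn = FALSE)

      filename <- txt[[i]]
      filename <- trimws(filename)
      corpus <- iconv(filename, to = "UTF-8")
      corpus <- Corpus(VectorSource(corpus))

      # Clean text; lowercase first, because the stopword lists are lowercase
      # and removeWords is case-sensitive
      corpus <- tm_map(corpus, content_transformer(tolower))
      corpus <- tm_map(corpus, removePunctuation)
      corpus <- tm_map(corpus, removeNumbers)
      cleanset <- tm_map(corpus, removeWords, stopwords("english"))
      cleanset <- tm_map(cleanset, removeWords, stopwords("spanish"))
      cleanset <- tm_map(cleanset, stripWhitespace)

      # Remove spaces and newlines; gsub has to be wrapped in
      # content_transformer(), with the corpus as the first argument of tm_map
      cleanset <- tm_map(cleanset, content_transformer(gsub), pattern = "\n", replacement = " ")
      cleanset <- tm_map(cleanset, content_transformer(gsub), pattern = "^\\s+", replacement = "")
      cleanset <- tm_map(cleanset, content_transformer(gsub), pattern = "\\s+$", replacement = "")
      cleanset <- tm_map(cleanset, content_transformer(gsub), pattern = "[ |\t]+", replacement = " ")
    }, silent = TRUE)
}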

CodePudding user response：

Use spaCy, which ships language models with stopword lists for more than 15 languages. In R, it is available through the spacyr package.

CodePudding user response：

"I want to remove stopwords from all the languages at once."

Merge the results of each stopwords(cc) call, and pass that to a single tm_map(corpus, removeWords, allStopwords) call.
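A minimal sketch of that merge, assuming the tm package and its built-in "english" and "spanish" lists. The two sample sentences are made up for illustration, and the languages are hard-coded only to keep the example short:

```r
library(tm)

# Languages to combine (hard-coded here purely for illustration)
langs <- c("english", "spanish")

# One combined, de-duplicated stopword vector from each stopwords(cc) call
allStopwords <- unique(unlist(lapply(langs, tm::stopwords)))

# Made-up sample text, one document per language
corpus <- Corpus(VectorSource(c("The patient is stable",
                                "El paciente se encuentra estable")))

# Lowercase first: the stopword lists are lowercase and removeWords is case-sensitive
corpus <- tm_map(corpus, content_transformer(tolower))

# A single removeWords pass over the merged list
corpus <- tm_map(corpus, removeWords, allStopwords)
corpus <- tm_map(corpus, stripWhitespace)

content(corpus)
```

Because removeWords builds one regular expression from the whole word list, a very long merged list may need to be passed in several chunks rather than in one call.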

"I don't want to write the name of every language in the code to remove the stopwords."

You could use stopwords_getlanguages() from the stopwords package (it is not part of tm) to get a list of all the supported languages for a given source, and build the combined list in a loop. See an example at https://www.rdocumentation.org/packages/stopwords/versions/2.3
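Something along these lines should work, assuming the stopwords package is installed. The choice of the "snowball" source is my assumption; stopwords_getsources() lists the alternatives:

```r
library(stopwords)

# Source of the word lists; "snowball" is an arbitrary choice for this sketch
src <- "snowball"

# Every language code that this source supports, so none is typed by hand
langs <- stopwords::stopwords_getlanguages(source = src)

# Fetch each language's list and merge them into one de-duplicated vector
allStopwords <- unique(unlist(lapply(langs, function(cc) {
  stopwords::stopwords(language = cc, source = src)
})))

length(langs)         # number of languages covered
length(allStopwords)  # size of the merged list
```

The calls are written as stopwords::stopwords() on purpose: tm exports a function with the same name, so the explicit prefix avoids picking up the wrong one when both packages are loaded.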

For what it's worth, I think this (using the stopwords of all languages) is a bad idea. A stop word in one language can be a high-information word in another. E.g. just skimming https://github.com/stopwords-iso/stopwords-es/blob/master/stopwords-es.txt I spotted "embargo", "final", "mayor", "salvo", "sea", which are not in the English stopword list, and could carry information in English text.
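A rough way to check this kind of collision yourself, assuming the stopwords package and its "stopwords-iso" source (the same project as the list linked above). Comparing against the stopwords-iso English list is my assumption, so the result may differ for other English lists:

```r
library(stopwords)

# Spanish and English lists from the stopwords-iso source
es <- stopwords::stopwords("es", source = "stopwords-iso")
en <- stopwords::stopwords("en", source = "stopwords-iso")

# The words spotted while skimming the Spanish list
candidates <- c("embargo", "final", "mayor", "salvo", "sea")

# Which of them are Spanish stopwords but not English ones?
ambiguous <- candidates[candidates %in% es & !(candidates %in% en)]
ambiguous

# Every Spanish stopword that an English-only filter would have kept
spanishOnly <- setdiff(es, en)
length(spanishOnly)
```

Any word in spanishOnly that is also a meaningful English word would be stripped from English documents if the merged list were applied to everything.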

Of course it depends on what you are doing with the data once all these words have been stripped out.

But if you are doing something like searching for drug names or other keywords, just do that on the original data, without removing stopwords.