R - In a character string, find a word given in first column of df and replace it by its correspondi

Time:10-01

I am working with messy character strings (from OCR) and want to correct them. Say I have a string like this:

messy_string <- c("This is a long string with m-istakes I want to corect")

I exported a list of all the words contained in this string so that I could manually give each word its replacement, like this:

Raw word     New form
m-istakes    mistake
corect       correct

This gives me the data frame above, with two columns: "Raw word" is the pattern I want to match, and "New form" is what I want to replace it with.

I suspect the solution is obvious, but I could not find working code that does the following: take "messy_string", search it for any form listed in the first column of the data frame, and replace each match with the form in the second column.

Do you have any idea how to make this work? Thanks a lot!

CodePudding user response:

Here's a very hacky option:

library(tidyverse)

# original string
messy_string <- c("This is a long string with m-istakes I want to corect")

# table of fixes
fix_table <- tibble(
  "Raw word" = c("m-istakes", "corect"),
  "New form" = c("mistakes", "correct")
)

# split your sentence by space " "
my_words <- messy_string %>% 
  str_split(" ") 

# replace each word that has an entry in the fix table
my_corrected_words <- map_chr(my_words[[1]], function(word) {
  if (word %in% fix_table$`Raw word`) {
    # look up the replacement for this word
    # (assumes each raw word appears only once in fix_table)
    fix_table %>% 
      filter(`Raw word` == word) %>% 
      pull(`New form`)
  } else {
    # no fix listed: keep the word unchanged
    word
  }
})

# turn separated characters back into sentence
fixed <- paste(my_corrected_words, collapse = " ")
fixed
#> [1] "This is a long string with mistakes I want to correct"
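
The same word-by-word lookup can probably be done without an explicit loop. Below is a minimal base R sketch, assuming words are separated by single spaces and each raw word appears only once in the table. The column names `raw` and `new` are placeholders chosen for this example, and the replacement values follow the table in the question.

```r
# original string
messy_string <- "This is a long string with m-istakes I want to corect"

# table of fixes (column names are hypothetical, values from the question)
fix_table <- data.frame(
  raw = c("m-istakes", "corect"),
  new = c("mistake", "correct"),
  stringsAsFactors = FALSE
)

# split the sentence into words
words <- strsplit(messy_string, " ", fixed = TRUE)[[1]]

# for each word, find its row in the fix table (NA when there is no fix)
idx <- match(words, fix_table$raw)

# overwrite only the words that have a fix
has_fix <- !is.na(idx)
words[has_fix] <- fix_table$new[idx[has_fix]]

# glue the words back into a sentence
fixed_string <- paste(words, collapse = " ")
fixed_string
#> [1] "This is a long string with mistake I want to correct"
```

Because `match()` compares whole words, a raw word that only occurs inside a longer word is left alone. Punctuation attached to a word (for example "corect,") would not match and would need to be handled separately.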

CodePudding user response:

A base R option is to chain gsub() calls with Reduce(), applying one row of the table at a time:

> Reduce(function(x, k) with(df, gsub(RawWord[k], NewForm[k], x, fixed = TRUE)), seq_len(nrow(df)), init = messy_string)
[1] "This is a long string with mistake I want to correct"

where

df <- structure(list(RawWord = c("m-istakes", "corect"), NewForm = c("mistake", 
"correct")), class = "data.frame", row.names = c(NA, -2L))
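
One thing to keep in mind with `fixed = TRUE` is that it replaces the raw form wherever it occurs, including inside longer words. If that matters for your data, a possible variant is to match only whole space-delimited words. The sketch below is one way to do that, not the only one: `replace_whole_words` is a hypothetical helper name, and it assumes the raw forms do not contain the sequence `\E` and the new forms contain no backslashes.

```r
df <- data.frame(
  RawWord = c("m-istakes", "corect"),
  NewForm = c("mistake", "correct"),
  stringsAsFactors = FALSE
)

# hypothetical helper: replace each raw form only when it stands alone
# between whitespace (or at the start/end of the string)
replace_whole_words <- function(x, raw, new) {
  for (k in seq_along(raw)) {
    # \Q...\E quotes the raw form literally (PCRE syntax, needs perl = TRUE);
    # the lookarounds require no non-space character on either side
    pattern <- paste0("(?<!\\S)\\Q", raw[k], "\\E(?!\\S)")
    x <- gsub(pattern, new[k], x, perl = TRUE)
  }
  x
}

replace_whole_words("m-istakes corect corectly", df$RawWord, df$NewForm)
#> [1] "mistake correct corectly"
```

Here "corectly" is left untouched, whereas the `fixed = TRUE` version would turn it into "correctly".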