Tidytext R - find and replace

Time:12-18

I have the results from a survey in which a bunch of answers have errors, such as misspellings, inconsistent upper/lower case, ...

Therefore, I need a find-and-replace kind of solution. (I've found some possible functions, but none of them seemed to work. I am kind of a noob.)

Instead of finding and replacing one by one, though, I would like to create a vector (?) of "mistakes" and replace them all with the correct answer, tidying my text so that I can visualize the results later.

I tried this

Consider VAR1 as the answers:

VAR1 <- c("motorbyke","motor bike","Mbike","Motor   B","Motor","Bike")

And I would like to change the misspelled answers to a correct one; let's say "motorbike"...

DB %>% 
mutate(VAR1 = replace(VAR1, VAR1 == "misspelling", "correct answer")) 

but there are too many errors for doing it individually...

Is there any solution for my dilemma?

Thank you

EDIT: tried to add an example

CodePudding user response:

Here's one possible solution using the tidyverse and a left_join against a lookup table of corrections:

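library(dplyr)

# Example data: the misspelled answers plus ten unrelated words
DB <- data.frame(
  VAR1 = c("motorbyke", "motor bike", "Mbike", "Motor   B", "Motor", "Bike",
           sample(stringr::words, 10)))

# Lookup table: one row per known misspelling, paired with its correction
correction_df <- data.frame(
  correction = "motorbike",
  incorrect = c("motorbyke", "motor bike", "Mbike", "Motor   B", "Motor", "Bike")
)

DB %>%
  left_join(correction_df, by = c(VAR1 = "incorrect")) %>%
  # Use the correction where one exists, otherwise keep the original answer
  mutate(VAR1 = ifelse(is.na(correction), VAR1, correction)) %>%
  select(-correction)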

New misspellings can be handled by adding rows to correction_df in the same format. Alternatively, the fuzzyjoin package does something very similar and might automate some of the corrections you're interested in.
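To give a feel for what "automating" corrections could look like, here is a minimal sketch of the same idea in base R, using utils::adist (edit distance) rather than fuzzyjoin itself. The answers and canonical vectors and the max_dist threshold of 2 are assumptions made up for illustration, not part of the original data; a threshold that suits real survey answers would need some trial and error.

```r
# Hypothetical data: raw answers plus a short list of accepted spellings.
answers   <- c("motorbyke", "motor bike", "Mbike", "bicycle", "car")
canonical <- c("motorbike", "bicycle", "car")

# Edit distance between every answer (rows) and every accepted spelling (columns).
d <- adist(answers, canonical, ignore.case = TRUE)

# Closest accepted spelling for each answer, and how far away it is.
nearest  <- canonical[apply(d, 1, which.min)]
distance <- apply(d, 1, min)

# Accept a correction only within the assumed threshold; otherwise keep the raw answer.
max_dist <- 2
cleaned  <- ifelse(distance <= max_dist, nearest, answers)
print(cleaned)
```

Answers that are too far from every accepted spelling (here "Mbike") are left untouched, so they would still need a row in correction_df or a manual fix.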

CodePudding user response:

You could build a regex pattern from your vector of misspellings and use str_replace to replace every match with "motorbike" (in a column, a vector, etc.):

VAR1 <- c("motorbyke","motor bike","Mbike","Motor   B","Motor","Bike")

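# Join the misspellings into one alternation: "motorbyke|motor bike|..."
my_pattern <- paste(VAR1, collapse = "|")

library(stringr)
# Replace the first match in each element with the correct spelling
str_replace(VAR1, my_pattern, 'motorbike')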

output:

[1] "motorbike" "motorbike" "motorbike" "motorbike" "motorbike" "motorbike"
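One thing to watch when applying this to a whole survey column: the pattern above is not anchored, so it can also match inside longer answers (for instance, "Bikes" contains "Bike"). A possible variation, sketched here in base R so it runs without extra packages, anchors the pattern to the whole answer and ignores case. The answers vector is hypothetical, and the sketch assumes the misspellings contain no regex metacharacters.

```r
# Misspellings from the question, plus a few hypothetical survey answers.
mistakes <- c("motorbyke", "motor bike", "Mbike", "Motor   B", "Motor", "Bike")
answers  <- c("motorbyke", "MOTOR BIKE", "Bike", "Bikes", "Motorway", "car")

# ^( ... )$ forces the whole answer to match one of the alternatives.
anchored <- paste0("^(", paste(mistakes, collapse = "|"), ")$")

# ignore.case = TRUE lets "MOTOR BIKE" match the lower-case misspelling.
cleaned <- sub(anchored, "motorbike", answers, ignore.case = TRUE)
print(cleaned)
```

With the anchors in place, "Bikes", "Motorway" and "car" are left as they are. The stringr equivalent would be to wrap the pattern in regex(anchored, ignore_case = TRUE) inside str_replace.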