How do I clean duplicate values that have different spellings in R?

Time:11-02

How do I remove duplicates that are spelled differently?

Example table with duplicates

Name Score
Abi 12
Abby 12
Aby 12
Toom 4
Tom 4
Tm 4
Crow 9

Result I am looking for

Name Score
Abby 12
Tom 4
Crow 9

Reproducible data:

name <- c('Abi', 'Abby', 'Aby', 'Toom', 'Tom', 'Tm', 'Crow')
score <- c(12, 12, 12, 4, 4, 4, 9)
duplicate <- data.frame(name, score)

CodePudding user response:

Try

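library(dplyr)
library(phonics)

# canonical spellings to keep
keyname <- c("Abby", "Tom", "Crow")

duplicate %>%
  # name2 is the canonical spelling where name already matches one, NA otherwise
  mutate(name2 = keyname[match(name, keyname)]) %>%
  # group names that share a Soundex code
  group_by(grp = soundex(name)) %>%
  # give every row its group's canonical spelling (assumes exactly one per group)
  mutate(name = name2[!is.na(name2)]) %>%
  ungroup() %>%
  distinct(name, score)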

-output

# A tibble: 3 × 2
  name  score
  <chr> <dbl>
1 Abby     12
2 Tom       4
3 Crow      9
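
To see why grouping by soundex() works here, it can help to look at the codes themselves. The snippet below is only a small sketch that reuses the name vector from the question and assumes the phonics package is installed; it lists which spellings end up sharing a code.

```r
library(phonics)

name <- c("Abi", "Abby", "Aby", "Toom", "Tom", "Tm", "Crow")

# Soundex keeps the first letter and encodes the following consonants as
# digits, so spelling variants that sound alike tend to share a code.
codes <- soundex(name)

# List the spellings that fall under each code.
split(name, codes)
# $A100: "Abi" "Abby" "Aby"
# $C600: "Crow"
# $T500: "Toom" "Tom" "Tm"
```

Names that differ in their first letter or in their consonants would get different codes, so this approach suits misspellings that still sound the same.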

CodePudding user response:

With adist(), which computes the edit (Levenshtein) distance between strings, you can map each name to the closest canonical spelling:

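nm <- c("Abby", "Tom", "Crow")  # canonical spellings
duplicate |>
  # replace each name with the key at the smallest edit distance
  transform(name = nm[apply(adist(name, nm), 1, which.min)]) |>
  # collapse the duplicates, averaging their scores
  aggregate(score ~ name, FUN = mean)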

  name score
1 Abby    12
2 Crow     9
3  Tom     4
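
One caveat with which.min is that it always picks some key, even for a name that resembles none of them. The following is a minimal base R sketch of a guard against that. The helper nearest_key(), its max_dist cutoff of 2, and the extra name "Zed" are illustrative assumptions, not part of the answer above.

```r
# Hypothetical helper: map each name to its closest key by edit distance,
# returning NA when even the closest key is more than max_dist edits away.
nearest_key <- function(x, keys, max_dist = 2) {
  d <- adist(x, keys)                        # rows: x, columns: keys
  best <- apply(d, 1, which.min)             # column index of the closest key
  best_dist <- d[cbind(seq_along(x), best)]  # distance to that key
  ifelse(best_dist <= max_dist, keys[best], NA_character_)
}

# "Zed" is an extra, made-up name that matches none of the keys.
name <- c("Abi", "Abby", "Aby", "Toom", "Tom", "Tm", "Crow", "Zed")
keys <- c("Abby", "Tom", "Crow")

mapped <- nearest_key(name, keys)
mapped
# "Abby" "Abby" "Abby" "Tom" "Tom" "Tom" "Crow" NA
```

The right cutoff depends on the data: short names may need a tighter limit, since a couple of edits can turn one short name into another.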