How do I remove duplicates that have different spellings?
Example table with duplicates
Name | Score |
---|---|
Abi | 12 |
Abby | 12 |
Aby | 12 |
Toom | 4 |
Tom | 4 |
Tm | 4 |
Crow | 9 |
Result I am looking for:
Name | Score |
---|---|
Abby | 12 |
Tom | 4 |
Crow | 9 |
name <- c('Abi', 'Abby', 'Aby', 'Toom', 'Tom', 'Tm', 'Crow')
score <- c(12,12,12,4,4,4,9)
duplicate <- data.frame(name,score)
CodePudding user response:
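Try grouping on a phonetic key with soundex from the phonics package, then keep the preferred spelling within each group: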
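library(dplyr)
library(phonics)
# Preferred spelling for each person
keyname <- c("Abby", "Tom", "Crow")
duplicate %>%
  # name2 is the preferred spelling where name matches one, otherwise NA
  mutate(name2 = keyname[match(name, keyname)]) %>%
  # names that sound alike share a soundex code
  group_by(grp = soundex(name)) %>%
  # replace every name in the group with its preferred spelling
  mutate(name = name2[!is.na(name2)]) %>%
  ungroup() %>%
  distinct(name, score)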
Output:
# A tibble: 3 × 2
  name  score
  <chr> <dbl>
1 Abby     12
2 Tom       4
3 Crow      9
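To see why the grouping step works, it can help to look at the phonetic keys directly. This is only a small sketch that reuses the names from the question and assumes the phonics package is installed; the variable name codes is mine, not part of the answer above.

```r
library(phonics)

name <- c("Abi", "Abby", "Aby", "Toom", "Tom", "Tm", "Crow")

# Phonetic key for each spelling; spellings that sound alike should share a key.
codes <- soundex(name)

# List the spellings that fall under each key.
split(name, codes)
```

If two genuinely different names happen to share a key, they would be merged as well, so it is worth checking the groups before deduplicating.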
CodePudding user response:
With adist (base R's generalized Levenshtein edit distance), you can group by string similarity:
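# Reference spellings to map every name onto
nm <- c("Abby", "Tom", "Crow")
duplicate |>
  # replace each name with the reference spelling at the smallest edit distance
  transform(name = nm[apply(adist(name, nm), 1, which.min)]) |>
  # collapse duplicates, averaging the score per name
  aggregate(score ~ name, FUN = mean)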
  name score
1 Abby    12
2  Tom     4
3 Crow     9
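One caveat with which.min is that every name gets assigned to its nearest reference spelling, however far away it is. Below is a rough sketch of how you might inspect the distance matrix and leave distant names unmatched. The cutoff max_dist and the helper variables d, nearest, distance and matched are my own assumptions for illustration, not part of the answer above.

```r
name <- c("Abi", "Abby", "Aby", "Toom", "Tom", "Tm", "Crow")
nm   <- c("Abby", "Tom", "Crow")

# Edit distance from every spelling (rows) to every reference name (columns).
d <- adist(name, nm)
dimnames(d) <- list(name, nm)
d

# Nearest reference name per row, and how far away it is.
nearest  <- nm[apply(d, 1, which.min)]
distance <- unname(apply(d, 1, min))

# max_dist is a hypothetical cutoff: spellings farther than this from
# every reference name are left unmatched (NA) instead of being forced
# into the closest group.
max_dist <- 2
matched <- ifelse(distance <= max_dist, nearest, NA_character_)

data.frame(name, matched, distance)
```

With the data from the question every spelling is within the cutoff, so the grouping is the same as above. A suitable cutoff depends on how long and how similar your real names are.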