I am working with R and have a data frame that includes names and corresponding office addresses. But the addresses are not perfectly matched with names and I have to replace the wrong addresses based on the most frequent value. Here is an example:
ID Clients ADDRESS FIRST_NAME LAST_NAME
1 A Cobben Cui Z
2 B Beerstaat Cui Z
3 C River Cui Z
4 D Cobben Cui Z
5 E Cobben Cui Z
6 F Cobben Cui Z
6 F Hart Mike G
7 H Hart Mike G
8 I Sparren Mike G
And I would like a data frame that looks like this:
ID Clients ADDRESS FIRST_NAME LAST_NAME
1 A Cobben Cui Z
2 B Cobben Cui Z
3 C Cobben Cui Z
4 D Cobben Cui Z
5 E Cobben Cui Z
6 F Cobben Cui Z
6 F Hart Mike G
7 H Hart Mike G
8 I Hart Mike G
I think it can be done by creating a frequency table and I was wondering whether there is a less complicated method.
Thanks a lot!
CodePudding user response:
A dplyr
approach using add_count
and which.max
may look like so:
library(dplyr)
d %>%
add_count(LAST_NAME, ADDRESS) %>%
group_by(LAST_NAME) %>%
mutate(ADDRESS = ADDRESS[which.max(n)]) %>%
ungroup() %>%
select(-n)
#> # A tibble: 9 × 5
#> ID Clients ADDRESS FIRST_NAME LAST_NAME
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 A Cobben Cui Z
#> 2 2 B Cobben Cui Z
#> 3 3 C Cobben Cui Z
#> 4 4 D Cobben Cui Z
#> 5 5 E Cobben Cui Z
#> 6 6 F Cobben Cui Z
#> 7 6 F Hart Mike G
#> 8 7 H Hart Mike G
#> 9 8 I Hart Mike G
DATA
d <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 6L, 6L, 7L, 8L), Clients = c(
"A",
"B", "C", "D", "E", "F", "F", "H", "I"
), ADDRESS = c(
"Cobben",
"Beerstaat", "River", "Cobben", "Cobben", "Cobben", "Hart", "Hart",
"Sparren"
), FIRST_NAME = c(
"Cui", "Cui", "Cui", "Cui", "Cui",
"Cui", "Mike", "Mike", "Mike"
), LAST_NAME = c(
"Z", "Z", "Z",
"Z", "Z", "Z", "G", "G", "G"
)), class = "data.frame", row.names = c(
NA,
-9L
))
CodePudding user response:
Not that easy. I like stefan's version. Here is a possible alternative:
df %>%
group_by(ADDRESS) %>%
summarise(n = n()) %>%
left_join(df, by = c("ADDRESS")) %>%
group_by(FIRST_NAME, LAST_NAME) %>%
arrange(desc(n), .by_group = TRUE) %>%
mutate(ADDRESS = first(ADDRESS)) %>%
select(-n)
ADDRESS ID Clients FIRST_NAME LAST_NAME
<chr> <int> <chr> <chr> <chr>
1 Cobben 1 A Cui Z
2 Cobben 4 D Cui Z
3 Cobben 5 E Cui Z
4 Cobben 6 F Cui Z
5 Cobben 2 B Cui Z
6 Cobben 3 C Cui Z
7 Hart 6 F Mike G
8 Hart 7 H Mike G
9 Hart 8 I Mike G
CodePudding user response:
We may use base R
- use the Mode
on 'ADDRESS' grouping by 'FIRST_NAME', 'LAST_NAME' in ave
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
df1$ADDRESS <- with(df1, ave(ADDRESS, FIRST_NAME,LAST_NAME, FUN = Mode))
-output
> df1
ID Clients ADDRESS FIRST_NAME LAST_NAME
1 1 A Cobben Cui Z
2 2 B Cobben Cui Z
3 3 C Cobben Cui Z
4 4 D Cobben Cui Z
5 5 E Cobben Cui Z
6 6 F Cobben Cui Z
7 6 F Hart Mike G
8 7 H Hart Mike G
9 8 I Hart Mike G
data
df1 <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 6L, 6L, 7L, 8L),
Clients = c("A",
"B", "C", "D", "E", "F", "F", "H", "I"), ADDRESS = c("Cobben",
"Beerstaat", "River", "Cobben", "Cobben", "Cobben", "Hart", "Hart",
"Sparren"), FIRST_NAME = c("Cui", "Cui", "Cui", "Cui", "Cui",
"Cui", "Mike", "Mike", "Mike"), LAST_NAME = c("Z", "Z", "Z",
"Z", "Z", "Z", "G", "G", "G")), class = "data.frame", row.names = c(NA,
-9L))