I want to count how many times the names in column y match those of column x. Trick is there are multiple names within each cell and there is no pattern so I cannot simply use str_count.
or see here reproducible code:
dt <-
data.frame(group = c('A', 'A', 'B', 'B'), names_X = c('James, Robert, Phill', 'James, Blake, Paul', 'Lucy, Macy', 'Lucy, Caty'),
names_Y = c('Robert, Peter', 'Robert, Peter', 'Macy, Lucy', 'Caty, Jess, Carla'))
dt %>%
group_by(group) %>%
mutate(matches = str_count(names_X, names_Y))
for which this is the output
structure(list(group = c("A", "A", "B", "B"), names_X = c("James, Robert, Phill",
"James, Blake, Paul", "Lucy, Macy", "Lucy, Caty"), names_Y = c("Robert, Peter",
"Robert, Peter", "Macy, Lucy", "Caty, Jess, Carla"), matches = c(0L,
0L, 0L, 0L)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4L), groups = structure(list(group = c("A",
"B"), .rows = structure(list(1:2, 3:4), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -2L), .drop = TRUE))
but the result for #matches remains 0 in all cases which is wrong.
See correct answer in picture at the top of page.
CodePudding user response:
dt$names_X <- strsplit(dt$names_X, ", ")
dt$names_Y <- strsplit(dt$names_Y, ", ")
dt$count <- mapply(\(x, y) sum(x %in% y), dt$names_X, dt$names_Y)
dt$matching_names <- mapply(
\(x, y) if (any(x %in% y)) toString(x[x %in% y]) else NA,
dt$names_X,
dt$names_Y
)
# group names_X names_Y count matching_names
# 1 A James, Robert, Phill Robert, Peter 1 Robert
# 2 A James, Blake, Paul Robert, Peter 0 <NA>
# 3 B Lucy, Macy Macy, Lucy 2 Lucy, Macy
# 4 B Lucy, Caty Caty, Jess, Carla 1 Caty
CodePudding user response:
Was partway through writing a tidyverse
answer when other answer was posted, which is an excellent and clear solution. For completeness, if you want to do in tidyverse this could be a helpful start:
library(tidyverse)
dt |>
mutate(names_Y = str_split(names_Y, ", "),
names_X = str_split(names_X, ", ")) |>
rowwise() |>
mutate(
count = sum(names_X %in% names_Y),
names_match = paste(intersect(names_X, names_Y), collapse = ", ")
)
#> # A tibble: 4 × 5
#> # Rowwise:
#> group names_X names_Y count names_match
#> <chr> <list> <list> <int> <chr>
#> 1 A <chr [3]> <chr [2]> 1 "Robert"
#> 2 A <chr [3]> <chr [2]> 0 ""
#> 3 B <chr [2]> <chr [2]> 2 "Lucy, Macy"
#> 4 B <chr [2]> <chr [3]> 1 "Caty"