Home > Back-end >  count matches between two columns containing multiple elements in each row
count matches between two columns containing multiple elements in each row

Time:01-27

I want to count how many times the names in column y match those of column x. Trick is there are multiple names within each cell and there is no pattern so I cannot simply use str_count.

This is the example and expected result or see here reproducible code:

dt <- 
  data.frame(group = c('A', 'A', 'B', 'B'), names_X = c('James, Robert, Phill', 'James, Blake, Paul', 'Lucy, Macy', 'Lucy, Caty'),
           names_Y = c('Robert, Peter', 'Robert, Peter', 'Macy, Lucy', 'Caty, Jess, Carla'))
    dt %>% 
      group_by(group) %>% 
      mutate(matches = str_count(names_X, names_Y))

for which this is the output

structure(list(group = c("A", "A", "B", "B"), names_X = c("James, Robert, Phill", 
"James, Blake, Paul", "Lucy, Macy", "Lucy, Caty"), names_Y = c("Robert, Peter", 
"Robert, Peter", "Macy, Lucy", "Caty, Jess, Carla"), matches = c(0L, 
0L, 0L, 0L)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4L), groups = structure(list(group = c("A", 
"B"), .rows = structure(list(1:2, 3:4), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -2L), .drop = TRUE))

but the result for #matches remains 0 in all cases which is wrong.

See correct answer in picture at the top of page.

CodePudding user response:

dt$names_X <- strsplit(dt$names_X, ", ")
dt$names_Y <- strsplit(dt$names_Y, ", ")

dt$count <- mapply(\(x, y) sum(x %in% y), dt$names_X, dt$names_Y) 
dt$matching_names <-  mapply(
  \(x, y) if (any(x %in% y)) toString(x[x %in% y]) else NA, 
  dt$names_X, 
  dt$names_Y
) 
#   group              names_X           names_Y count matching_names
# 1     A James, Robert, Phill     Robert, Peter     1         Robert
# 2     A   James, Blake, Paul     Robert, Peter     0           <NA>
# 3     B           Lucy, Macy        Macy, Lucy     2     Lucy, Macy
# 4     B           Lucy, Caty Caty, Jess, Carla     1           Caty

CodePudding user response:

Was partway through writing a tidyverse answer when other answer was posted, which is an excellent and clear solution. For completeness, if you want to do in tidyverse this could be a helpful start:

library(tidyverse)

dt |>
  mutate(names_Y = str_split(names_Y, ", "),
         names_X = str_split(names_X, ", ")) |>
  rowwise() |>
  mutate(
    count = sum(names_X %in% names_Y),
    names_match = paste(intersect(names_X, names_Y), collapse = ", ")
  )
#> # A tibble: 4 × 5
#> # Rowwise: 
#>   group names_X   names_Y   count names_match 
#>   <chr> <list>    <list>    <int> <chr>       
#> 1 A     <chr [3]> <chr [2]>     1 "Robert"    
#> 2 A     <chr [3]> <chr [2]>     0 ""          
#> 3 B     <chr [2]> <chr [2]>     2 "Lucy, Macy"
#> 4 B     <chr [2]> <chr [3]>     1 "Caty"
  • Related