How to find the difference between two columns of strings

I have a df as below, and I would like to get, for each ID, what is in Subject1 but not in Subject2, and what is in Subject2 but not in Subject1. Any suggestions?

df <- structure(list(ID = c("Tom", "Jerry", "Marry"), Subject1 = c("Art; Math", 
"ELA;Math", "PE; Math; ELA"), Subject2 = c("Math; PE", "Math; ELA", 
"Math; Bio")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", 
"data.frame"))

CodePudding user response:

We could split both columns on the separator and use map2 to take the set difference (setdiff) between the two resulting list columns, once in each direction.

library(dplyr)
library(purrr)
library(tidyr)
library(stringr)
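df %>% 
  # split both columns on ";" plus optional whitespace, then compare row by row
  mutate(In = map2(
             strsplit(Subject1, ";\\s*"),
             strsplit(Subject2, ";\\s*"),
    # one-row tibble per ID: items only in Subject1, items only in Subject2
    ~ tibble(`_1_notin_2` = str_c(setdiff(.x, .y), collapse = "; "), 
       `_2_notin_1` = str_c(setdiff(.y, .x), collapse = "; ")))) %>% 
  # spread the tibble's columns out, prefixing each name with "In"
  unnest_wider(In, names_sep = "")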

Output:

# A tibble: 3 × 5
  ID    Subject1      Subject2  In_1_notin_2 In_2_notin_1
  <chr> <chr>         <chr>     <chr>        <chr>       
1 Tom   Art; Math     Math; PE  "Art"        "PE"        
2 Jerry ELA;Math      Math; ELA ""           ""          
3 Marry PE; Math; ELA Math; Bio "PE; ELA"    "Bio"
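
To see what the split and setdiff steps do, here is a rough base R sketch for a single row (Marry's values from the question). The variable names s1, s2, only_in_1 and only_in_2 are just for illustration and are not part of the answer above.

```r
# One row of the question's data: Marry's two subject strings.
# ";\\s*" matches a semicolon followed by optional whitespace,
# so "ELA;Math" and "ELA; Math" would split the same way.
s1 <- strsplit("PE; Math; ELA", ";\\s*")[[1]]
s2 <- strsplit("Math; Bio", ";\\s*")[[1]]

# setdiff() is not symmetric: the order of the arguments matters
only_in_1 <- setdiff(s1, s2)  # in Subject1 but not in Subject2
only_in_2 <- setdiff(s2, s1)  # in Subject2 but not in Subject1

# collapse each result back into one "; "-separated string
paste(only_in_1, collapse = "; ")
paste(only_in_2, collapse = "; ")
```

When the two sets are equal (Jerry's row), setdiff returns an empty character vector, and paste with collapse turns that into an empty string, which is why that row shows "".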

CodePudding user response:

It could also be a use-case for rowwise():

library(dplyr)

df |>
  rowwise() |>
  # split each Subject column into a list column of individual subjects
  mutate(across(starts_with("S"), ~ strsplit(., ";\\s*")),
         In_1_notin_2 = paste(setdiff(Subject1, Subject2), collapse = "; "),
         In_2_notin_1 = paste(setdiff(Subject2, Subject1), collapse = "; "),
         # paste the Subject columns back into "; "-separated strings
         across(starts_with("S"), ~ paste(., collapse = "; "))) |>
  ungroup()

Output:

# A tibble: 3 × 5
  ID    Subject1      Subject2  In_1_notin_2 In_2_notin_1
  <chr> <chr>         <chr>     <chr>        <chr>       
1 Tom   Art; Math     Math; PE  "Art"        "PE"        
2 Jerry ELA; Math     Math; ELA ""           ""          
3 Marry PE; Math; ELA Math; Bio "PE; ELA"    "Bio"
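
If you would rather avoid extra packages, the same idea can probably be written in base R. This is only a sketch: diff_subjects is a made-up helper name, and the data is rebuilt as a plain data.frame rather than the tibble from the question.

```r
# Same data as in the question, as a plain data.frame
df <- data.frame(
  ID = c("Tom", "Jerry", "Marry"),
  Subject1 = c("Art; Math", "ELA;Math", "PE; Math; ELA"),
  Subject2 = c("Math; PE", "Math; ELA", "Math; Bio")
)

# Hypothetical helper: for each row, the items of `a` that are missing from `b`
diff_subjects <- function(a, b, sep = ";\\s*") {
  mapply(
    function(x, y) paste(setdiff(x, y), collapse = "; "),
    strsplit(a, sep), strsplit(b, sep),
    USE.NAMES = FALSE
  )
}

# Call it once per direction
df$In_1_notin_2 <- diff_subjects(df$Subject1, df$Subject2)
df$In_2_notin_1 <- diff_subjects(df$Subject2, df$Subject1)
df
```

Unlike the rowwise() version, this leaves Subject1 and Subject2 untouched, so Jerry's "ELA;Math" keeps its original spacing.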