Compare the sequence of several columns and identify what differ between them and a reference point-CodePudding

I want to compare a set of columns and identify the/or several differences between the columns and a reference. My data looks something like this toy data:

df <- data.frame(id = c(1:5),
                 var1 = c("A","A","A","A","B"),
                 var2 = c(10,20,10,10,10),
                 var3 = c("A2", "A2", "A3", "A2", "A2"),
                 var4 = c("B2", "B2", "B2", "B3", "B2"),
                 var5 = c("C2", "C2", "C2", "C2", "C4"))

This gives the following dataframe:

  id var1 var2 var3 var4 var5
1  1    A   10   A2   B2   C2
2  2    A   20   A2   B2   C2
3  3    A   10   A3   B2   C2
4  4    A   10   A2   B3   C2
5  5    B   10   A2   B2   C4

I also have the following I want to compare the sequence of V1,V2,V3,V4 and V5 with a reference sequence (found in the vector ref, see below) and (1) create a new column with the part (column or vector element) of the sequence that differs (2) create a new column indicating in which column the difference was identified.

# Same as the first row of df
ref <- c("A", 10, "A2", "B2", "C2")

The expected output should be:

  id diff which
1  1 <NA>  <NA>
2  2   20  var2
3  3   A3  var3
4  4   B3  var4
5  5   C4  var5

Hence, with this output I can see that for observation with id == 2, the difference between the reference-point and the row-sequence was found in var2 and was 20.

Does anyone know how to do this?

CodePudding user response：

in base R:

l <- apply(df[-1], 1, function(x) x[x != ref])
data.frame(id = 1:nrow(df),
           diff = sapply(l, toString),
           which = sapply(l, function(x) toString(names(x))))

  id  diff      which
1  1                 
2  2    20       var2
3  3    A3       var3
4  4    B3       var4
5  5 B, C4 var1, var5

CodePudding user response：

We could use tidyverse

library(dplyr)
library(tidyr)
df %>%
   mutate(across(-id, as.character)) %>% 
   pivot_longer(cols = -id, names_to = 'which', values_to = 'diff') %>% 
   group_by(id) %>% 
   filter(ref != diff) %>% 
   summarise(across(which:diff, toString)) %>%
   complete(id = df$id)

-output

# A tibble: 5 × 3
     id which      diff 
  <int> <chr>      <chr>
1     1 <NA>       <NA> 
2     2 var2       20   
3     3 var3       A3   
4     4 var4       B3   
5     5 var1, var5 B, C4