I want to compare a set of columns and identify the/or several differences between the columns and a reference. My data looks something like this toy data:
df <- data.frame(id = c(1:5),
var1 = c("A","A","A","A","B"),
var2 = c(10,20,10,10,10),
var3 = c("A2", "A2", "A3", "A2", "A2"),
var4 = c("B2", "B2", "B2", "B3", "B2"),
var5 = c("C2", "C2", "C2", "C2", "C4"))
This gives the following dataframe:
id var1 var2 var3 var4 var5
1 1 A 10 A2 B2 C2
2 2 A 20 A2 B2 C2
3 3 A 10 A3 B2 C2
4 4 A 10 A2 B3 C2
5 5 B 10 A2 B2 C4
I also have the following
I want to compare the sequence of V1
,V2
,V3
,V4
and V5
with a reference sequence (found in the vector ref
, see below) and (1) create a new column with the part (column or vector element) of the sequence that differs (2) create a new column indicating in which column the difference was identified.
# Same as the first row of df
ref <- c("A", 10, "A2", "B2", "C2")
The expected output should be:
id diff which
1 1 <NA> <NA>
2 2 20 var2
3 3 A3 var3
4 4 B3 var4
5 5 C4 var5
Hence, with this output I can see that for observation with id == 2
, the difference between the reference-point and the row-sequence was found in var2
and was 20
.
Does anyone know how to do this?
CodePudding user response:
in base R:
l <- apply(df[-1], 1, function(x) x[x != ref])
data.frame(id = 1:nrow(df),
diff = sapply(l, toString),
which = sapply(l, function(x) toString(names(x))))
id diff which
1 1
2 2 20 var2
3 3 A3 var3
4 4 B3 var4
5 5 B, C4 var1, var5
CodePudding user response:
We could use tidyverse
library(dplyr)
library(tidyr)
df %>%
mutate(across(-id, as.character)) %>%
pivot_longer(cols = -id, names_to = 'which', values_to = 'diff') %>%
group_by(id) %>%
filter(ref != diff) %>%
summarise(across(which:diff, toString)) %>%
complete(id = df$id)
-output
# A tibble: 5 × 3
id which diff
<int> <chr> <chr>
1 1 <NA> <NA>
2 2 var2 20
3 3 var3 A3
4 4 var4 B3
5 5 var1, var5 B, C4