I have two data frames that I am left joining. Then I want to compare multiple set of columns to see if they are equal. Little background: Every time I refresh data, I want to compare same fields between current data and previous data. If there is any change in data, I want to know exactly which column has changed and create a kind of list of columns that got changed.
Ex.
dset <- data.frame(var1= c(1:3),
var2 =c("a123","b123","c123"),
var3 =c("x123","y123","z123"))
dset2 <- data.frame(var1 = c(1:3),
var2 =c("A123","B123","c123"),
var3 =c("x123","Y123","Z123"))
dset <- dset %>%
left_join(dset2, by = c("var1")) %>%
mutate(change = paste0(if_else(var2.x == var2.y, " ", "var2; "),
if_else(var3.x == var3.y, " ", "var3")
))
I was able to do it using tens of if_else
but I was curious as to whether or not there is a smarter and dynamic process to do it.
var1 var2.x var3.x var2.y var3.y change
1 1 a123 x123 A123 x123 var2;
2 2 b123 y123 B123 Y123 var2; var3
3 3 c123 z123 c123 Z123 var3
As I said, I did it with the help of if_else
but there has to be an easier way to do it. I have like 20 variables to be compared so 20 lines of if_else
CodePudding user response:
Here is one base R method -
#List the columns to compare
fields_to_compare <- paste0('var', 2:3)
#If there is no pattern in column names to auto generate them like above
#you need to manually list them one by one.
#fields_to_compare <- c('var2', 'var3')
#Do a join between the two datasets with "var1" as key
out <- merge(dset, dset2, by = 'var1', all.x = TRUE)
#create two sets of columns to compare
fields1 <- paste0(fields_to_compare, '.x')
fields2 <- paste0(fields_to_compare, '.y')
#compare the two sets of columns element wise and assign the column names
#where the value is different.
out$change <- apply(out[fields1] != out[fields2], 1,
function(x) paste0(fields_to_compare[x], collapse = ';'))
out
# var1 var2.x var3.x var2.y var3.y change
#1 1 a123 x123 A123 x123 var2
#2 2 b123 y123 B123 Y123 var2;var3
#3 3 c123 z123 c123 Z123 var3
CodePudding user response:
With the tidyverse
route, we may loop across
the .x
columns, get the value of the corresponding .y
columns by replacing the column name (cur_column()
) .x
substring with .y
and use either cur_data()
or pick
from devel version to extract the .y
column and apply the ifelse/case_when
to create new columns and then use unite
to paste
those new columns
library(dplyr)
library(tidyr)
library(stringr)
dset %>%
left_join(dset2, by = "var1") %>%
mutate(across(ends_with(".x"),
# or in devel version
# ~ case_when(.x != pick(str_replace(cur_column(), fixed(".x"), ".y"))[[1]]
#~ str_remove(cur_column(), fixed(".x"))), .names = "tmp_{.col}"))
~ case_when(.x != cur_data()[[(str_replace(cur_column(), fixed(".x"),
".y"))]]~ str_remove(cur_column(), fixed(".x"))),
.names = "tmp_{.col}")) %>%
unite(change, starts_with("tmp_"), sep = "; ", na.rm = TRUE)
-output
var1 var2.x var3.x var2.y var3.y change
1 1 a123 x123 A123 x123 var2
2 2 b123 y123 B123 Y123 var2; var3
3 3 c123 z123 c123 Z123 var3