compare data between two consecutive transfers and identify which variable has been updated-CodePudding

I have two data frames that I am left joining. Then I want to compare multiple set of columns to see if they are equal. Little background: Every time I refresh data, I want to compare same fields between current data and previous data. If there is any change in data, I want to know exactly which column has changed and create a kind of list of columns that got changed.

Ex.

dset <- data.frame(var1= c(1:3),
                   var2 =c("a123","b123","c123"),
                   var3 =c("x123","y123","z123"))

dset2 <- data.frame(var1 = c(1:3),
            var2 =c("A123","B123","c123"),
            var3 =c("x123","Y123","Z123"))

dset <- dset %>% 
  left_join(dset2, by = c("var1")) %>% 
  mutate(change = paste0(if_else(var2.x == var2.y, " ", "var2; "), 
                         if_else(var3.x == var3.y, " ", "var3")
                         ))

I was able to do it using tens of if_else but I was curious as to whether or not there is a smarter and dynamic process to do it.

  var1 var2.x var3.x var2.y var3.y    change
1    1   a123   x123   A123   x123    var2;  
2    2   b123   y123   B123   Y123    var2; var3
3    3   c123   z123   c123   Z123    var3

As I said, I did it with the help of if_else but there has to be an easier way to do it. I have like 20 variables to be compared so 20 lines of if_else

CodePudding user response：

Here is one base R method -

#List the columns to compare
fields_to_compare <- paste0('var', 2:3)
#If there is no pattern in column names to auto generate them like above
#you need to manually list them one by one. 
#fields_to_compare <- c('var2', 'var3')

#Do a join between the two datasets with "var1" as key
out <- merge(dset, dset2, by = 'var1', all.x = TRUE)

#create two sets of columns to compare
fields1 <- paste0(fields_to_compare, '.x') 
fields2 <- paste0(fields_to_compare, '.y') 

#compare the two sets of columns element wise and assign the column names 
#where the value is different. 
out$change <- apply(out[fields1] != out[fields2], 1, 
      function(x) paste0(fields_to_compare[x], collapse = ';'))

out

#  var1 var2.x var3.x var2.y var3.y    change
#1    1   a123   x123   A123   x123      var2
#2    2   b123   y123   B123   Y123 var2;var3
#3    3   c123   z123   c123   Z123      var3

CodePudding user response：

With the tidyverse route, we may loop across the .x columns, get the value of the corresponding .y columns by replacing the column name (cur_column()) .x substring with .y and use either cur_data() or pick from devel version to extract the .y column and apply the ifelse/case_when to create new columns and then use unite to paste those new columns

library(dplyr)
library(tidyr)
library(stringr)
dset %>% 
  left_join(dset2, by = "var1") %>% 
   mutate(across(ends_with(".x"), 
   # or in devel version
   # ~ case_when(.x != pick(str_replace(cur_column(), fixed(".x"), ".y"))[[1]]  
   #~ str_remove(cur_column(), fixed(".x"))), .names = "tmp_{.col}"))
   ~ case_when(.x != cur_data()[[(str_replace(cur_column(), fixed(".x"), 
     ".y"))]]~ str_remove(cur_column(), fixed(".x"))),
      .names = "tmp_{.col}")) %>% 
  unite(change, starts_with("tmp_"), sep = "; ", na.rm = TRUE)

-output

  var1 var2.x var3.x var2.y var3.y     change
1    1   a123   x123   A123   x123       var2
2    2   b123   y123   B123   Y123 var2; var3
3    3   c123   z123   c123   Z123       var3