Home > OS >  compare data between two consecutive transfers and identify which variable has been updated
compare data between two consecutive transfers and identify which variable has been updated

Time:01-14

I have two data frames that I am left joining. Then I want to compare multiple set of columns to see if they are equal. Little background: Every time I refresh data, I want to compare same fields between current data and previous data. If there is any change in data, I want to know exactly which column has changed and create a kind of list of columns that got changed.

Ex.

dset <- data.frame(var1= c(1:3),
                   var2 =c("a123","b123","c123"),
                   var3 =c("x123","y123","z123"))

dset2 <- data.frame(var1 = c(1:3),
            var2 =c("A123","B123","c123"),
            var3 =c("x123","Y123","Z123"))

dset <- dset %>% 
  left_join(dset2, by = c("var1")) %>% 
  mutate(change = paste0(if_else(var2.x == var2.y, " ", "var2; "), 
                         if_else(var3.x == var3.y, " ", "var3")
                         ))

I was able to do it using tens of if_else but I was curious as to whether or not there is a smarter and dynamic process to do it.

  var1 var2.x var3.x var2.y var3.y    change
1    1   a123   x123   A123   x123    var2;  
2    2   b123   y123   B123   Y123    var2; var3
3    3   c123   z123   c123   Z123    var3

As I said, I did it with the help of if_else but there has to be an easier way to do it. I have like 20 variables to be compared so 20 lines of if_else

CodePudding user response:

Here is one base R method -

#List the columns to compare
fields_to_compare <- paste0('var', 2:3)
#If there is no pattern in column names to auto generate them like above
#you need to manually list them one by one. 
#fields_to_compare <- c('var2', 'var3')

#Do a join between the two datasets with "var1" as key
out <- merge(dset, dset2, by = 'var1', all.x = TRUE)

#create two sets of columns to compare
fields1 <- paste0(fields_to_compare, '.x') 
fields2 <- paste0(fields_to_compare, '.y') 

#compare the two sets of columns element wise and assign the column names 
#where the value is different. 
out$change <- apply(out[fields1] != out[fields2], 1, 
      function(x) paste0(fields_to_compare[x], collapse = ';'))

out

#  var1 var2.x var3.x var2.y var3.y    change
#1    1   a123   x123   A123   x123      var2
#2    2   b123   y123   B123   Y123 var2;var3
#3    3   c123   z123   c123   Z123      var3

CodePudding user response:

With the tidyverse route, we may loop across the .x columns, get the value of the corresponding .y columns by replacing the column name (cur_column()) .x substring with .y and use either cur_data() or pick from devel version to extract the .y column and apply the ifelse/case_when to create new columns and then use unite to paste those new columns

library(dplyr)
library(tidyr)
library(stringr)
dset %>% 
  left_join(dset2, by = "var1") %>% 
   mutate(across(ends_with(".x"), 
   # or in devel version
   # ~ case_when(.x != pick(str_replace(cur_column(), fixed(".x"), ".y"))[[1]]  
   #~ str_remove(cur_column(), fixed(".x"))), .names = "tmp_{.col}"))
   ~ case_when(.x != cur_data()[[(str_replace(cur_column(), fixed(".x"), 
     ".y"))]]~ str_remove(cur_column(), fixed(".x"))),
      .names = "tmp_{.col}")) %>% 
  unite(change, starts_with("tmp_"), sep = "; ", na.rm = TRUE)

-output

  var1 var2.x var3.x var2.y var3.y     change
1    1   a123   x123   A123   x123       var2
2    2   b123   y123   B123   Y123 var2; var3
3    3   c123   z123   c123   Z123       var3
  • Related