Home > Net >  Map discrepant variables between two data frames using a third data frame
Map discrepant variables between two data frames using a third data frame

Time:02-26

I have two data frames, df1 and df2 of identical structure. The first three columns, id, form, and instance identify the participant and form(s). The remaining variable columns, var1, var2, and var3, contain analytic data and are largely identical except for a few slight discrepancies.

I have a third data frame, map that identifies the id, form(s), and variables that have discrepancies, but does not contain the values of these discrepancies. I would like to use the map data to create a final data frame, df.final that appends two columns and places the discrepant values from df1 and df2.

Below I am providing sample data as well as a clunky for loop that creates the desired df.final, but it is very slow (takes several hours to run on the full dataset) - so slow in fact it is functionally unusable (this ideally needs to be updated in near real time). I am hoping someone better at coding than me can provide alternative, faster solutions.

(Note that given the simplicity of the example data, there are alternate ways to compare df1 and df2, but assume these don't work and using map is the only option.)

# Example data, df1
df1 <- data.frame(id = rep(sprintf("K00%s", 0:9), each = 3),
                  form = rep(c("A","B", "B"), times = 10),
                  instance = rep(c("None", "1", "2"), times = 10),
                  var1 = sample(LETTERS, 30, replace = TRUE),
                  var2 = rnbinom(30, mu = 1, size = 0.02),
                  var3 = sample(c("Apples", "Oranges", "Pears"), 30, replace = TRUE))

# Sample data df2, same as df1 but with slight discrepancies
df2 <- df1
df2[15, 4] <- "A"
df2[c(4, 6, 8), 5] <- c(11,15,16)
df2[27:28, 6] <- "Bannanas"

# Example "Map" that only indicates what ID, form, instance, and variable is discrepant
map <- data.frame(id = c("K004", "K001", "K001", "K002", "K008", "K009"),
                  form = c("B","A", "B", "B", "B", "A"),
                  instance = c("2", "None", "2", "1", "2", "None"),
                  variable = c("var1", rep("var2", 3), "var3", "var3"))

# id form instance variable
# 1 K004    B        2     var1
# 2 K001    A     None     var2
# 3 K001    B        2     var2
# 4 K002    B        1     var2
# 5 K008    B        2     var3
# 6 K009    A     None     var3

## - - - - - - - - - - - - - - - - - - - 
# Attempt, but VERY slow in full data
## - - - - - - - - - - - - - - - - - - - 

df.final <- data.frame(matrix(NA, ncol = 6))

for (i in 1:nrow(map)){
  keepcols <- c("id", "form","instance", map[i,4])
  m1 <- merge(map[i,], df1[ , keepcols], by = keepcols[-4])
  m2 <- merge(m1, df2[, keepcols], by = keepcols[-4])
  df.final[i,] <- m2
}
names(df.final) <- c(names(map), "df1_entry","df2_entry")
df.final

#   id     form instance variable      df1_entry df2_entry
# 1 K004    B        2     var1         P         A
# 2 K001    A     None     var2         0        11
# 3 K001    B        2     var2         0        15
# 4 K002    B        1     var2         0        16
# 5 K008    B        2     var3     Pears  Bannanas
# 6 K009    A     None     var3    Apples  Bannanas

CodePudding user response:

I didn't use your map frame, so my apologies if I have not answered your question, but I didn't think you needed it. I believe this does what you want:

inner_join(
  df1 %>% mutate(across(starts_with("var"),as.character)) %>% pivot_longer(cols=var1:var3,values_to = "df1"),
  df2 %>% mutate(across(starts_with("var"),as.character)) %>% pivot_longer(cols=var1:var3, values_to="df2"),
  on=c("id","form","instance","name")
) %>% filter(df1!=df2)

Output:

  id    form  instance name  df1    df2     
  <chr> <chr> <chr>    <chr> <chr>  <chr>   
1 K001  A     None     var2  0      11      
2 K001  B     2        var2  0      15      
3 K002  B     1        var2  0      16      
4 K004  B     2        var1  J      A       
5 K008  B     2        var3  Apples Bannanas
6 K009  A     None     var3  Apples Bannanas

If you absolutely have to use map, but you have access to df1 and df2, you can do this:

map %>%
  inner_join(
    df1 %>%
      mutate(across(starts_with("var"),as.character)) %>%
      pivot_longer(cols=var1:var3,values_to = "df1",names_to="variable"),
    by=c("id","form","instance","variable")
  ) %>% 
  inner_join(
    df2 %>%
      mutate(across(starts_with("var"),as.character)) %>%
      pivot_longer(cols=var1:var3,values_to = "df2",names_to="variable"),
    by=c("id","form","instance","variable")
  )
)

CodePudding user response:

In case this might be useful, you can do the following in data.table:

library(data.table)

setDT(df1)
setDT(df2)

merge(
  melt(df1, 1:3), 
  melt(df2, 1:3), 
  by = c("id", "form", "instance", "variable")
)[value.x != value.y]

Output

     id form instance variable value.x  value.y
1: K001    A     None     var2       1       11
2: K001    B        2     var2      22       15
3: K002    B        1     var2       0       16
4: K004    B        2     var1       S        A
5: K008    B        2     var3 Oranges Bannanas
6: K009    A     None     var3   Pears Bannanas
  •  Tags:  
  • r
  • Related