I have two data frames, df1
and df2
of identical structure. The first three columns, id
, form
, and instance
identify the participant and form(s). The remaining variable columns, var1
, var2
, and var3
, contain analytic data and are largely identical except for a few slight discrepancies.
I have a third data frame, map
that identifies the id, form(s), and variables that have discrepancies, but does not contain the values of these discrepancies. I would like to use the map
data to create a final data frame, df.final
that appends two columns and places the discrepant values from df1
and df2
.
Below I am providing sample data as well as a clunky for
loop that creates the desired df.final
, but it is very slow (takes several hours to run on the full dataset) - so slow in fact it is functionally unusable (this ideally needs to be updated in near real time). I am hoping someone better at coding than me can provide alternative, faster solutions.
(Note that given the simplicity of the example data, there are alternate ways to compare df1
and df2
, but assume these don't work and using map
is the only option.)
# Example data, df1
df1 <- data.frame(id = rep(sprintf("K00%s", 0:9), each = 3),
form = rep(c("A","B", "B"), times = 10),
instance = rep(c("None", "1", "2"), times = 10),
var1 = sample(LETTERS, 30, replace = TRUE),
var2 = rnbinom(30, mu = 1, size = 0.02),
var3 = sample(c("Apples", "Oranges", "Pears"), 30, replace = TRUE))
# Sample data df2, same as df1 but with slight discrepancies
df2 <- df1
df2[15, 4] <- "A"
df2[c(4, 6, 8), 5] <- c(11,15,16)
df2[27:28, 6] <- "Bannanas"
# Example "Map" that only indicates what ID, form, instance, and variable is discrepant
map <- data.frame(id = c("K004", "K001", "K001", "K002", "K008", "K009"),
form = c("B","A", "B", "B", "B", "A"),
instance = c("2", "None", "2", "1", "2", "None"),
variable = c("var1", rep("var2", 3), "var3", "var3"))
# id form instance variable
# 1 K004 B 2 var1
# 2 K001 A None var2
# 3 K001 B 2 var2
# 4 K002 B 1 var2
# 5 K008 B 2 var3
# 6 K009 A None var3
## - - - - - - - - - - - - - - - - - - -
# Attempt, but VERY slow in full data
## - - - - - - - - - - - - - - - - - - -
df.final <- data.frame(matrix(NA, ncol = 6))
for (i in 1:nrow(map)){
keepcols <- c("id", "form","instance", map[i,4])
m1 <- merge(map[i,], df1[ , keepcols], by = keepcols[-4])
m2 <- merge(m1, df2[, keepcols], by = keepcols[-4])
df.final[i,] <- m2
}
names(df.final) <- c(names(map), "df1_entry","df2_entry")
df.final
# id form instance variable df1_entry df2_entry
# 1 K004 B 2 var1 P A
# 2 K001 A None var2 0 11
# 3 K001 B 2 var2 0 15
# 4 K002 B 1 var2 0 16
# 5 K008 B 2 var3 Pears Bannanas
# 6 K009 A None var3 Apples Bannanas
CodePudding user response:
I didn't use your map
frame, so my apologies if I have not answered your question, but I didn't think you needed it. I believe this does what you want:
inner_join(
df1 %>% mutate(across(starts_with("var"),as.character)) %>% pivot_longer(cols=var1:var3,values_to = "df1"),
df2 %>% mutate(across(starts_with("var"),as.character)) %>% pivot_longer(cols=var1:var3, values_to="df2"),
on=c("id","form","instance","name")
) %>% filter(df1!=df2)
Output:
id form instance name df1 df2
<chr> <chr> <chr> <chr> <chr> <chr>
1 K001 A None var2 0 11
2 K001 B 2 var2 0 15
3 K002 B 1 var2 0 16
4 K004 B 2 var1 J A
5 K008 B 2 var3 Apples Bannanas
6 K009 A None var3 Apples Bannanas
If you absolutely have to use map
, but you have access to df1
and df2
, you can do this:
map %>%
inner_join(
df1 %>%
mutate(across(starts_with("var"),as.character)) %>%
pivot_longer(cols=var1:var3,values_to = "df1",names_to="variable"),
by=c("id","form","instance","variable")
) %>%
inner_join(
df2 %>%
mutate(across(starts_with("var"),as.character)) %>%
pivot_longer(cols=var1:var3,values_to = "df2",names_to="variable"),
by=c("id","form","instance","variable")
)
)
CodePudding user response:
In case this might be useful, you can do the following in data.table
:
library(data.table)
setDT(df1)
setDT(df2)
merge(
melt(df1, 1:3),
melt(df2, 1:3),
by = c("id", "form", "instance", "variable")
)[value.x != value.y]
Output
id form instance variable value.x value.y
1: K001 A None var2 1 11
2: K001 B 2 var2 22 15
3: K002 B 1 var2 0 16
4: K004 B 2 var1 S A
5: K008 B 2 var3 Oranges Bannanas
6: K009 A None var3 Pears Bannanas