I have a data that contains two sets of variables, and I want to compare whether the two sets have the same value. In each line of the two sets of variables, mark it as 1 whenever the same value exists, and 0 otherwise. To illustrate the problem, I generate data a1. I want to determine whether any values in the first group of variables (z1 and x1) and the second group of variables (z2 and x2) are same and generate the variable result.
a1=data.table(z1=c(1:5),x1=c(3:7),z2=c(2:6),x2=c(3,5,4,7,5))
a1$result=c(1,0,0,0,1)
The actual data is close to 20 million lines, and there are many variables in each group. I want to find the most efficient method. Thanks a lot!
CodePudding user response:
We may loop over the rows, find the length
of the intersect
between the pairs and convert to logical
library(data.table)
a1[, result := (apply(.SD, 1, FUN = function(x)
length(intersect(x[1:2], x[3:4]))) > 0)]
-output
> a1
z1 x1 z2 x2 result
1: 1 3 2 3 1
2: 2 4 3 5 0
3: 3 5 4 4 0
4: 4 6 5 7 0
5: 5 7 6 5 1
With respect to efficiency, dapply
(from collapse
) may be faster compared to apply
library(collapse)
a1[, result := dapply(.SD, MARGIN = 1, FUN = function(x)
length(intersect(x[1:2], x[3:4])))]
Or use a vectorized option with str_detect
library(stringr)
a1[, result := (str_detect(paste(z1, x1), paste0(z2, "|", x2)))]
CodePudding user response:
Here's another generalizable method relying on the column names for each group:
g1 = grep("1", names(a1), value = TRUE)
g2 = grep("2", names(a1), value = TRUE)
a1[, result := as.integer(
apply(.SD, MARGIN = 1, FUN = function(x) any(x[g1] %in% x[g2]))
), .SDcols = c(g1, g2)]
a1
# z1 x1 z2 x2 result
# 1: 1 3 2 3 1
# 2: 2 4 3 5 0
# 3: 3 5 4 4 0
# 4: 4 6 5 7 0
# 5: 5 7 6 5 1
CodePudding user response:
The actual data is close to 20 million lines, and there are many variables in each group. I want to find the most efficient method
You can transform to long form and join to see if there are any matches. I guess it's relatively fast.
# create a row id
a1[, row_id := .I]
# specify column groups
cols1 = c("x1", "z1")
cols2 = c("x2", "z2")
# transform to long form, drop colnames, drop dupes
longDT1 = unique(melt(a1[, c("row_id", ..cols1)], id.vars="row_id")[, !"variable"])
longDT2 = unique(melt(a1[, c("row_id", ..cols2)], id.vars="row_id")[, !"variable"])
# find any matches
w = longDT1[longDT2, on=.(row_id, value), which=TRUE, nomatch=0]
# find associated row_ids
match_row_ids = longDT1[w, unique(row_id)]
# flag rows
a1[, res := FALSE][match_row_ids, res := TRUE]