I am trying to find the elements that occur in only one of two columns of a data frame and write them to a new data frame. The code below works and produces the result I expect.
set.seed(42)
df <- data.frame(a = sample(1:15, 10),
                 b = sample(1:15, 10))
unique_to_a <- df$a[!(df$a %in% df$b)]
unique_to_b <- df$b[!(df$b %in% df$a)]
n <- max(c(unique_to_a, unique_to_b))
out <- data.frame(A = rep(NA, n), B = rep(NA, n))
for (element in unique_to_a) {
  out[element, "A"] <- element
}
for (element in unique_to_b) {
  out[element, "B"] <- element
}
out
The problem is that it is very slow, because the real data contains hundreds of thousands of rows. I am fairly sure the cause is the repeated indexing inside the for loops, and that there is a quicker, vectorized way, but I don't see it.
Any ideas on how to speed up the operation are much appreciated. Cheers!
CodePudding user response:
I haven't compared the speed, but at least this is more concise (note that the \(x) function shorthand requires R 4.1.0 or later):
elements <- with(df, list(setdiff(a, b), setdiff(b, a)))
data.frame(sapply(elements, \(x) replace(rep(NA, max(unlist(elements))), x, x)))
# X1 X2
# 1 NA NA
# 2 NA NA
# 3 NA 3
# 4 NA NA
# 5 NA NA
# 6 NA NA
# 7 NA NA
# 8 NA NA
# 9 NA NA
# 10 NA NA
# 11 11 NA
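The two loops in the question can also be replaced by one indexed assignment per column, which is the usual vectorized idiom in base R. This is a minimal sketch, not a benchmarked solution. It assumes the values are positive integers (they are used as row indices), and it hard-codes the data frame printed at the end of this thread so the result does not depend on the R version's sampling.

```r
# Data as printed at the end of this thread (hard-coded here so the example
# does not depend on how sample() behaves in a given R version).
df <- data.frame(a = c(1L, 5L, 15L, 9L, 10L, 4L, 2L, 12L, 13L, 11L),
                 b = c(9L, 5L, 4L, 10L, 2L, 3L, 15L, 1L, 12L, 13L))

# Values that occur in only one of the two columns.
unique_to_a <- setdiff(df$a, df$b)
unique_to_b <- setdiff(df$b, df$a)

# One row per possible value, up to the largest unique value.
n <- max(c(unique_to_a, unique_to_b))
out <- data.frame(A = rep(NA_integer_, n), B = rep(NA_integer_, n))

# Vectorized replacement for the loops: each value is written to the row
# whose index equals the value itself.
out$A[unique_to_a] <- unique_to_a
out$B[unique_to_b] <- unique_to_b

out
```

If neither column has a value of its own, max() is called on an empty vector, so that case would need a guard before building the output.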
CodePudding user response:
Please find here a solution with the data.table package.
Reprex
- Code
library(data.table)
# 1. Use all the cores of the processor to optimize the processing time
setDTthreads(threads = 0)
getDTthreads() # in my case, the processor has 4 threads
#> [1] 4
# 2. Keep each value in its original row only if it is absent from the other column
#    (note: setDT() converts df to a data.table by reference)
setDT(df)[, .(A = fifelse(a %in% b, NA_integer_, a), B = fifelse(b %in% a, NA_integer_, b))]
- Output
#> A B
#> 1: NA NA
#> 2: NA NA
#> 3: NA NA
#> 4: NA NA
#> 5: NA NA
#> 6: NA 3
#> 7: NA NA
#> 8: NA NA
#> 9: NA NA
#> 10: 11 NA
Created on 2021-10-29 by the reprex package (v0.3.0)
PS: following the exchange with @sindri_baldur (see below), here is the original data frame "df" as it is generated on my computer with R 4.0.2. As you can see, the number 3 is in the sixth row of column b, not in row 3, which explains why it appears in row 6 of the output above rather than in row 3.
set.seed(42)
df <- data.frame(a = sample(1:15, 10),
                 b = sample(1:15, 10))
df
#> a b
#> 1 1 9
#> 2 5 5
#> 3 15 4
#> 4 9 10
#> 5 10 2
#> 6 4 3
#> 7 2 15
#> 8 12 1
#> 9 13 12
#> 10 11 13
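To make the difference between the two layouts concrete, here is a small sketch that builds the row-aligned result and then re-indexes it by value, as the question does. It is only an illustration: base R ifelse() stands in for fifelse() so that no package is needed, and the names row_aligned, vals_a, vals_b and value_indexed are hypothetical.

```r
# Data frame exactly as printed above.
df <- data.frame(a = c(1L, 5L, 15L, 9L, 10L, 4L, 2L, 12L, 13L, 11L),
                 b = c(9L, 5L, 4L, 10L, 2L, 3L, 15L, 1L, 12L, 13L))

# Row-aligned layout: a value stays in its original row only if it does not
# occur in the other column (base R ifelse() used in place of fifelse()).
row_aligned <- data.frame(A = ifelse(df$a %in% df$b, NA_integer_, df$a),
                          B = ifelse(df$b %in% df$a, NA_integer_, df$b))

# Row positions of the surviving values: these are positions, not values.
which(!is.na(row_aligned$A))
which(!is.na(row_aligned$B))

# Value-indexed layout, as in the question: row i holds the value i.
vals_a <- row_aligned$A[!is.na(row_aligned$A)]
vals_b <- row_aligned$B[!is.na(row_aligned$B)]
n <- max(c(vals_a, vals_b))
value_indexed <- data.frame(A = rep(NA_integer_, n), B = rep(NA_integer_, n))
value_indexed$A[vals_a] <- vals_a
value_indexed$B[vals_b] <- vals_b
```

Both layouts contain the same unique values; which one is appropriate depends on whether the output should line up with the input rows or with the values themselves.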