Fast way to insert values in column of data frame in R

Time:10-30

I am trying to find the elements that are unique to each of two columns of a data frame (present in one column but not in the other) and write them to a new data frame. The code below works and produces the result I expect.

set.seed(42)
df <- data.frame(a = sample(1:15, 10),
                 b = sample(1:15, 10))

unique_to_a <- df$a[!(df$a %in% df$b)]
unique_to_b <- df$b[!(df$b %in% df$a)]

n <- max(c(unique_to_a, unique_to_b))

out <- data.frame(A = rep(NA, n), B = rep(NA, n))

for (element in unique_to_a) {
  out[element, "A"] <- element
}

for (element in unique_to_b) {
  out[element, "B"] <- element
}

out

The problem is that it is very slow, because the real data contains hundreds of thousands of rows. I am quite sure this is due to the repeated indexing inside the for loops, and I am sure there is a quicker, vectorized way, but I don't see it...

Any ideas on how to speed up the operation are much appreciated. Cheers!

CodePudding user response:

I didn't compare the speed, but at least this is more concise:

# Note: the \(x) lambda shorthand needs R >= 4.1.0; use function(x) on older versions
elements <- with(df, list(setdiff(a, b), setdiff(b, a)))
data.frame(sapply(elements, \(x) replace(rep(NA, max(unlist(elements))), x, x)))
#    X1 X2
# 1  NA NA
# 2  NA NA
# 3  NA  3
# 4  NA NA
# 5  NA NA
# 6  NA NA
# 7  NA NA
# 8  NA NA
# 9  NA NA
# 10 NA NA
# 11 11 NA
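
The two for loops from the question can also be replaced by one indexed assignment per column. The following is a minimal sketch, not benchmarked; it assumes the values are positive integers (they are used as row indices) and that at least one value is unique to a column. The data frame is hard-coded from the values printed at the end of this thread so that the example does not depend on the random number generator.

```r
# Data hard-coded from the df printed in this thread (assumption: same values)
df <- data.frame(a = c(1L, 5L, 15L, 9L, 10L, 4L, 2L, 12L, 13L, 11L),
                 b = c(9L, 5L, 4L, 10L, 2L, 3L, 15L, 1L, 12L, 13L))

# Values present in one column but not in the other
unique_to_a <- setdiff(df$a, df$b)
unique_to_b <- setdiff(df$b, df$a)

# Pre-allocate the result, one row per possible value up to the maximum
n <- max(unique_to_a, unique_to_b)
out <- data.frame(A = rep(NA_integer_, n), B = rep(NA_integer_, n))

# A single vectorized assignment per column replaces each loop:
# every value is written to the row whose index equals that value
out$A[unique_to_a] <- unique_to_a
out$B[unique_to_b] <- unique_to_b

out
```

Because each column is filled in one call instead of once per element, the data frame is not modified repeatedly, which is usually where the loop version spends its time.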

CodePudding user response:

Please find here a solution with the data.table package.

Reprex

  • Code
library(data.table)

# 1. Use all the cores of the processor to optimize the processing time
setDTthreads(threads = 0) 
getDTthreads() # in my case, the processor has 4 threads
#> [1] 4


# 2. Keep the values unique to each column; replace shared values with NA
setDT(df)[,.(A = fifelse(a %in% b, NA_integer_, a), B = fifelse(b %in% a, NA_integer_, b))]
  • Output
#>      A  B
#>  1: NA NA
#>  2: NA NA
#>  3: NA NA
#>  4: NA NA
#>  5: NA NA
#>  6: NA  3
#>  7: NA NA
#>  8: NA NA
#>  9: NA NA
#> 10: 11 NA

Created on 2021-10-29 by the reprex package (v0.3.0)


PS: Following the exchange with @sindri_baldur (see below), here is the original data frame "df" as it is generated on my computer with R 4.0.2. As you can see, the number 3 is in the sixth row of column b, not in row 3, which is why it appears in row 6 of the output above rather than in row 3.

set.seed(42)
df <- data.frame(a = sample(1:15, 10), 
                 b=sample(1:15, 10))
df
#>     a  b
#> 1   1  9
#> 2   5  5
#> 3  15  4
#> 4   9 10
#> 5  10  2
#> 6   4  3
#> 7   2 15
#> 8  12  1
#> 9  13 12
#> 10 11 13
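
To make the difference in row positions concrete, here is a small base R sketch (an illustration only, not the data.table code above). It hard-codes the values printed above and uses ifelse() as a stand-in for fifelse(); the name row_aligned is introduced here for illustration.

```r
# Data hard-coded from the df printed above (assumption: same values)
df <- data.frame(a = c(1L, 5L, 15L, 9L, 10L, 4L, 2L, 12L, 13L, 11L),
                 b = c(9L, 5L, 4L, 10L, 2L, 3L, 15L, 1L, 12L, 13L))

# Row-aligned layout: a unique value stays in the row where it occurs in df
row_aligned <- data.frame(A = ifelse(df$a %in% df$b, NA_integer_, df$a),
                          B = ifelse(df$b %in% df$a, NA_integer_, df$b))

# Rows that hold a unique value, and the values themselves
rows_A <- which(!is.na(row_aligned$A))
rows_B <- which(!is.na(row_aligned$B))
vals_A <- row_aligned$A[rows_A]
vals_B <- row_aligned$B[rows_B]
```

Here the row numbers (rows_A, rows_B) differ from the values (vals_A, vals_B), whereas the loop in the question writes each value to the row whose number equals the value.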