The problem I am working on is to select the rows of a table A (N rows in total, many columns) whose key value appears in table B (K rows in total, a single key column). Here N ~ 1 bn >> K ~ 100k.
I wonder if I should just select rows by:
sub_set = df_A[df_A$Key %in% df_B$Key,]
Or should I use an inner_join or merge:
sub_set = df_A %>% inner_join(df_B, by="Key")
Which approach is preferable when N is very large?
CodePudding user response:
Let's generate 1M rows of sample data and benchmark both approaches:
library(microbenchmark)
library(tidyverse)

# 1M rows of sample data: df_A draws its keys from the 26 letters,
# df_B holds the 10 keys we want to keep
df_A <- data.frame(key = sample(letters, size = 1e6, replace = TRUE), val = rnorm(1e6))
df_B <- data.frame(key = letters[1:10])

microbenchmark(
  df_A[df_A$key %in% df_B$key, ],
  inner_join(df_A, df_B, by = 'key'),
  times = 1000
)
Unit: milliseconds
                               expr      min       lq     mean   median       uq      max neval
     df_A[df_A$key %in% df_B$key, ] 25.91352 27.57424 32.74825 28.08004 36.96685 158.8774  1000
 inner_join(df_A, df_B, by = "key") 46.75017 48.29013 53.95744 48.63215 57.25996 164.4909  1000
Note the median benchmarks: %in% combined with logical indexing is faster than dplyr::inner_join.
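As an aside, not covered by the benchmark above: if the goal is purely to filter df_A rather than to pull columns in from df_B, dplyr's filtering join semi_join() expresses the same %in% semantics directly and never duplicates rows when df_B contains repeated keys. A minimal sketch reusing the sample data above:

library(dplyr)

# semi_join() keeps the rows of df_A that have a match in df_B,
# without adding any columns from df_B
sub_set <- semi_join(df_A, df_B, by = "key")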
CodePudding user response:
In some situations you might find the data.table join approach to be quite fast:
library(data.table)
# setDT() converts both data frames to data.tables by reference;
# df_A[df_B, on = .(key)] then joins df_B's keys against df_A
setDT(df_A)[setDT(df_B), on = .(key)]
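One caveat worth adding as a sketch (it is not part of the original answer): df_A[df_B, on = .(key)] returns one row per row of df_B and fills df_A's columns with NA whenever a key from df_B has no match, so it is not always identical to the %in% filter. Dropping unmatched keys reproduces the filter exactly:

# nomatch = NULL drops keys of df_B that never appear in df_A
# (older data.table versions use nomatch = 0L instead)
setDT(df_A)[setDT(df_B), on = .(key), nomatch = NULL]

In this example every key of df_B does occur in df_A, so both forms return the same rows.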
Using jdobres' nice benchmarking example, I found the data.table join to be about twice as fast as the %in% approach:
Unit: milliseconds
                               expr     min       lq     mean   median       uq      max neval
     df_A[df_A$key %in% df_B$key, ] 27.3338 27.91770 37.03861 28.47005 41.30420 309.4625   100
 inner_join(df_A, df_B, by = "key") 64.7635 66.39085 77.59911 67.94265 89.34715 347.9862   100
            df_A[df_B, on = .(key)] 11.6840 12.24055 15.45660 12.59900 13.30475  41.5436   100
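At the scale described in the question (N around 1 bn rows), two further data.table idioms may be worth adding to the benchmark. This is a sketch under the assumption that key is a character column; it is not taken from the answers above:

library(data.table)
setDT(df_A)

# %chin% is data.table's optimised %in% for character vectors,
# used here as a plain filter that keeps df_A's row order
sub_set <- df_A[key %chin% df_B$key]

# Alternatively, sort df_A by its key once; subsequent joins against it
# can then use binary search instead of scanning all rows
setkey(df_A, key)
sub_set <- df_A[df_B, on = .(key), nomatch = NULL]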