R: %in% vs inner_join


The problem I am working on is to select the rows of a table A (N rows in total, many columns) whose key value is %in% table B (K rows in total, one column only). Here N ~ 1 bn >> K ~ 100k.

I wonder if I should just select rows by:

sub_set = df_A[df_A$Key %in% df_B$Key,]

Or should I use an inner_join or merge function instead:

sub_set = df_A %>% inner_join(df_B, by="Key")

Which approach is better when N is very large?

CodePudding user response:

Let's generate 1M rows of sample data and benchmark both approaches:

# 1M-row table with a single-letter key column and a numeric value column
df_A <- data.frame(key = sample(letters, size = 1e6, replace = TRUE), val = rnorm(1e6))
# lookup table of 10 keys
df_B <- data.frame(key = letters[1:10])

library(microbenchmark)
library(tidyverse)

microbenchmark(
  df_A[df_A$key %in% df_B$key, ], 
  inner_join(df_A, df_B, by = 'key'),
  times = 1000
)

Unit: milliseconds
                               expr      min       lq     mean   median       uq      max neval
     df_A[df_A$key %in% df_B$key, ] 25.91352 27.57424 32.74825 28.08004 36.96685 158.8774  1000
 inner_join(df_A, df_B, by = "key") 46.75017 48.29013 53.95744 48.63215 57.25996 164.4909  1000

Note the median timings: %in% combined with logical indexing is faster than dplyr::inner_join.
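
One caveat: inner_join also brings in df_B's columns and will duplicate rows of df_A if a key occurs more than once in df_B. If the goal is purely to filter df_A, the closer dplyr equivalent of the %in% filter is semi_join; a minimal sketch reusing the sample data above (not benchmarked here):

library(dplyr)

# keep the rows of df_A that have a matching key in df_B,
# without adding df_B's columns or duplicating rows
sub_set <- semi_join(df_A, df_B, by = "key")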

CodePudding user response:

In some situations, you may find the data.table join approach to be quite fast:

library(data.table)

# convert both data frames to data.table (by reference) and join on "key"
setDT(df_A)[setDT(df_B), on = .(key)]

Using jdobres' nice benchmarking example, I found this to be about twice as fast as the %in% approach:

Unit: milliseconds
                               expr     min       lq     mean   median       uq      max neval
     df_A[df_A$key %in% df_B$key, ] 27.3338 27.91770 37.03861 28.47005 41.30420 309.4625   100
 inner_join(df_A, df_B, by = "key") 64.7635 66.39085 77.59911 67.94265 89.34715 347.9862   100
            df_A[df_B, on = .(key)] 11.6840 12.24055 15.45660 12.59900 13.30475  41.5436   100
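
One detail to keep in mind (an aside, not part of the benchmark above): df_A[df_B, on = .(key)] joins keyed on df_B, so a key in df_B with no match in df_A would produce an NA row by default. For strict inner-join / filtering semantics, a sketch along these lines should work:

library(data.table)
setDT(df_A)
setDT(df_B)

# nomatch = NULL drops df_B keys that have no match in df_A; assuming df_B's
# keys are unique, this returns the same rows of df_A as the %in% filter
sub_set <- df_A[df_B, on = .(key), nomatch = NULL]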