Efficiently merging named lists


I have two extremely large datasets and I am looking for a way to merge them efficiently (an inner join is fine). My solution was to create named split lists on the merging keys and then merge the lists element by element.

The solution is unfortunately still very inefficient. Is there a way I can avoid using dplyr altogether? I believe that, together with the slow lapply, is the root of the issue. Would map be a good solution?

Here is a reproducible example. Thanks in advance!

library(tidyverse)
library(data.table)

named_group_split <- function(.tbl, ...) {
  
  # get names
  grouped <- group_by(.tbl, ...)
  names <- rlang::eval_bare(rlang::expr(paste(!!!group_keys(grouped), sep = "-")))
  
  # split data
  grouped %>% 
    group_split(.keep = FALSE) %>% 
    rlang::set_names(names)
}


# FIRST SPLIT LIST
set.seed(1)
db_1 <- data.frame(id_1=sample(1:10, 10, replace=T),
                   id_2=sample(LETTERS, 10, replace=T),
                   value1=runif(10, 1.0, 10.0)) %>% 
  data.table() %>% 
  dplyr::mutate(id_1name=id_1,
                id_2name=id_2) %>% 
  named_group_split(id_1name,id_2name)

# SECOND SPLIT LIST
set.seed(2)
db_2 <- data.frame(id_1=sample(1:10, 1000, replace=T),
                   id_2=sample(LETTERS, 1000, replace=T),
                   value2=runif(1000, 1.0, 10.0)) %>% 
  data.table() %>% 
  dplyr::mutate(id_1name=id_1,
                id_2name=id_2) %>% 
  named_group_split(id_1name,id_2name)


keys <- intersect(unique(c(names(db_1))), unique(c(names(db_2))))

result <- setNames(lapply(keys, function(key){
  db_1[[key]] %>% 
    dplyr::left_join(db_2[[key]], by = c("id_1", "id_2"))
}), keys) %>% 
  do.call(plyr::rbind.fill, .) 

What I managed to do so far is the following:

library(purrr)

my_lists <- function(list1, list2) {  
  keys <- intersect(unique(c(names(list1))), unique(c(names(list2))))
  result <- map2(list1[keys], list2[keys], c) 
  
  return(result)
  
}

result <- my_lists(db_1, db_2)

But from there I cannot recreate the data frame I need...

CodePudding user response:

Perhaps I am missing something, but I think you are looking for a right join.

library(data.table)
set.seed(1)
x <- data.table(id_1=sample(1:10, 10, replace=T),
                id_2=sample(LETTERS, 10, replace=T))
set.seed(2)
y <- data.table(id_1=sample(1:10, 1000, replace=T),
                id_2=sample(LETTERS, 1000, replace=T),
                value=runif(1000, 1.0, 10.0))

res <- y[x, on = c("id_1", "id_2")]
setorderv(res, c("id_1", "id_2"))
res

    id_1 id_2    value
 1:    1    N 8.441071
 2:    1    N 5.680715
 3:    1    U 3.681615
 4:    1    U 7.159788
 5:    1    U 2.732143
 6:    1    U 1.246033
 7:    2    I 5.810975
 8:    2    I 7.015206
 9:    2    J 4.238503
10:    2    J 6.705899
11:    2    J 7.852617
12:    3    O 8.645228
13:    3    O 7.496095
14:    3    O 2.553506
15:    4    J 2.158799
16:    4    J 4.876194
17:    4    J 4.118119
18:    4    J 8.434567
19:    4    J 9.263452
20:    4    J 3.661710
21:    5    E 1.112813
22:    5    E 4.654503
23:    7    G 1.307912
24:    7    G 6.233104
25:    7    G 3.577720
26:    7    G 1.918811
27:    7    G 7.521629
28:    7    V 2.511941
29:    7    V 2.785865
30:    7    V 6.875612
31:    7    V 4.133124
32:    7    V 8.655604
33:    9    U 3.090105
34:    9    U 5.347673

This is about 250 times faster on my PC, although the effect might be different for your actual data.
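
For reference, a comparison along these lines can be timed with microbenchmark. This is only a sketch, assuming the x and y tables above as well as the db_1 and db_2 split lists from the question; the exact ratio will vary with the data:

library(microbenchmark)

microbenchmark(
  # the question's approach: join the split lists pair by pair, then stack
  split_join = {
    keys <- intersect(names(db_1), names(db_2))
    do.call(rbind, lapply(keys, function(key) {
      dplyr::left_join(db_1[[key]], db_2[[key]], by = c("id_1", "id_2"))
    }))
  },
  # the single keyed data.table join from above
  dt_join = y[x, on = c("id_1", "id_2")],
  times = 10
)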

Is there a way I can avoid using dplyr altogether? I believe that, together with the slow lapply, is the root of the issue. Would map be a good solution?

Generally, when dplyr is too slow it is worth having a look at data.table, or at dtplyr, which translates dplyr syntax to data.table. lapply might be somewhat slow for large datasets if not handled with care, but when variables are properly initialized, the looping approach of choice should not be the main bottleneck of the task. If it is, try vapply.
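
For example, a minimal dtplyr sketch of the same join, assuming the x and y tables created above, could look like this:

library(dplyr)
library(dtplyr)

# lazy_dt() records the dplyr verbs and translates them to data.table code;
# nothing is computed until the result is collected with as_tibble()
res_dtplyr <- lazy_dt(y) %>%
  right_join(x, by = c("id_1", "id_2")) %>%
  arrange(id_1, id_2) %>%
  as_tibble()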
