More efficient way to purrr::map2 for a large dataframe

Time:11-02

Is there a faster way to do the following? In the real application, df has many rows, and list_of_colnames has one element per row:

library(purrr)

list_of_colnames <- list(c("A", "B"), c("A"))
some_vector <- c("fish", "cat")

# pair each one-row data frame with its own set of column names
map2(split(df, seq(nrow(df))), list_of_colnames, function(row, colnames) {
  row$indicator <- ifelse(any(row[, colnames] %in% some_vector), 1, 0)
  row
})

While this implementation works, it is extremely slow for the large df. I suspect split() is a major bottleneck.

Thank you!

CodePudding user response:

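One option is to use row/column matrix indexing, so that each row is checked only against its own columns:

rowind <- rep(seq_len(nrow(df)), lengths(list_of_colnames))
colind <- match(unlist(list_of_colnames), names(df))
# pick out exactly the (row, column) cells listed for each row
vals <- as.matrix(df)[cbind(rowind, colind)]
df$indicator <- as.integer(tapply(vals %in% some_vector, rowind, FUN = any))

-output

> df
      A   B indicator
1  fish   A         1
2 hello cat         0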

data

df <- data.frame(A =  c('fish', 'hello'), B = c('A', 'cat'))
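
A variation on the same idea, in case it is useful: build one logical matrix of "cell holds a target value" and one logical mask of "column is listed for this row", then combine them. This is only a sketch under the assumption that list_of_colnames has exactly one entry per row; the helper name mask_indicator is made up for illustration, and it has not been benchmarked against the real data.

```r
# Small example data; list_of_colnames is assumed to have one entry per row.
df <- data.frame(A = c("fish", "hello"), B = c("A", "cat"),
                 stringsAsFactors = FALSE)
list_of_colnames <- list(c("A", "B"), c("A"))
some_vector <- c("fish", "cat")

# Hypothetical helper: returns 1 when any of row i's listed columns
# holds one of the target values, and 0 otherwise.
mask_indicator <- function(df, list_of_colnames, some_vector) {
  stopifnot(length(list_of_colnames) == nrow(df))
  cols <- unique(unlist(list_of_colnames))

  # hit[i, j] is TRUE when df[i, cols[j]] is one of the target values
  hit <- do.call(cbind, lapply(df[cols], function(x) x %in% some_vector))

  # mask[i, j] is TRUE when cols[j] is listed for row i
  mask <- matrix(FALSE, nrow = nrow(df), ncol = length(cols))
  rowind <- rep(seq_len(nrow(df)), lengths(list_of_colnames))
  mask[cbind(rowind, match(unlist(list_of_colnames), cols))] <- TRUE

  # a row is flagged when at least one listed cell is a hit
  as.numeric(rowSums(hit & mask) > 0)
}

df$indicator <- mask_indicator(df, list_of_colnames, some_vector)
```

Only the columns that appear somewhere in list_of_colnames are scanned, and no per-row data frames are created, which is where split() spends its time.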

CodePudding user response:

You can avoid splitting your data frame into a list altogether and instead apply your condition across the rows using rowwise and c_across from dplyr:

library(dplyr)
library(purrr)

list_of_colnames <- list(c("A", "B"), c("A"))
some_vector <- c("fish", "cat")

map(list_of_colnames, ~ 
      df %>% 
      rowwise() %>% 
      mutate(indicator = as.numeric(any(c_across(all_of(.x)) %in% some_vector))) %>% 
      ungroup()
    )

Output

Mapping over list_of_colnames still returns a list, with one data frame per column set (each set is applied to every row):

[[1]]
# A tibble: 3 x 4
  A     B     C     indicator
  <chr> <chr> <chr>     <dbl>
1 fish  dog   bird          1
2 dog   cat   bird          1
3 bird  lion  cat           0

[[2]]
# A tibble: 3 x 4
  A     B     C     indicator
  <chr> <chr> <chr>     <dbl>
1 fish  dog   bird          1
2 dog   cat   bird          0
3 bird  lion  cat           0

Data

df <- structure(list(A = c("fish", "dog", "bird"), B = c("dog", "cat", 
"lion"), C = c("bird", "bird", "cat")), class = "data.frame", row.names = c(NA, 
-3L))
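
If a single indicator column is wanted, with the i-th column set applied only to row i as in the question, one possible sketch is to loop over row indices with base R's mapply instead of splitting the data frame. The third column set below is an assumption, added only so that the list matches the three-row example data. This is still a row-by-row loop, so it is likely slower than a fully vectorised approach on a very large df.

```r
df <- data.frame(A = c("fish", "dog", "bird"),
                 B = c("dog", "cat", "lion"),
                 C = c("bird", "bird", "cat"),
                 stringsAsFactors = FALSE)
some_vector <- c("fish", "cat")

# Assumption: one column set per row; the third entry is made up for this example.
list_of_colnames <- list(c("A", "B"), c("A"), c("B", "C"))

df$indicator <- mapply(
  function(i, cols) {
    # values of row i in its own columns, flattened to a plain vector
    values <- unlist(df[i, cols, drop = FALSE], use.names = FALSE)
    as.numeric(any(values %in% some_vector))
  },
  seq_len(nrow(df)), list_of_colnames
)
```

Here the result is one numeric column on the original data frame rather than a list of data frames.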