Home > OS >  Aggregating binary columns into one column is taking a long time in R
Aggregating binary columns into one column is taking a long time in R

Time:03-30

I am running a piece of code that was discussed earlier here on a large data with 1.5 million rows and it's been taking hours to run and not done yet. My data looks like this:

ID    London   Paris   Rome
1       Yes     No      Yes
2       No      No      Yes
3       No      Yes     Yes
4       No      Yes     No

and I'd like to add a column that shows all cities that an ID has travelled to as well as a column showing the number of cities that an ID has travelled to as the following:

ID    London   Paris   Rome    All Cities      Count of Cities travelled
1       Yes     No      Yes    London, Rome                2
2       No      No      Yes     Rome                       1
3       No      Yes     Yes    Paris, Rome                 2
4       No      Yes     No     Paris                       1

and I am running this code that works fine when I run it on a sample of 100 rows of data :

cities <- c('London', 'Paris', 'Rome')

df %>%
  rowwise %>%
  mutate(`All Cities` = toString(names(.[, cities])[which(c_across(all_of(cities)) == 'Yes')]),
         `Count of Cities travelled` = sum(c_across(all_of(cities)) == 'Yes'))

is there any way to improve this code? or make the running time shorter?

Thank you!

CodePudding user response:

Here is a tidyverse approach without using rowwise(), which is known to be terribly slow.

library(tidyverse)

df %>% 
  mutate(across(everything(), ~ifelse(.x == "Yes", cur_column(), NA), .names = "{.col}1")) %>% 
  unite(`All Cities`, ends_with("1"), sep = ", ", na.rm = T) %>% 
  mutate(`Count of Cities travelled` = str_count(`All Cities`, ",")   1)

  ID London Paris Rome   All Cities Count of Cities travelled
1  1    Yes    No  Yes London, Rome                         2
2  2     No    No  Yes         Rome                         1
3  3     No   Yes  Yes  Paris, Rome                         2
4  4     No   Yes   No        Paris                         1

CodePudding user response:

A possible solution in base R:

df$Cities <- apply(df, 1, \(x) paste(names(df[-1])[x[-1] == "Yes"], collapse = ", "))
df$N <- apply(df, 1, \(x) sum(x[-1] == "Yes"))
df

#>   ID London Paris Rome       Cities N
#> 1  1    Yes    No  Yes London, Rome 2
#> 2  2     No    No  Yes         Rome 1
#> 3  3     No   Yes  Yes  Paris, Rome 2
#> 4  4     No   Yes   No        Paris 1

With dplyr and rowwise:

library(dplyr)

df %>%
  rowwise %>%
  mutate(Cities = str_c(colnames(df[-1])[c_across(2:4) == "Yes"], collapse = ", "),
         N = sum(c_across(2:4) == "Yes")) %>%
  ungroup

#> # A tibble: 4 × 6
#>      ID London Paris Rome  Cities           N
#>   <int> <chr>  <chr> <chr> <chr>        <int>
#> 1     1 Yes    No    Yes   London, Rome     2
#> 2     2 No     No    Yes   Rome             1
#> 3     3 No     Yes   Yes   Paris, Rome      2
#> 4     4 No     Yes   No    Paris            1
  • Related