I am running a piece of code that was discussed earlier here on a large data with 1.5 million rows and it's been taking hours to run and not done yet. My data looks like this:
ID London Paris Rome
1 Yes No Yes
2 No No Yes
3 No Yes Yes
4 No Yes No
and I'd like to add a column that shows all cities that an ID has travelled to as well as a column showing the number of cities that an ID has travelled to as the following:
ID London Paris Rome All Cities Count of Cities travelled
1 Yes No Yes London, Rome 2
2 No No Yes Rome 1
3 No Yes Yes Paris, Rome 2
4 No Yes No Paris 1
and I am running this code that works fine when I run it on a sample of 100 rows of data :
cities <- c('London', 'Paris', 'Rome')
df %>%
rowwise %>%
mutate(`All Cities` = toString(names(.[, cities])[which(c_across(all_of(cities)) == 'Yes')]),
`Count of Cities travelled` = sum(c_across(all_of(cities)) == 'Yes'))
is there any way to improve this code? or make the running time shorter?
Thank you!
CodePudding user response:
Here is a tidyverse
approach without using rowwise()
, which is known to be terribly slow.
library(tidyverse)
df %>%
mutate(across(everything(), ~ifelse(.x == "Yes", cur_column(), NA), .names = "{.col}1")) %>%
unite(`All Cities`, ends_with("1"), sep = ", ", na.rm = T) %>%
mutate(`Count of Cities travelled` = str_count(`All Cities`, ",") 1)
ID London Paris Rome All Cities Count of Cities travelled
1 1 Yes No Yes London, Rome 2
2 2 No No Yes Rome 1
3 3 No Yes Yes Paris, Rome 2
4 4 No Yes No Paris 1
CodePudding user response:
A possible solution in base R:
df$Cities <- apply(df, 1, \(x) paste(names(df[-1])[x[-1] == "Yes"], collapse = ", "))
df$N <- apply(df, 1, \(x) sum(x[-1] == "Yes"))
df
#> ID London Paris Rome Cities N
#> 1 1 Yes No Yes London, Rome 2
#> 2 2 No No Yes Rome 1
#> 3 3 No Yes Yes Paris, Rome 2
#> 4 4 No Yes No Paris 1
With dplyr
and rowwise
:
library(dplyr)
df %>%
rowwise %>%
mutate(Cities = str_c(colnames(df[-1])[c_across(2:4) == "Yes"], collapse = ", "),
N = sum(c_across(2:4) == "Yes")) %>%
ungroup
#> # A tibble: 4 × 6
#> ID London Paris Rome Cities N
#> <int> <chr> <chr> <chr> <chr> <int>
#> 1 1 Yes No Yes London, Rome 2
#> 2 2 No No Yes Rome 1
#> 3 3 No Yes Yes Paris, Rome 2
#> 4 4 No Yes No Paris 1