I have a data frame that looks like this:
df <- data.frame("Month" = c("April", "April", "May", "May", "June", "June", "June"),
                 "ID" = c(11, 11, 12, 10, 11, 11, 11),
                 "Region" = c("East", "West", "North", "East", "North", "East", "West"),
                 "Qty" = c(120, 110, 110, 110, 100, 90, 70),
                 "Sales" = c(1000, 1100, 900, 1000, 1000, 800, 650),
                 "Leads" = c(10, 12, 9, 8, 6, 5, 4))
Month ID Region Qty Sales Leads
April 11 East   120  1000    10
April 11 West   110  1100    12
May   12 North  110   900     9
May   10 East   110  1000     8
June  11 North  100  1000     6
June  11 East    90   800     5
June  11 West    70   650     4
I want a data frame that looks like this:
Month ID Qty Sales Leads Region
April 11 230  2100    22 East
May   12 110   900     9 North
May   10 110  1000     8 East
June  11 260  2450    15 North
I am using the following code:
result <- df %>%
  group_by(Month, ID) %>%
  mutate(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE))) %>%
  slice(1)
result$Region <- NULL
I have over 2 million such rows and it is taking forever to calculate the aggregate.
I am using mutate and slice instead of summarize because the df is arranged in a certain way and I want to retain the Region in that first row.
However, I think there could be a more efficient way. Please help with both (making it faster and keeping Region). I can't figure it out for the life of me.
CodePudding user response:
summarize makes more sense to me than mutate and slice. This should save you some time.
library(dplyr)
result <- df %>%
  group_by(Month, ID) %>%
  summarize(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE)),
            Region = first(Region))
result
# # A tibble: 4 x 6
# # Groups:   Month [3]
#   Month    ID   Qty Sales Leads Region
#   <chr> <dbl> <dbl> <dbl> <dbl> <chr>
# 1 April    11   230  2100    22 East
# 2 June     11   260  2450    15 North
# 3 May      10   110  1000     8 East
# 4 May      12   110   900     9 North
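One small note (not part of the original answer): the result above is still grouped by Month, as the "Groups" line shows. If that grouping is not wanted downstream, it can be dropped inside summarize:

# Sketch: same aggregation, but returning an ungrouped result.
result <- df %>%
  group_by(Month, ID) %>%
  summarize(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE)),
            Region = first(Region),
            .groups = "drop")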
CodePudding user response:
We can apply generic speed-up strategies:
- Do less
- Choose an appropriate back-end
- Use appropriate data structures
dplyr provides syntactic sugar for data manipulation, but it may not be the most efficient option for handling large data sets.
solution 1
We could rewrite the code slightly to be more efficient by using the collapse package, which provides fast C/C++-based analogues of dplyr functions. It prepends its dplyr-style functions with f, with one exception: fsubset, which is similar to dplyr::filter (or base R subset).
library(collapse)
df |>
  fgroup_by(Month, ID) |>
  fsummarise(Qty = fsum(Qty),
             Sales = fsum(Sales),
             Leads = fsum(Leads),
             Region = fsubset(Region, 1L),
             keep.group_vars = TRUE) |>
  as_tibble() # optional
#> # A tibble: 4 x 6
#>   Month    ID   Qty Sales Leads Region
#>   <chr> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 April    11   230  2100    22 East
#> 2 June     11   260  2450    15 North
#> 3 May      10   110  1000     8 East
#> 4 May      12   110   900     9 North
Here |> (the base R pipe, which requires R >= 4.1.0) is slightly faster than %>%. Its result is ungrouped.
solution 2
data.table is often lauded for its speed, memory use, and utility. The easiest way to convert existing dplyr code to data.table is the dtplyr package, which is installed with the tidyverse. We can convert the code by adding two lines.
library(dtplyr)
df1 <- lazy_dt(df)
df1 %>%
  group_by(Month, ID) %>%
  summarize(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE)),
            Region = first(Region)) %>%
  as_tibble() # or as.data.table()
Note that this results in an ungrouped data frame at the end.
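For comparison, a plain data.table version (a sketch, not part of the benchmarked code below) could look roughly like this; it keeps the first Region per group, matching the OP's approach:

library(data.table)
# Sketch: sum Qty:Leads per Month/ID and keep the first Region in each group.
dt <- as.data.table(df)
dt[, c(lapply(.SD, sum, na.rm = TRUE), list(Region = Region[1])),
   by = .(Month, ID), .SDcols = Qty:Leads]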
Benchmarks
The approaches are wrapped in functions for brevity; dplyr here is www's approach.
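The exact wrapper definitions are not shown in the original answer; a sketch reconstructing them from the code above (assuming the libraries loaded earlier) might look like this:

# Sketch of the wrapper functions used in the benchmark below
# (reconstructed from the code above; treat the bodies as assumptions).
collapse <- function(df) {
  df |>
    fgroup_by(Month, ID) |>
    fsummarise(Qty = fsum(Qty),
               Sales = fsum(Sales),
               Leads = fsum(Leads),
               Region = fsubset(Region, 1L),
               keep.group_vars = TRUE)
}

dplyr <- function(df) {
  df %>%
    group_by(Month, ID) %>%
    summarize(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE)),
              Region = first(Region))
}

dtplyr <- function(df) {
  lazy_dt(df) %>%
    group_by(Month, ID) %>%
    summarize(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE)),
              Region = first(Region)) %>%
    as_tibble()
}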
bench::mark(collapse = collapse(df), dplyr = dplyr(df), dtplyr = dtplyr(df),
            time_unit = "ms", iterations = 200)[c(1, 3, 5, 7)]
# A tibble: 3 x 4
  expression median mem_alloc n_itr
  <bch:expr>  <dbl> <bch:byt> <int>
1 collapse    0.316        0B   200
2 dplyr       5.42     8.73KB   195
3 dtplyr      6.67   120.21KB   196
We can see that collapse is more memory efficient and significantly faster than dplyr. The dtplyr approach is included here because its time complexity is different from that of dplyr.
Further optimizations are still possible - for example by converting numeric columns into integer columns.
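A minimal sketch of that idea, assuming the counts are whole numbers that fit in the integer range:

# Sketch: store whole-number columns as integers to reduce memory use.
df$Qty   <- as.integer(df$Qty)
df$Leads <- as.integer(df$Leads)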
side-note
The Region column loses information in the OP's required approach, i.e. reporting only East for April and only North for June under ID 11 drops the other regions and can be misleading.
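If all regions per group matter, one option (a sketch, not what the OP asked for) is to keep them all, collapsed into a single string:

# Sketch: keep every region per group instead of only the first one.
df %>%
  group_by(Month, ID) %>%
  summarize(across(Qty:Leads, ~ sum(.x, na.rm = TRUE)),
            Region = toString(unique(Region)),
            .groups = "drop")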