More efficient way of using group_by > mutute

I have a dataframe that looks like this

df <- data.frame("Month" = c("April","April","May","May","June","June","June"),
"ID" = c(11, 11, 12, 10, 11, 11, 11),
"Region" = c("East", "West", "North", "East", "North" ,"East", "West"),
"Qty" = c(120, 110, 110, 110, 100, 90, 70),
"Sales" = c(1000, 1100, 900, 1000, 1000, 800, 650),
"Leads" = c(10, 12, 9, 8, 6, 5, 4))

Month   ID     Region    Qty    Sales   Leads
April   11     East      120    1000    10
April   11     West      110    1100    12
May     12     North     110    900     9
May     10     East      110    1000    8
June    11     North     100    1000    6
June    11     East      90     800     5
June    11     West      70     650     4

I want an end df that looks like this

Month   ID     Qty     Sales   Leads   Region
April   11     230     2100    22      East
May     12     110     900     9       North
May     10     110     1000    8       East
June    11     260     2450    15      North

I am using a the following code

result <- df %>% group_by(month, ID) %>% mutate(across(.cols = Qty:Leads, ~sum(.x, na.rm = T))) %>% slice(n = 1) 

result$Region <- NULL

I have over 2 million such rows and it is taking forever to calculate the aggregate.

I am using mutate and slice instead of summarize because the df is arranged in a certain way and I want to retain the Region in that first row.

However I think their could be a more efficient way. Pls help on both. Can't figure it out for the life of me.

CodePudding user response：

summarize makes more sense to me than mutate and slice. This should save you some time.

library(dplyr)
result <- df %>%
  group_by(Month, ID) %>%
  summarize(across(.cols = Qty:Leads, ~sum(.x, na.rm = T)),
            Region = first(Region))
result
# # A tibble: 4 x 6
# # Groups:   Month [3]
#   Month    ID   Qty Sales Leads Region
#   <chr> <dbl> <dbl> <dbl> <dbl> <chr> 
# 1 April    11   230  2100    22 East  
# 2 June     11   260  2450    15 North 
# 3 May      10   110  1000     8 East  
# 4 May      12   110   900     9 North

CodePudding user response：

In addition to www's approach, you will likely get significant speed up if you swap to a data.table backend. The easiest conversion for this would be using the dtplyr package, which ships with tidyverse.

library(dtplyr)
df1 <- lazy_dt(df)
df1 %>%
      group_by(Month, ID) %>%
      summarize(across(.cols = Qty:Leads, ~sum(.x, na.rm = T)),
                Region = first(Region)) %>%
      as_tibble()