Home > other >  group_by: different behavior mutate and summarize
group_by: different behavior mutate and summarize

Time:02-01

I have a dataframe with revenue numbers for different clients in different years. I want to calculate the total revenue for a client in a year. Then I want to calculate the mean of yearly revenue for the 3 highest grossing years.

I manage to do this with group_by and summarize. However in this way I "lose" all variables not used in grouping or calculation functions. I changed summarize to mutate but for one reason or another, this gives me different results.

Does anyone know why this happpens, or how to get same result in a different manner?

MASTERDATA%>%
  group_by(Client,Year)%>%
  summarise(Revenue=sum(Total, na.rm=TRUE))%>%
  slice_max(n=3,order_by=Revenue,with_ties = FALSE )%>%
  mutate(Revenue_top_3 = mean(Revenue))%>%
  select(Client,Year,Revenue)%>%
  ungroup()%>%
  distinct()

Applying this to the mtcars dataset

mtcars%>%
  group_by(gear,carb)%>%
  mutate(Revenue=sum(mpg, na.rm=TRUE))%>%
  slice_max(n=3,order_by=Revenue,with_ties = FALSE )%>%
  mutate(Revenue_top_3 = mean(Revenue))%>%
  select(gear,carb,Revenue)%>%
  ungroup()%>%
  distinct()

CodePudding user response:

In your attempt you are not taking mean of 3 highest grossing years.

Here are two approaches -

library(dplyr)

mtcars%>%
  group_by(gear,carb)%>%
  summarise(Revenue=sum(mpg, na.rm=TRUE)) %>%
  slice_max(n=3,order_by=Revenue,with_ties = FALSE) %>%
  summarise(Revenue_top_3 = mean(Revenue))

#   gear Revenue_top_3
#  <dbl>         <dbl>
#1     3          64.2
#2     4          98.1
#3     5          30.6

And with same output but one less step.

mtcars%>%
  group_by(gear,carb)%>%
  summarise(Revenue=sum(mpg, na.rm=TRUE)) %>%
  summarise(Revenue_top_3 = Revenue %>% sort %>% tail(3) %>% mean())
  •  Tags:  
  • Related