Calculate Top N products by sales with in each year-CodePudding

I have the data about sales by years and by-products, let's say like this:

Year <- c(2010,2010,2010,2010,2010,2011,2011,2011,2011,2011,2012,2012,2012,2012,2012)
Model <- c("a","b","c","d","e","a","b","c","d","e","a","b","c","d","e")
Sale <- c("30","45","23","33","24","11","56","19","45","56","33","32","89","33","12")
df <- data.frame(Year, Model, Sale)

I want the code which identifies the TOP 2 products by years and summarises all the rest products as category "other".

CodePudding user response：

We could arrange by 'Year' and 'Sale' in descending order and then change the values of 'Model' based on the row_number after grouping by 'Year'

library(dplyr)
df %>% 
 arrange(Year, desc(Sale)) %>% 
 group_by(Year) %>%
 mutate(Model = case_when(row_number() > 2~ 'other', TRUE ~ Model)) %>%
 ungroup

Or another option is to use slice_max (with_ties = TRUE by default)

df %>%
  group_by(Year) %>% 
  mutate(Model =case_when(Model %in% {cur_data() %>% 
     slice_max(n = 2, order_by = Sale) %>%
     pull(Model)} ~ Model, TRUE ~ "other" )
)

CodePudding user response：

Similar to akrun's solution: Slighlty other strategy:

library(dplyr)

df %>% 
  type.convert(as.is = TRUE) %>% 
  group_by(Year) %>% 
  arrange(desc(Sale), .by_group = TRUE) %>% 
  mutate(Model = ifelse(Model == first(Model, 2), Model, "Other"))

# Groups:   Year [3]
    Year Model  Sale
   <int> <chr> <int>
 1  2010 b        45
 2  2010 d        33
 3  2010 Other    30
 4  2010 Other    24
 5  2010 Other    23
 6  2011 b        56
 7  2011 e        56
 8  2011 Other    45
 9  2011 Other    19
10  2011 Other    11
11  2012 c        89
12  2012 a        33
13  2012 Other    33
14  2012 Other    32
15  2012 Other    12

CodePudding user response：

You can use fct_lump_n() from forcats to collapse levels, but first you need to uncount your data.

library(dplyr)
library(tidyr)
library(forcats)
df |> 
  mutate(Sale = as.integer(Sale)) |>
  uncount(Sale) |> 
  group_by(Year) |> 
  mutate(Model = fct_lump_n(Model, n = 2)) |>
  count(Model)