Home > front end >  Calculate Top N products by sales and years
Calculate Top N products by sales and years

Time:04-20

I have the data about sales by years and by-products, let's say like this:

Year <- c(2010,2010,2010,2010,2010,2011,2011,2011,2011,2011,2012,2012,2012,2012,2012)
Model <- c("a","b","c","d","e","a","b","c","d","e","a","b","c","d","e")
Sale <- c("30","45","23","33","24","11","56","19","45","56","33","32","89","33","12")
df <- data.frame(Year, Model, Sale)

product by years:

a= 30 11 33 = 74
b= 45 56 32 = 133
c= 23 19 89 = 131
d= 33 45 33 = 111
e= 12 56 24 = 92

Ranking by according to total sales within these 3 years:

1 2 3 4 5 
b c d e a

I want the code which identifies the TOP 2 products (according to total sales within these 3 years) by years and summarises all the rest products as category "other". So the output should be like this:

year     Model          Sale
2010      b              45
2010      c              23
2010      other          30 33 24=92
2011      b              56
2011      c              19
2011      other          11 45 56=112
2012      b              32
2012      c              89
2012      other          33 33 12= 78

CodePudding user response:

A tidyverse solution. Your Sale data appear to be stored as character, which means we'll have to use as.numeric before summing them.

library(tidyverse)

df %>% 
  group_by(Model) %>% 
  mutate(
    Sale = as.numeric(Sale),
    total_sale = sum(Sale)
  ) %>% 
  ungroup %>% 
  mutate(
    model_condensed = ifelse(total_sale %in% rev(sort(unique(total_sale)))[1:2], Model, 'other')
  ) %>% 
  group_by(Year, model_condensed) %>% 
  summarize(Sale = sum(Sale))

   Year model_condensed  Sale
  <dbl> <chr>           <dbl>
1  2010 b                  45
2  2010 c                  23
3  2010 other              87
4  2011 b                  56
5  2011 c                  19
6  2011 other             112
7  2012 b                  32
8  2012 c                  89
9  2012 other              78

The above solution creates the "other" category by matching on the values in Sale. This could cause problems if those values have decimal places (see this question). Instead, we could use a two-step process to identify the top two Models by name, and use this to create the groupings for the total data:

totals <- df %>% 
  group_by(Model) %>% 
  summarize(total_sale = sum(as.numeric(Sale))) %>% 
  arrange(desc(total_sale)) %>% 
  slice_head(n = 2)

df %>% 
  group_by(Year, model_condensed = ifelse(Model %in% totals$Model, Model, 'other')) %>% 
  summarize(Sale = sum(as.numeric(Sale)))
  • Related