Home > database >  How to subset and group data by double condition?
How to subset and group data by double condition?

Time:07-11

I have sales data by years and models share column which represents the share of each model's sale within the specific year

   model Year sales Share 
1      A 2015   266  6.81%
2      B 2015   177  4.53%
3      C 2015   888  22.7%
...................
59     I 2020     5  0.84%
60     J 2020  1789  94.5% 

Next, I calculate the cumulative share of each model within the specific year and apply these flowing rules:

I take the last year (2020) and order models by cumulative share (sum of shares), when this cumulative share exceeds the 90% threshold, the rest of the models are grouped as a new category called "insignificant". However, if the share of a single model is already more than 90%, all the data (models) are categorised as 'insignificant and the output is like this below, because share of model J in 2020 is more than 90%.

   Year model         sales Share   pos Sumshare
  <dbl> <chr>         <dbl> <dbl> <int>    <dbl>
1  2015 insignificant  3906   100     1      100
2  2016 insignificant  6434   100     1      100
3  2017 insignificant  3967   100     1      100
4  2018 insignificant  5658   100     1      100
5  2019 insignificant 35316   100     1      100
6  2020 insignificant  1892   100     1      100

Now I want to update my IF statement and add one more rule:

leave the condition as it is, but in a special case if a single model's share is more than 90%, leave this model and all the rest aggregate and categorise as "insignificant".

In that case, output gonna be only 2 categories: model 'J' and 'insignificant'

(I need to make minor changes into my code)

Code:

df <- data.frame (model  = c("A","B","C","D","E","F","G","H","I","J","A","B","C","D","E","F","G","H","I","J","A","B","C","D","E","F","G","H","I","J","A","B","C","D","E","F","G","H","I","J","A","B","C","D","E","F","G","H","I","J","A","B","C","D","E","F","G","H","I","J"),
 Year = c(2015,2015,2015,2015,2015,2015,2015,2015,2015,2015,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020),
                  sales = c(266,177,888,233,255,677,411,344,122,533,234,567,234,567,232,900,1005,1900,450,345,567,235,456,345,144,333,555,777,111,444,222,223,445,776,331,788,980,1003,456,434,345,2222,3456,456,678,8911,4560,4567,4566,5555,4,18,16,4,17,11,24,4,5,1789))


df2 <- df  %>% 
  group_by(Year) %>% mutate(Share = 100 * sales/ sum(sales),
         order = order(order(-Share))) %>% arrange(Year, order, by_group = TRUE) %>%
  mutate(Sumshare= cumsum(Share)) %>%ungroup() %>%
  mutate(threshold.90 = model %in% model[Year == max(Year) & Sumshare < 90]) %>%
  mutate(model = ifelse(threshold.90, model, 'insignificant')) %>%
  group_by(Year, model) %>% summarize(sales = sum(sales), Share = sum(Share), .groups = 'keep') %>%
  group_by(Year) %>% mutate(pseudoShare = ifelse(model == 'insignificant', 0, Share)) %>%
  arrange(Year, -pseudoShare, by_group = TRUE) %>%
  ungroup() %>%
  mutate(pos = match(model, model[Year == max(Year)])) %>%
  select(-pseudoShare) %>% 
  group_by(Year) %>% mutate(Share = 100 * sales/ sum(sales)) %>% arrange(Year, -pos, by_group = TRUE) %>%
  mutate(Sumshare = cumsum(Share)) %>%ungroup()

CodePudding user response:

I wrote a condensed version of your script

df |> 
  group_by(Year) |> 
  mutate(share = sales/sum(sales), 
         model = ifelse(Year == 2020 & row_number() >= min(which(cumsum(share) > .9)), model, "Insignificant")) |> 
  group_by(Year, model) |> 
  summarise(share = sum(share) * 100, model = model, sales = sum(sales), .groups = "drop" ) |> 
  distinct()
   Year model          share sales
  <dbl> <chr>          <dbl> <dbl>
1  2015 Insignificant 100     3906
2  2016 Insignificant 100     6434
3  2017 Insignificant 100     3967
4  2018 Insignificant 100     5658
5  2019 Insignificant 100    35316
6  2020 Insignificant   5.44   103
7  2020 J              94.6   1789

Your code can be cleaned up a bit, a lot of mutate calls and group by calls.

You can condense your mutate calls into one parenthesis such as: mutate(a = ..., b = ...)

CodePudding user response:

You can link two (or more) conditions together with the logical &&-operator.

F.e.

(1 == 1) && (2 == 2)
[1] TRUE

(1 == 1) && (1 == 2)
[1] FALSE

If you want to apply more complex conditions (or combinations of them), you can use brackets, similar to mathematics. Fyi, the logical or would be ||.

Hope that helps.

On a said note, I would suggest to split your code into multiple variables / lines as this is a very complex way of writing a relatively easy code.

  • Related