How to do conditional grouping of data in R?


Here I have sales data by year and model:

df <- data.frame(model = c("A","B","C","D","E","A","B","C","D","E","A","B","C","D","E","A","B","C","D","E"),
                 Year  = c(2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020),
                 sales = c(900,235,456,345,144,333,555,445,456,434,8911,4560,4567,4566,5555,224,14,15,170,1180))

   model Year sales
1      A 2017   900
..................
17     B 2020    14
18     C 2020    15
19     D 2020   170
20     E 2020 1180

Here I add share and cumulative-share columns and apply the following condition: a model whose cumulative share exceeds 90% in 2020 is categorised as "insignificant". The condition is evaluated on 2020 only, and the result is then spread over the entire period. For instance, if models E and A survive and the rest are categorised as insignificant in 2020, then E and A should be kept separate and all other models converted to insignificant in every year.

library(dplyr)

df2 <- df %>%
  group_by(Year) %>%
  mutate(Share = 100 * sales / sum(sales),
         order = order(order(-Share))) %>%   # rank within year, largest share first
  arrange(Year, order, .by_group = TRUE) %>%
  mutate(CumulativeShare = cumsum(Share)) %>%
  ungroup() %>%
  mutate(threshold.90 = model %in% model[Year == max(Year) & CumulativeShare < 90]) %>%
  mutate(model = ifelse(threshold.90, model, 'insignificant'))
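As an aside, order(order(-Share)) is a base-R idiom for ranking: the inner order() gives the sort permutation and the outer order() inverts it into each element's rank. A minimal illustration with a made-up vector x:

x <- c(30, 10, 20)
order(order(-x))  # 1 3 2: the rank of each element, largest first
# dplyr::min_rank(-x) expresses the same idea more directly (ties aside)

The pipeline above produces: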

           model Year sales      Share order CumulativeShare threshold.90
1              A 2017   900 43.2692308     1        43.26923         TRUE
2  insignificant 2017   456 21.9230769     2        65.19231        FALSE
3  insignificant 2017   345 16.5865385     3        81.77885        FALSE
4  insignificant 2017   235 11.2980769     4        93.07692        FALSE
5              E 2017   144  6.9230769     5       100.00000         TRUE
6  insignificant 2018   555 24.9662618     1        24.96626        FALSE
7  insignificant 2018   456 20.5128205     2        45.47908        FALSE
8  insignificant 2018   445 20.0179937     3        65.49708        FALSE
9              E 2018   434 19.5231669     4        85.02024         TRUE
10             A 2018   333 14.9797571     5       100.00000         TRUE
11             A 2019  8911 31.6452999     1        31.64530         TRUE
12             E 2019  5555 19.7272630     2        51.37256         TRUE
13 insignificant 2019  4567 16.2186157     3        67.59118        FALSE
14 insignificant 2019  4566 16.2150645     4        83.80624        FALSE
15 insignificant 2019  4560 16.1937569     5       100.00000        FALSE
16             E 2020  1180 73.6119775     1        73.61198         TRUE
17             A 2020   224 13.9737991     2        87.58578         TRUE
18 insignificant 2020   170 10.6051154     3        98.19089        FALSE
19 insignificant 2020    15  0.9357455     4        99.12664        FALSE
20 insignificant 2020    14  0.8733624     5       100.00000        FALSE

However, if a single model has a share above 90% in 2020, then logically every model has a cumulative share above 90%, so all of them are categorised as "insignificant". For example, if we change the last sales value from 1180 to 20000 in the data frame, the output will look like this:

df <- data.frame(model = c("A","B","C","D","E","A","B","C","D","E","A","B","C","D","E","A","B","C","D","E"),
                 Year  = c(2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020),
                 sales = c(900,235,456,345,144,333,555,445,456,434,8911,4560,4567,4566,5555,224,14,15,170,20000))

df2 <- df %>% ... # same pipeline as above


           model Year sales       Share order CumulativeShare threshold.90
1  insignificant 2017   900 43.26923077     1        43.26923        FALSE
2  insignificant 2017   456 21.92307692     2        65.19231        FALSE
3  insignificant 2017   345 16.58653846     3        81.77885        FALSE
4  insignificant 2017   235 11.29807692     4        93.07692        FALSE
5  insignificant 2017   144  6.92307692     5       100.00000        FALSE
6  insignificant 2018   555 24.96626181     1        24.96626        FALSE
7  insignificant 2018   456 20.51282051     2        45.47908        FALSE
8  insignificant 2018   445 20.01799370     3        65.49708        FALSE
9  insignificant 2018   434 19.52316689     4        85.02024        FALSE
10 insignificant 2018   333 14.97975709     5       100.00000        FALSE
11 insignificant 2019  8911 31.64529990     1        31.64530        FALSE
12 insignificant 2019  5555 19.72726304     2        51.37256        FALSE
13 insignificant 2019  4567 16.21861572     3        67.59118        FALSE
14 insignificant 2019  4566 16.21506446     4        83.80624        FALSE
15 insignificant 2019  4560 16.19375688     5       100.00000        FALSE
16 insignificant 2020 20000 97.92880576     1        97.92881        FALSE
17 insignificant 2020   224  1.09680262     2        99.02561        FALSE
18 insignificant 2020   170  0.83239485     3        99.85800        FALSE
19 insignificant 2020    15  0.07344660     4        99.93145        FALSE
20 insignificant 2020    14  0.06855016     5       100.00000        FALSE

So I want to avoid this situation by adding one more condition:

If a single model's share is more than 90% in 2020, it should be kept separate and all the other models should be categorised as insignificant.

Expected output:

           model Year sales       Share order CumulativeShare threshold.90
1  insignificant 2017   900 43.26923077     1        43.26923        FALSE
2  insignificant 2017   456 21.92307692     2        65.19231        FALSE
3  insignificant 2017   345 16.58653846     3        81.77885        FALSE
4  insignificant 2017   235 11.29807692     4        93.07692        FALSE
5              E 2017   144  6.92307692     5       100.00000        FALSE
6  insignificant 2018   555 24.96626181     1        24.96626        FALSE
7  insignificant 2018   456 20.51282051     2        45.47908        FALSE
8  insignificant 2018   445 20.01799370     3        65.49708        FALSE
9              E 2018   434 19.52316689     4        85.02024        FALSE
10 insignificant 2018   333 14.97975709     5       100.00000        FALSE
11 insignificant 2019  8911 31.64529990     1        31.64530        FALSE
12             E 2019  5555 19.72726304     2        51.37256        FALSE
13 insignificant 2019  4567 16.21861572     3        67.59118        FALSE
14 insignificant 2019  4566 16.21506446     4        83.80624        FALSE
15 insignificant 2019  4560 16.19375688     5       100.00000        FALSE
16             E 2020 20000 97.92880576     1        97.92881        FALSE
17 insignificant 2020   224  1.09680262     2        99.02561        FALSE
18 insignificant 2020   170  0.83239485     3        99.85800        FALSE
19 insignificant 2020    15  0.07344660     4        99.93145        FALSE
20 insignificant 2020    14  0.06855016     5       100.00000        FALSE

CodePudding user response:

I think I would use a couple of temporary variables to help keep track here. Essentially you need to know the top-ranked model in the final year as well as the final year's cumulative values. Then any model that meets either condition (cumulative share below 90 in the final year, or first entry in the final year) is retained.

df %>%
  group_by(Year) %>%
  mutate(Share = 100 * sales / sum(sales),
         order = order(order(-Share))) %>%
  arrange(Year, order, .by_group = TRUE) %>%
  mutate(CumulativeShare = cumsum(Share)) %>%
  ungroup() %>%
  mutate(finalyear = Year == max(Year),
         # each model's cumulative share in the final year
         finval = CumulativeShare[finalyear][match(model, model[finalyear])],
         # TRUE only on the first row of the final year, i.e. the top-ranked model
         finlast = c(FALSE, diff(finalyear) == 1),
         keep = finval < 90 | finlast[finalyear][match(model, model[finalyear])],
         model = ifelse(keep, model, 'insignificant')) %>%
  select(-finalyear, -finval, -finlast, -keep)

With your first example data set, this would look like

#> # A tibble: 20 x 6
#>    model          Year sales  Share order CumulativeShare
#>    <chr>         <dbl> <dbl>  <dbl> <int>           <dbl>
#>  1 A              2017   900 43.3       1            43.3
#>  2 insignificant  2017   456 21.9       2            65.2
#>  3 insignificant  2017   345 16.6       3            81.8
#>  4 insignificant  2017   235 11.3       4            93.1
#>  5 E              2017   144  6.92      5           100  
#>  6 insignificant  2018   555 25.0       1            25.0
#>  7 insignificant  2018   456 20.5       2            45.5
#>  8 insignificant  2018   445 20.0       3            65.5
#>  9 E              2018   434 19.5       4            85.0
#> 10 A              2018   333 15.0       5           100  
#> 11 A              2019  8911 31.6       1            31.6
#> 12 E              2019  5555 19.7       2            51.4
#> 13 insignificant  2019  4567 16.2       3            67.6
#> 14 insignificant  2019  4566 16.2       4            83.8
#> 15 insignificant  2019  4560 16.2       5           100  
#> 16 E              2020  1180 73.6       1            73.6
#> 17 A              2020   224 14.0       2            87.6
#> 18 insignificant  2020   170 10.6       3            98.2
#> 19 insignificant  2020    15  0.936     4            99.1
#> 20 insignificant  2020    14  0.873     5           100

And with your second data set, it would look like this:

#> # A tibble: 20 x 6
#>    model          Year sales   Share order CumulativeShare
#>    <chr>         <dbl> <dbl>   <dbl> <int>           <dbl>
#>  1 insignificant  2017   900 43.3        1            43.3
#>  2 insignificant  2017   456 21.9        2            65.2
#>  3 insignificant  2017   345 16.6        3            81.8
#>  4 insignificant  2017   235 11.3        4            93.1
#>  5 E              2017   144  6.92       5           100  
#>  6 insignificant  2018   555 25.0        1            25.0
#>  7 insignificant  2018   456 20.5        2            45.5
#>  8 insignificant  2018   445 20.0        3            65.5
#>  9 E              2018   434 19.5        4            85.0
#> 10 insignificant  2018   333 15.0        5           100  
#> 11 insignificant  2019  8911 31.6        1            31.6
#> 12 E              2019  5555 19.7        2            51.4
#> 13 insignificant  2019  4567 16.2        3            67.6
#> 14 insignificant  2019  4566 16.2        4            83.8
#> 15 insignificant  2019  4560 16.2        5           100  
#> 16 E              2020 20000 97.9        1            97.9
#> 17 insignificant  2020   224  1.10       2            99.0
#> 18 insignificant  2020   170  0.832      3            99.9
#> 19 insignificant  2020    15  0.0734     4            99.9
#> 20 insignificant  2020    14  0.0686     5           100

Created on 2022-07-14 by the reprex package (v2.0.1)
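If the helper columns feel opaque, here is an equivalent two-step sketch (a variant of the same idea, not the code above): first derive the set of models to keep from the final year, then recode everything else.

library(dplyr)

# Rank models within each year and accumulate shares, as in the question
ranked <- df %>%
  group_by(Year) %>%
  mutate(Share = 100 * sales / sum(sales)) %>%
  arrange(desc(Share), .by_group = TRUE) %>%
  mutate(CumulativeShare = cumsum(Share)) %>%
  ungroup()

final <- filter(ranked, Year == max(Year))

# Keep a model if its final-year cumulative share is below 90%,
# or if it is the top-ranked model that year (covers the single >90% case)
keep <- final$model[final$CumulativeShare < 90 | seq_len(nrow(final)) == 1]

ranked %>% mutate(model = ifelse(model %in% keep, model, "insignificant"))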

CodePudding user response:

This should do the trick:

df %>%
  group_by(Year) %>%
  mutate(Share = 100 * sales / sum(sales),
         order = order(order(-Share))) %>%
  arrange(Year, order, .by_group = TRUE) %>%
  mutate(CumulativeShare = cumsum(Share)) %>%
  ungroup() %>%
  # added this: flag any model whose own share exceeds 90% in the final year
  mutate(dominant = model %in% model[Year == max(Year) & Share > 90]) %>%
  mutate(threshold.90 = model %in% model[Year == max(Year) & CumulativeShare < 90]) %>%
  # and modified that: keep a model if it passes either test, in every year
  mutate(model = ifelse(threshold.90 | dominant, model, 'insignificant'))
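A quick sanity check, assuming the result of the pipeline above is stored in df2: on the second data set, E should be retained in all four years and the remaining four models should collapse to "insignificant".

# counts per model and year
table(df2$model, df2$Year)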