Here I have sales data by year and model:
df <- data.frame (model = c("A","B","C","D","E","A","B","C","D","E","A","B","C","D","E","A","B","C","D","E"),
Year = c(2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020),
sales = c(900,235,456,345,144,333,555,445,456,434,8911,4560,4567,4566,5555,224,14,15,170,1180))
model Year sales
1 A 2017 900
..................
17 B 2020 14
18 C 2020 15
19 D 2020 170
20 E 2020 1180
Here I add the share & cumulative shares columns and apply the following conditional statement: If the model has a cumulative share > 90% in 2020, it is categorized as "insignificant". So the condition is applied to only 2020 and next the result is spread over the entire period. For instance, If I get E, A models and others categorised as insignificant in 2020, next I need to separate E and A models and convert others as insignificant every year.
df2 <- df %>%
group_by(Year) %>% mutate(Share = 100 * sales/ sum(sales),
order = order(order(-Share))) %>% arrange(Year, order, by_group = TRUE) %>%
mutate(CumulativeShare= cumsum(Share)) %>%ungroup() %>%
mutate(threshold.90 = model %in% model[Year == max(Year) & CumulativeShare < 90]) %>%
mutate(model = ifelse(threshold.90, model, 'insignificant'))
model Year sales Share order CumulativeShare threshold.90
1 A 2017 900 43.2692308 1 43.26923 TRUE
2 insignificant 2017 456 21.9230769 2 65.19231 FALSE
3 insignificant 2017 345 16.5865385 3 81.77885 FALSE
4 insignificant 2017 235 11.2980769 4 93.07692 FALSE
5 E 2017 144 6.9230769 5 100.00000 TRUE
6 insignificant 2018 555 24.9662618 1 24.96626 FALSE
7 insignificant 2018 456 20.5128205 2 45.47908 FALSE
8 insignificant 2018 445 20.0179937 3 65.49708 FALSE
9 E 2018 434 19.5231669 4 85.02024 TRUE
10 A 2018 333 14.9797571 5 100.00000 TRUE
11 A 2019 8911 31.6452999 1 31.64530 TRUE
12 E 2019 5555 19.7272630 2 51.37256 TRUE
13 insignificant 2019 4567 16.2186157 3 67.59118 FALSE
14 insignificant 2019 4566 16.2150645 4 83.80624 FALSE
15 insignificant 2019 4560 16.1937569 5 100.00000 FALSE
16 E 2020 1180 73.6119775 1 73.61198 TRUE
17 A 2020 224 13.9737991 2 87.58578 TRUE
18 insignificant 2020 170 10.6051154 3 98.19089 FALSE
19 insignificant 2020 15 0.9357455 4 99.12664 FALSE
20 insignificant 2020 14 0.8733624 5 100.00000 FALSE
However, if single model has share above 90% in 2020 logically all the model would have cumulative share above 90%. Hence, all of them are categorised as "insignificant". For example if we change last value from 1180 to 20000 in dataframe, output well be like this:
df <- data.frame (model = c("A","B","C","D","E","A","B","C","D","E","A","B","C","D","E","A","B","C","D","E"),
Year = c(2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020),
sales = c(900,235,456,345,144,333,555,445,456,434,8911,4560,4567,4566,5555,224,14,15,170,20000))
df2 <- df %>% ...:
model Year sales Share order CumulativeShare threshold.90
1 insignificant 2017 900 43.26923077 1 43.26923 FALSE
2 insignificant 2017 456 21.92307692 2 65.19231 FALSE
3 insignificant 2017 345 16.58653846 3 81.77885 FALSE
4 insignificant 2017 235 11.29807692 4 93.07692 FALSE
5 insignificant 2017 144 6.92307692 5 100.00000 FALSE
6 insignificant 2018 555 24.96626181 1 24.96626 FALSE
7 insignificant 2018 456 20.51282051 2 45.47908 FALSE
8 insignificant 2018 445 20.01799370 3 65.49708 FALSE
9 insignificant 2018 434 19.52316689 4 85.02024 FALSE
10 insignificant 2018 333 14.97975709 5 100.00000 FALSE
11 insignificant 2019 8911 31.64529990 1 31.64530 FALSE
12 insignificant 2019 5555 19.72726304 2 51.37256 FALSE
13 insignificant 2019 4567 16.21861572 3 67.59118 FALSE
14 insignificant 2019 4566 16.21506446 4 83.80624 FALSE
15 insignificant 2019 4560 16.19375688 5 100.00000 FALSE
16 insignificant 2020 20000 97.92880576 1 97.92881 FALSE
17 insignificant 2020 224 1.09680262 2 99.02561 FALSE
18 insignificant 2020 170 0.83239485 3 99.85800 FALSE
19 insignificant 2020 15 0.07344660 4 99.93145 FALSE
20 insignificant 2020 14 0.06855016 5 100.00000 FALSE
So, I want to avoid this specific situation and add one more condition:
If a single model's SHARE is more than 90% in 2020, it should be left separate and all the other models should be categorised as insignificant.
Expected output:
model Year sales Share order CumulativeShare threshold.90
1 insignificant 2017 900 43.26923077 1 43.26923 FALSE
2 insignificant 2017 456 21.92307692 2 65.19231 FALSE
3 insignificant 2017 345 16.58653846 3 81.77885 FALSE
4 insignificant 2017 235 11.29807692 4 93.07692 FALSE
5 E 2017 144 6.92307692 5 100.00000 FALSE
6 insignificant 2018 555 24.96626181 1 24.96626 FALSE
7 insignificant 2018 456 20.51282051 2 45.47908 FALSE
8 insignificant 2018 445 20.01799370 3 65.49708 FALSE
9 E 2018 434 19.52316689 4 85.02024 FALSE
10 insignificant 2018 333 14.97975709 5 100.00000 FALSE
11 insignificant 2019 8911 31.64529990 1 31.64530 FALSE
12 E 2019 5555 19.72726304 2 51.37256 FALSE
13 insignificant 2019 4567 16.21861572 3 67.59118 FALSE
14 insignificant 2019 4566 16.21506446 4 83.80624 FALSE
15 insignificant 2019 4560 16.19375688 5 100.00000 FALSE
16 E 20000 97.92880576 1 97.92881 FALSE
17 insignificant 2020 224 1.09680262 2 99.02561 FALSE
18 insignificant 2020 170 0.83239485 3 99.85800 FALSE
19 insignificant 2020 15 0.07344660 4 99.93145 FALSE
20 insignificant 2020 14 0.06855016 5 100.00000 FALSE
CodePudding user response:
I think I would use a couple of temporary variables to help you keep track here. Essentially you need to know the first-placed model in the final year as well as the cumulative values of the final year. Then any model that meets the conditions 'Less than 90 in the final year OR first entry in the final year' is retained.
df %>%
group_by(Year) %>%
mutate(Share = 100 * sales/ sum(sales),
order = order(order(-Share))) %>%
arrange(Year, order, by_group = TRUE) %>%
mutate(CumulativeShare= cumsum(Share)) %>%
ungroup() %>%
mutate(finalyear = Year == max(Year),
finval = CumulativeShare[finalyear][match(model, model[finalyear])],
finlast = c(FALSE, diff(finalyear) == 1),
keep = finval <90 | finlast[finalyear][match(model, model[finalyear])],
model = ifelse(keep, model, 'insignificant')) %>%
select(-finalyear, -finval, -finlast, -keep)
With your first example data set, this would look like
#> # A tibble: 20 x 6
#> model Year sales Share order CumulativeShare
#> <chr> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 A 2017 900 43.3 1 43.3
#> 2 insignificant 2017 456 21.9 2 65.2
#> 3 insignificant 2017 345 16.6 3 81.8
#> 4 insignificant 2017 235 11.3 4 93.1
#> 5 E 2017 144 6.92 5 100
#> 6 insignificant 2018 555 25.0 1 25.0
#> 7 insignificant 2018 456 20.5 2 45.5
#> 8 insignificant 2018 445 20.0 3 65.5
#> 9 E 2018 434 19.5 4 85.0
#> 10 A 2018 333 15.0 5 100
#> 11 A 2019 8911 31.6 1 31.6
#> 12 E 2019 5555 19.7 2 51.4
#> 13 insignificant 2019 4567 16.2 3 67.6
#> 14 insignificant 2019 4566 16.2 4 83.8
#> 15 insignificant 2019 4560 16.2 5 100
#> 16 E 2020 1180 73.6 1 73.6
#> 17 A 2020 224 14.0 2 87.6
#> 18 insignificant 2020 170 10.6 3 98.2
#> 19 insignificant 2020 15 0.936 4 99.1
#> 20 insignificant 2020 14 0.873 5 100
And with your second data set, it would look like this:
#> # A tibble: 20 x 6
#> model Year sales Share order CumulativeShare
#> <chr> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 insignificant 2017 900 43.3 1 43.3
#> 2 insignificant 2017 456 21.9 2 65.2
#> 3 insignificant 2017 345 16.6 3 81.8
#> 4 insignificant 2017 235 11.3 4 93.1
#> 5 E 2017 144 6.92 5 100
#> 6 insignificant 2018 555 25.0 1 25.0
#> 7 insignificant 2018 456 20.5 2 45.5
#> 8 insignificant 2018 445 20.0 3 65.5
#> 9 E 2018 434 19.5 4 85.0
#> 10 insignificant 2018 333 15.0 5 100
#> 11 insignificant 2019 8911 31.6 1 31.6
#> 12 E 2019 5555 19.7 2 51.4
#> 13 insignificant 2019 4567 16.2 3 67.6
#> 14 insignificant 2019 4566 16.2 4 83.8
#> 15 insignificant 2019 4560 16.2 5 100
#> 16 E 2020 20000 97.9 1 97.9
#> 17 insignificant 2020 224 1.10 2 99.0
#> 18 insignificant 2020 170 0.832 3 99.9
#> 19 insignificant 2020 15 0.0734 4 99.9
#> 20 insignificant 2020 14 0.0686 5 100
Created on 2022-07-14 by the reprex package (v2.0.1)
CodePudding user response:
This should do the trick:
df %>%
group_by(Year) %>%
mutate(Share = 100 * sales/ sum(sales),order = order(order(-Share))) %>% arrange(Year,order, by_group = TRUE) %>%
mutate(CumulativeShare= cumsum(Share)) %>% ungroup() %>%
#added this
mutate(test = ifelse(Year == max(Year) & Share > 90, model, NA)) %>%
mutate(threshold.90 = model %in% model[Year == max(Year) & CumulativeShare < 90] ) %>%
#and modified that
mutate(model = ifelse(threshold.90 | !is.na(test), model, 'insignificant'))