I have dataset with 5900 rows as follows: Merged_char column is a merged column of 4 different columns with socio-demo characteristics and sum column is the sum of how many observations with each set of characteristics belong to one of 4 groups.
Period | Group | Sum | Merged_char |
---|---|---|---|
01.2021 | 1st | 100 | 2_medium_North_55yo |
01.2021 | 2nd | 210 | 2_medium_North_55yo |
01.2021 | 3rd | 350 | 2_medium_North_55yo |
01.2021 | 4th | 170 | 2_medium_North_55yo |
01.2021 | 1st | 12 | 2_medium_North_20yo |
01.2021 | 2nd | 190 | 2_medium_North_20yo |
01.2021 | 3rd | 250 | 2_medium_North_20yo |
01.2021 | 4th | 266 | 2_medium_North_20yo |
01.2021 | 1st | 450 | 2_medium_South_55yo |
I want to obtain a summary with information what percentage each group represents for each combination of socio-demo. So firstly we have to calculate the sum of col "Sum" for every 4 groups: 100 210 350 170 = 830 and then use the formula for percentages like: 100/830100 = 12% for 1st row , 210/830100= 25,3 % etc. For the next set of Merged_char the sum of the sum column will be different, as 12 190 250 266= 718. So the output should look like this:
Period | Group | Sum | Merged_char | Percentiles |
---|---|---|---|---|
01.2021 | 1st | 100 | 2_medium_North_55yo | 12,0 |
01.2021 | 2nd | 210 | 2_medium_North_55yo | 25,3 |
01.2021 | 3rd | 350 | 2_medium_North_55yo | 42,2 |
01.2021 | 4th | 170 | 2_medium_North_55yo | 20,4 |
01.2021 | 1st | 12 | 2_medium_North_20yo | 1,7 |
01.2021 | 2nd | 190 | 2_medium_North_20yo | 26,4 |
01.2021 | 3rd | 250 | 2_medium_North_20yo | 37,0 |
01.2021 | 4th | 266 | 2_medium_North_20yo | 37,5 |
01.2021 | 1st | 450 | 2_medium_South_55yo | ... |
To do that I tried to filter through group list, which have only 4 elements: 1st, 2nd, 3rd, 4th. Problem is with Merged_char column, because in my dataset I have 144 different combinations of it and the making the list with 144 elements is not the best way.
Thank you for helping
CodePudding user response:
library(dplyr)
df%>%
group_by(Merged_char)%>%
mutate(Percentile=Sum/sum(Sum,na.rm=T)*100)%>%
ungroup
Period Group Sum Merged_char Percentile
<dbl> <fct> <int> <fct> <dbl>
1 1.20 1st 100 2_medium_North_55yo 12.0
2 1.20 2nd 210 2_medium_North_55yo 25.3
3 1.20 3rd 350 2_medium_North_55yo 42.2
4 1.20 4th 170 2_medium_North_55yo 20.5
5 1.20 1st 12 2_medium_North_20yo 1.67
6 1.20 2nd 190 2_medium_North_20yo 26.5
7 1.20 3rd 250 2_medium_North_20yo 34.8
8 1.20 4th 266 2_medium_North_20yo 37.0
9 1.20 1st 450 2_medium_South_55yo 100