I am looking to calculate the overall percentage for each row in relation to the total of all rows. However my table has been grouped by multiple columns, so the percentage calculation only works on the next group up to get the 'grouped percentage'.
I am looking for the global percentage, and everything I have googled people are asking for the percentage per group. I haven't found any previous queries on the overall percentage yet.
Hopefully there is a simple dplyr solution as this is quite a simple operation, very similar to the options you get on an excel pivot chart.
I have tried ungrouping and more summarising, also dividing by a recalculation of the total with no luck.
Code:
df %>%
group_by(cat1, cat2) %>%
summarise(`Number` = n_distinct(ID)) %>%
mutate(Share = paste0(round((`Number`/sum(`Number`))*100, digits = 0), '%'))
what I get: we can see the percentage figure is per the first grouped column.
# cat1, cat2, Number, Share
# A D 25289 69
# A E 4853 13
# A F 6541 18
# B E 7686 100
# C F 4893 100
What I am looking for:
# cat1, cat2, Number, Share
# A D 25289 51
# A E 4853 10
# A F 6541 14
# B E 7686 16
# C F 4893 10
CodePudding user response:
Just ungroup between the summarise
and the mutate
.
Sample data
df <- tibble(
cat1=sample(c("A", "B", "C"), prob=c(0.5, 0.3, 0.2), replace=TRUE, size=1000),
cat2=sample(c("D", "E", "F"), replace=TRUE, size=1000),
ID=1:1000
)
Solve the problem
df %>%
group_by(cat1, cat2) %>%
summarise(Number = n_distinct(ID)) %>%
ungroup() %>%
mutate(Share = paste0(round((Number/sum(Number))*100, digits = 0), '%'))
# A tibble: 9 × 4
cat1 cat2 Number Share
<chr> <chr> <int> <chr>
1 A D 172 17%
2 A E 156 16%
3 A F 179 18%
4 B D 95 10%
5 B E 102 10%
6 B F 104 10%
7 C D 64 6%
8 C E 67 7%
9 C F 61 6%
Or add .groups="drop"
to the summarise
call:
df %>%
group_by(cat1, cat2) %>%
summarise(Number = n_distinct(ID), .groups="drop") %>%
mutate(Share = paste0(round((Number/sum(Number))*100, digits = 0), '%'))
giving the same result.