Home > Blockchain >  R, dplyr. The overall percentage of a grouped table
R, dplyr. The overall percentage of a grouped table

Time:10-28

I am looking to calculate the overall percentage for each row in relation to the total of all rows. However my table has been grouped by multiple columns, so the percentage calculation only works on the next group up to get the 'grouped percentage'.

I am looking for the global percentage, and everything I have googled people are asking for the percentage per group. I haven't found any previous queries on the overall percentage yet.

Hopefully there is a simple dplyr solution as this is quite a simple operation, very similar to the options you get on an excel pivot chart.

I have tried ungrouping and more summarising, also dividing by a recalculation of the total with no luck.

Code:

df %>% 
  group_by(cat1, cat2) %>% 
  summarise(`Number` = n_distinct(ID)) %>% 
  mutate(Share = paste0(round((`Number`/sum(`Number`))*100, digits = 0), '%'))

what I get: we can see the percentage figure is per the first grouped column.

# cat1, cat2, Number, Share
# A     D     25289   69  
# A     E     4853    13  
# A     F     6541    18  
# B     E     7686    100  
# C     F     4893    100  

What I am looking for:

# cat1, cat2, Number, Share
# A     D     25289   51  
# A     E     4853    10  
# A     F     6541    14  
# B     E     7686    16 
# C     F     4893    10 

CodePudding user response:

Just ungroup between the summarise and the mutate.

Sample data

df <- tibble(
        cat1=sample(c("A", "B", "C"), prob=c(0.5, 0.3, 0.2), replace=TRUE, size=1000), 
        cat2=sample(c("D", "E", "F"), replace=TRUE, size=1000),
        ID=1:1000
      )

Solve the problem

df %>% 
  group_by(cat1, cat2) %>% 
  summarise(Number = n_distinct(ID)) %>% 
  ungroup() %>% 
  mutate(Share = paste0(round((Number/sum(Number))*100, digits = 0), '%'))
# A tibble: 9 × 4
  cat1  cat2  Number Share
  <chr> <chr>  <int> <chr>
1 A     D        172 17%  
2 A     E        156 16%  
3 A     F        179 18%  
4 B     D         95 10%  
5 B     E        102 10%  
6 B     F        104 10%  
7 C     D         64 6%   
8 C     E         67 7%   
9 C     F         61 6% 

Or add .groups="drop" to the summarise call:

df %>% 
  group_by(cat1, cat2) %>% 
  summarise(Number = n_distinct(ID), .groups="drop") %>% 
  mutate(Share = paste0(round((Number/sum(Number))*100, digits = 0), '%'))

giving the same result.

  • Related