Home > database >  Finding the percentage of a group over the sum of all groups for a given date in R
Finding the percentage of a group over the sum of all groups for a given date in R

Time:07-26

I am trying to create a line graph of the percentage of total events per group over time, I don't know if that wording entirely makes sense but hopefully the example will help

I have the following dataframes:

original_dataframe

|Group|Date      |
|-----|----------|
|A    |01/01/2022|
|A    |01/01/2022|
|A    |01/01/2022|
|A    |01/01/2022|
|A    |01/01/2022|
|A    |01/01/2022|
|B    |01/01/2022|
|B    |01/01/2022|
|C    |01/01/2022|
|C    |01/01/2022|
|A    |02/01/2022|
|A    |02/01/2022|
|A    |02/01/2022|
|B    |02/01/2022|
|B    |02/01/2022|
|B    |02/01/2022|
|B    |02/01/2022|
|B    |02/01/2022|
|C    |02/02/2022|
|C    |02/02/2022|

summary_dataframe

|Group|Date      |Count|
|-----|----------|-----|
|A    |01/01/2022|6    |
|B    |01/01/2022|2    |
|C    |01/01/2022|2    |
|A    |02/01/2022|3    |
|B    |02/01/2022|5    |
|C    |02/02/2022|2    |

And want to create a dataframe that contains the following for plotting:

|Group|Date      |Perc|
|-----|----------|----|
|A    |01/01/2022|60% |
|B    |01/01/2022|20% |
|C    |01/01/2022|20% |
|A    |02/01/2022|30% |
|B    |02/01/2022|50% |
|C    |02/02/2022|20% |

CodePudding user response:

Something like below should get you there

I started out from the original data.frame. Everything is in one code block, but you can split the code where I marked it.

library(dplyr)
library(ggplot2)

df1 %>% 
  group_by(Date, Group) %>% 
  summarise(cnt = n()) %>% 
  mutate(Perc = cnt / sum(cnt)) %>% # split here to start ggplot with aggregated data.
  ggplot(aes(x = Date, y = Perc, color = Group))   
  geom_line()  
  ylim(0, 1)

enter image description here

data:

df1 <- structure(list(Group = c("A", "A", "A", "A", "A", "A", "B", "B", 
"C", "C", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C"), 
    Date = structure(c(18993, 18993, 18993, 18993, 18993, 18993, 
    18993, 18993, 18993, 18993, 18994, 18994, 18994, 18994, 18994, 
    18994, 18994, 18994, 18994, 18994), class = "Date")), row.names = c(NA, 
-20L), class = "data.frame")

CodePudding user response:

You can use group_by and summarise for this

library(tidyverse)

df <- structure(list(group_date = c("A", "A", "A", "A", "A", "A", "B", 
                              "B", "C", "C", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C"
), x = c("01/01/2022", "01/01/2022", "01/01/2022", "01/01/2022", 
         "01/01/2022", "01/01/2022", "01/01/2022", "01/01/2022", "01/01/2022", 
         "01/01/2022", "02/01/2022", "02/01/2022", "02/01/2022", "02/01/2022", 
         "02/01/2022", "02/01/2022", "02/01/2022", "02/01/2022", "02/02/2022", 
         "02/02/2022")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
                                                                                 -20L))


df %>% 
  count(group_date, x) %>% 
  group_by(x) %>% 
  summarise(group_date = group_date,
            Perc = n*100/(sum(n)))
# A tibble: 6 × 3
# Groups:   x [3]
  x          group_date  Perc
  <chr>      <chr>      <dbl>
1 01/01/2022 A           60  
2 01/01/2022 B           20  
3 01/01/2022 C           20  
4 02/01/2022 A           37.5
5 02/01/2022 B           62.5
6 02/02/2022 C          100  
  • Related