I have the following dataset:
my_data = structure(list(state = c("State A", "State A", "State A", "State A",
"State B", "State B", "State B", "State B", "State A", "State A",
"State A", "State A", "State B", "State B", "State B", "State B"
), city = c("city 1", "city 1", "city 2", "city 2", "city 3",
"city 3", "city 4", "city 4", "city 1", "city 1", "city 2", "city 2",
"city 3", "city 3", "city 4", "city 4"), vaccine = c("yes", "no",
"yes", "no", "yes", "no", "yes", "no", "yes", "no", "yes", "no",
"yes", "no", "yes", "no"), counts = c(1221, 2233, 1344, 887,
9862, 2122, 8772, 2341, 1221, 2233, 1344, 887, 9862, 2122, 8772,
2341), year = c(2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021,
2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022)), row.names = c(NA,
-16L), class = "data.frame")
My Question: For each city in each year, I want to find the percentage of people who took the vaccine.
The final result would look something like this (I just made up some numbers):
state city vaccine Relative_Percentage year
1 State A city 1 yes 0.6 2021
2 State A city 1 no 0.4 2021
3 State A city 2 yes 0.3 2021
4 State A city 2 no 0.7 2021
Using this post as an example (Relative frequencies / proportions with dplyr), I tried the following code:
library(dplyr)
my_data %>%
  group_by(year, state, city, vaccine) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
But I don't think my code is correct, since all percentages come out to exactly 0.5:
`summarise()` has grouped output by 'year', 'state', 'city'. You can override using the `.groups` argument.
# A tibble: 16 x 6
# Groups: year, state, city [8]
year state city vaccine n freq
<dbl> <chr> <chr> <chr> <int> <dbl>
1 2021 State A city 1 no 1 0.5
2 2021 State A city 1 yes 1 0.5
Can someone please show me how to fix this problem?
Thanks!
CodePudding user response:
For each city at each year, I want to find out the percent of people who took a vaccine.
Don't include vaccine in your grouping. You can keep state in the grouping to differentiate the city values. Also, if you want a percentage based on counts, then you need to calculate it inside summarise; computing it afterwards is not possible, since counts has been dropped by that point. Using n in your calculation of freq only gives the proportion of rows in each group (here always two rows, one "yes" and one "no"), which is why every value is exactly 0.5 rather than the percent of people who took a vaccine.
And since you want to know which vaccine has which frequency, just add it to the summary.
my_data %>%
  group_by(year, state, city) %>%
  summarise(vaccine, n = n(), freq = counts / sum(counts), .groups = "drop")
# # A tibble: 16 × 6
# year state city vaccine n freq
# <dbl> <chr> <chr> <chr> <int> <dbl>
# 1 2021 State A city 1 yes 2 0.354
# 2 2021 State A city 1 no 2 0.646
# 3 2021 State A city 2 yes 2 0.602
# 4 2021 State A city 2 no 2 0.398
# 5 2021 State B city 3 yes 2 0.823
# 6 2021 State B city 3 no 2 0.177
# 7 2021 State B city 4 yes 2 0.789
# 8 2021 State B city 4 no 2 0.211
# 9 2022 State A city 1 yes 2 0.354
# 10 2022 State A city 1 no 2 0.646
# 11 2022 State A city 2 yes 2 0.602
# 12 2022 State A city 2 no 2 0.398
# 13 2022 State B city 3 yes 2 0.823
# 14 2022 State B city 3 no 2 0.177
# 15 2022 State B city 4 yes 2 0.789
# 16 2022 State B city 4 no 2 0.211
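Side note: if you are on dplyr 1.1.0 or newer, summarise() returning more than one row per group is deprecated there, and reframe() is the suggested replacement for multi-row summaries. A minimal sketch of the same computation, assuming that version is available:
my_data %>%
  group_by(year, state, city) %>%
  # reframe() works like summarise() but allows multiple rows per group
  # and always returns an ungrouped frame, so no .groups argument is needed
  reframe(vaccine, n = n(), freq = counts / sum(counts))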
Frankly, we don't "need" summarize here; we can just mutate the frequency in, since counts already appears to be aggregated.
my_data %>%
  group_by(year, state, city) %>%
  mutate(freq = counts / sum(counts)) %>%
  ungroup()
# # A tibble: 16 × 6
# state city vaccine counts year freq
# <chr> <chr> <chr> <dbl> <dbl> <dbl>
# 1 State A city 1 yes 1221 2021 0.354
# 2 State A city 1 no 2233 2021 0.646
# 3 State A city 2 yes 1344 2021 0.602
# 4 State A city 2 no 887 2021 0.398
# 5 State B city 3 yes 9862 2021 0.823
# 6 State B city 3 no 2122 2021 0.177
# 7 State B city 4 yes 8772 2021 0.789
# 8 State B city 4 no 2341 2021 0.211
# 9 State A city 1 yes 1221 2022 0.354
# 10 State A city 1 no 2233 2022 0.646
# 11 State A city 2 yes 1344 2022 0.602
# 12 State A city 2 no 887 2022 0.398
# 13 State B city 3 yes 9862 2022 0.823
# 14 State B city 3 no 2122 2022 0.177
# 15 State B city 4 yes 8772 2022 0.789
# 16 State B city 4 no 2341 2022 0.211
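And if you only care about the share of people who did take the vaccine, you could filter the result afterwards; a small sketch building on the pipeline above:
my_data %>%
  group_by(year, state, city) %>%
  mutate(freq = counts / sum(counts)) %>%
  ungroup() %>%
  # keep only the "yes" rows, i.e. the vaccinated share per city and year
  filter(vaccine == "yes")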