I have the following dataset:
my_data = structure(list(state = c("State A", "State A", "State A", "State A",
"State B", "State B", "State B", "State B", "State A", "State A",
"State A", "State A", "State B", "State B", "State B", "State B"
), city = c("city 1", "city 1", "city 2", "city 2", "city 3",
"city 3", "city 4", "city 4", "city 1", "city 1", "city 2", "city 2",
"city 3", "city 3", "city 4", "city 4"), vaccine = c("yes", "no",
"yes", "no", "yes", "no", "yes", "no", "yes", "no", "yes", "no",
"yes", "no", "yes", "no"), counts = c(1221, 2233, 1344, 887,
9862, 2122, 8772, 2341, 1221, 2233, 1344, 887, 9862, 2122, 8772,
2341), year = c(2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021,
2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022)), row.names = c(NA,
-16L), class = "data.frame")
My Question: For each city in each year, I want to find the percentage of people who took the vaccine.
The final result would look something like this (I just made up some numbers):
state city vaccine Relative_Percentage year
1 State A city 1 yes 0.6 2021
2 State A city 1 no 0.4 2021
3 State A city 2 yes 0.3 2021
4 State A city 2 no 0.7 2021
Using this post as an example (Relative frequencies / proportions with dplyr), I tried the following code:
library(dplyr)
my_data %>%
  group_by(year, state, city, vaccine) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
But I don't think my code is correct, since all percentages come out to exactly 0.5:
`summarise()` has grouped output by 'year', 'state', 'city'. You can override using the `.groups` argument.
# A tibble: 16 x 6
# Groups: year, state, city [8]
year state city vaccine n freq
<dbl> <chr> <chr> <chr> <int> <dbl>
1 2021 State A city 1 no 1 0.5
2 2021 State A city 1 yes 1 0.5
Can someone please show me how to fix this problem?
Thanks!
CodePudding user response:
For each city at each year, I want to find out the percent of people who took a vaccine.
Don't include vaccine in your grouping. You can keep state in the grouping to differentiate the city values. Also, if you want a percentage based on counts, then you need to calculate it inside summarise; computing it afterwards is not possible, since counts has been dropped by that point. Using n in your calculation of freq only gives the proportion of rows in each group (here always two rows, one "yes" and one "no"), which is why every value is exactly 0.5 rather than the percent of people who took a vaccine.
And since you want to know which vaccine has which frequency, just add it to the summary.
my_data %>%
  group_by(year, state, city) %>%
  summarise(vaccine, n = n(), freq = counts / sum(counts), .groups = "drop")
# # A tibble: 16 × 6
# year state city vaccine n freq
# <dbl> <chr> <chr> <chr> <int> <dbl>
# 1 2021 State A city 1 yes 2 0.354
# 2 2021 State A city 1 no 2 0.646
# 3 2021 State A city 2 yes 2 0.602
# 4 2021 State A city 2 no 2 0.398
# 5 2021 State B city 3 yes 2 0.823
# 6 2021 State B city 3 no 2 0.177
# 7 2021 State B city 4 yes 2 0.789
# 8 2021 State B city 4 no 2 0.211
# 9 2022 State A city 1 yes 2 0.354
# 10 2022 State A city 1 no 2 0.646
# 11 2022 State A city 2 yes 2 0.602
# 12 2022 State A city 2 no 2 0.398
# 13 2022 State B city 3 yes 2 0.823
# 14 2022 State B city 3 no 2 0.177
# 15 2022 State B city 4 yes 2 0.789
# 16 2022 State B city 4 no 2 0.211
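Side note: if you are on dplyr 1.1.0 or newer, summarise() returning more than one row per group is deprecated there, and reframe() is the suggested replacement for multi-row summaries. A minimal sketch of the same computation, assuming that version is available:
my_data %>%
  group_by(year, state, city) %>%
  # reframe() works like summarise() but allows multiple rows per group
  # and always returns an ungrouped frame, so no .groups argument is needed
  reframe(vaccine, n = n(), freq = counts / sum(counts))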
Frankly, we don't "need" summarize here; we can just mutate the frequency in, since counts already appears to be aggregated.
my_data %>%
  group_by(year, state, city) %>%
  mutate(freq = counts / sum(counts)) %>%
  ungroup()
# # A tibble: 16 × 6
# state city vaccine counts year freq
# <chr> <chr> <chr> <dbl> <dbl> <dbl>
# 1 State A city 1 yes 1221 2021 0.354
# 2 State A city 1 no 2233 2021 0.646
# 3 State A city 2 yes 1344 2021 0.602
# 4 State A city 2 no 887 2021 0.398
# 5 State B city 3 yes 9862 2021 0.823
# 6 State B city 3 no 2122 2021 0.177
# 7 State B city 4 yes 8772 2021 0.789
# 8 State B city 4 no 2341 2021 0.211
# 9 State A city 1 yes 1221 2022 0.354
# 10 State A city 1 no 2233 2022 0.646
# 11 State A city 2 yes 1344 2022 0.602
# 12 State A city 2 no 887 2022 0.398
# 13 State B city 3 yes 9862 2022 0.823
# 14 State B city 3 no 2122 2022 0.177
# 15 State B city 4 yes 8772 2022 0.789
# 16 State B city 4 no 2341 2022 0.211
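And if you only care about the share of people who did take the vaccine, you could filter the result afterwards; a small sketch building on the pipeline above:
my_data %>%
  group_by(year, state, city) %>%
  mutate(freq = counts / sum(counts)) %>%
  ungroup() %>%
  # keep only the "yes" rows, i.e. the vaccinated share per city and year
  filter(vaccine == "yes")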