I have a dataset in which customers ask information about certain countries. In one call the customer can ask about one country, or multiple. The countries are stored in a separate column, in case of more than one country they are separated by a comma.
My question is as follows. If I want to know the percentage of questions regarding a certain country in a certain year, what is the correct way to calculate this?
Do I separate the rows with multiple countries (increasing the number of observations), count the total, then count the total of a country and then calcultate the percentage?
df %>%
separate_rows(Country) %>%
group_by(Year) %>%
mutate(n.Total = n()) %>%
group_by(Year, Country) %>%
mutate(n.Country = n()) %>%
group_by(Year) %>%
mutate(Percentage = n.Country/n.Total*100) %>%
distinct(Country, .keep_all = T)
Or do I keep the original number of observations, count the occurance of country A per year, and then calculate the percentage?
df %>%
group_by(Year) %>%
mutate(n.Total = n()) %>%
separate_rows(Country) %>%
group_by(Year, Country) %>%
mutate(n.Country = n()) %>%
group_by(Year) %>%
mutate(Percentage = n.Country/n.Total*100) %>%
distinct(Country, .keep_all = T)
Apologies if this is a very basic question. For some reason I can't wrap my head around it.
Df:
structure(list(Year = c(2019, 2019, 2019, 2019, 2019, 2019, 2019,
2019, 2019, 2019, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020,
2020, 2020), Country = c("A", "B", "A,B", "C", "C", "A", "A, C",
"B", "B,C", "A", "A,C", "C", "A", "B,C", "A", "B", "A", "B",
"C", "B,C")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-20L))
CodePudding user response:
It may need a sequence column before we split up
library(dplyr)
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
separate_rows(Country) %>%
group_by(Year) %>%
mutate(n.Total = n()) %>%
group_by(Country, .add = TRUE) %>%
mutate(n.Country = n()) %>%
ungroup %>%
mutate(Percentage = n.Country/n.Total * 100) %>%
distinct(rn, .keep_all = TRUE)
CodePudding user response:
One option:
df |>
separate_rows(Country) |>
count(Year, Country, name = "nobs") |>
left_join(
count(df, Year, name = "nobs_year"),
by = 'Year'
) |>
mutate(percentage = nobs / nobs_year)
# Year Country nobs nobs_year percentage
# <dbl> <chr> <int> <int> <dbl>
# 1 2019 A 5 10 0.5
# 2 2019 B 4 10 0.4
# 3 2019 C 4 10 0.4
# 4 2020 A 4 10 0.4
# 5 2020 B 4 10 0.4
# 6 2020 C 5 10 0.5