Home > Software design >  Percentage of total
Percentage of total

Time:08-16

I have a dataset in which customers ask information about certain countries. In one call the customer can ask about one country, or multiple. The countries are stored in a separate column, in case of more than one country they are separated by a comma.

My question is as follows. If I want to know the percentage of questions regarding a certain country in a certain year, what is the correct way to calculate this?

Do I separate the rows with multiple countries (increasing the number of observations), count the total, then count the total of a country and then calcultate the percentage?

df %>% 
  
  separate_rows(Country) %>% 
  
  group_by(Year) %>% 
  
  mutate(n.Total = n()) %>%
  
  group_by(Year, Country) %>% 
  
  mutate(n.Country = n()) %>% 
  
  group_by(Year) %>% 
  
  mutate(Percentage = n.Country/n.Total*100) %>%
  
  distinct(Country, .keep_all = T)

Or do I keep the original number of observations, count the occurance of country A per year, and then calculate the percentage?

df %>% 
  
 group_by(Year) %>%
  
  mutate(n.Total = n()) %>% 
  
  separate_rows(Country) %>% 
  
  group_by(Year, Country) %>% 
  
  mutate(n.Country = n()) %>% 
  
  group_by(Year) %>% 
  
  mutate(Percentage = n.Country/n.Total*100) %>%
  
  distinct(Country, .keep_all = T)

Apologies if this is a very basic question. For some reason I can't wrap my head around it.

Df:

structure(list(Year = c(2019, 2019, 2019, 2019, 2019, 2019, 2019, 
2019, 2019, 2019, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 
2020, 2020), Country = c("A", "B", "A,B", "C", "C", "A", "A, C", 
"B", "B,C", "A", "A,C", "C", "A", "B,C", "A", "B", "A", "B", 
"C", "B,C")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-20L))

CodePudding user response:

It may need a sequence column before we split up

library(dplyr)
library(tidyr)
df %>%
  mutate(rn = row_number()) %>% 
  separate_rows(Country) %>%
  group_by(Year) %>% 
  mutate(n.Total = n()) %>% 
  group_by(Country, .add = TRUE) %>%
  mutate(n.Country = n()) %>% 
  ungroup %>% 
  mutate(Percentage = n.Country/n.Total * 100) %>% 
  distinct(rn, .keep_all = TRUE)

CodePudding user response:

One option:

df |> 
  separate_rows(Country) |> 
  count(Year, Country, name = "nobs") |> 
  left_join(
    count(df, Year, name = "nobs_year"), 
    by = 'Year'
  ) |> 
  mutate(percentage = nobs / nobs_year)



#    Year Country  nobs nobs_year percentage
#   <dbl> <chr>   <int>     <int>      <dbl>
# 1  2019 A           5        10        0.5
# 2  2019 B           4        10        0.4
# 3  2019 C           4        10        0.4
# 4  2020 A           4        10        0.4
# 5  2020 B           4        10        0.4
# 6  2020 C           5        10        0.5
  • Related