I'm trying to create a loop or a function for the dataset below.
For overall_health, outlook_life, and health_mental- I am trying to calculate the proportion by sex, age_group, income and education. In addition, I would like to create two separate outputs by the geography (area1 and area2).
Please note that I do not want to use the pivot longer function.
Thanks in advance for the help!
Here's the a sample data frame (survey data):
library(tidyverse)
library(dplyr)
df <- data.frame (overall_health = c("poor", "good", "excellent", "poor", "good", "poor", "poor", "excellent"),
outlook_life = c("good", "excellent", "excellent", "poor", "excellent", "poor", "excellent", "poor"),
health_mental = c("poor", "poor", "excellent", "poor", "poor", "poor", "excellent", "good"),
sex = c("F", "M", "M", "F", "F", "M", "F", "M"),
age_group = c("50-54", "60-64", "80 ", "70-74", "40-44", "45-49", "60-64", "65-69"),
income = c("$<40,000", "$50,000-79,000", "$80,000-110,000", "$111,000 ", "$<40,000", "$<40,000", "$50,000-79,000", "$80,000-110,000"),
education = c("HS", "College", "Bachelors", "Masters", "HS", "College", "Bachelors", "Masters"),
geography= c("area1", "area2", "area1", "area2", "area2", "area1", "area2", "area1"))
Typically, one would calculate the proportion by age group as such:
df_new <- df %>%
group_by(overall_health, age_group, geography) %>%
summarise(count = n(),
total = 8,
proportion = count/total *100)
However, I would like to create a loop/function to calculate these portions for overall_health, outlook_life, and health_mental by: age_group, sex, income, education (in geography= "area1") I would like to do the same or geography= "area 2".
So for overall_health, outlook_life, and health_mental I would have multiple outputs by age_group, sex, income, education that look like this:
CodePudding user response:
library(tidyverse)
## unique geographies:
geos <- unique(df$geography)
## You can input combinations to a list to iterate over and label accordingly.
vars <- c("health_mental", "overall_health", "outlook_life")
## expand.grid will create every possible combination between the two vectors you supply it:
combinations <- expand.grid(c("age_group", "sex"), vars, stringsAsFactors = F)
combinations$label <- paste(combinations$Var1, combinations$Var2, sep = "_")
## list to output dataframe results to:
output <- list()
## Outer for loop will iterate through each of your geographies.
for (geo in geos){
## Inner loop will iterate through each of your combinations
for (combo in 1:nrow(combinations)){
## outputs are labelled based on combination geo and stored in output list.
output_label <- paste(combinations[[combo,"label"]],geo,sep="_")
temp <- df %>%
filter(geography == geo) %>%
group_by_at(combinations[combo,1:2] %>% unlist()%>%unname()) %>%
summarise(count = n(),
total = nrow(.),
proportion = count/nrow(.) *100)
output[[output_label]] <- temp
# You could also export each csv iteratively, with proper labelling automatically as well:
#write.csv(temp, paste0("insert_file_path", output_label, ".csv"))
}
}
Dataframes can be accessed through the lists by their list label:
## all dataframes:
output
## Overall Health x Age Group for Area 1:
output$overall_health_x_age_group_area1
## or
output[["overall_health_x_age_group_area1"]]