I'm trying to create a loop or a function for the dataset below.
For outcome1, outcome2, and outcome3 - I am trying to calculate the proportion by sex, age_group, income and education. In addition, I would like to create two separate outputs by the geography (area1 and area2).
Thanks in advance for the help!
library(tidyverse)
library(dplyr)
df <- data.frame (outcome1 = c("poor", "good", "excellent", "poor", "good", "poor", "poor", "excellent"),
outcome2 = c("good", "excellent", "excellent", "poor", "excellent", "poor", "excellent", "poor"),
outcome3 = c("poor", "poor", "excellent", "poor", "poor", "poor", "excellent", "good"),
sex = c("F", "M", "M", "F", "F", "M", "F", "M"),
age_group = c("50-54", "60-64", "80 ", "70-74", "40-44", "45-49", "60-64", "65-69"),
income = c("$<40,000", "$50,000-79,000", "$80,000-110,000", "$111,000 ", "$<40,000", "$<40,000", "$50,000-79,000", "$80,000-110,000"),
education = c("HS", "College", "Bachelors", "Masters", "HS", "College", "Bachelors", "Masters"),
geography= c("area1", "area2", "area1", "area2", "area2", "area1", "area2", "area1"))
df_new <- df %>%
group_by(outcome1, age_group) %>%
summarise(count = n(),
total = 8,
proportion = count/total *100)
CodePudding user response:
It's possible without loops, perhaps most straightforwardly done through reshaping data, nesting each comparison and then unnesting out to the wider dataframe of all values:
library(tidyverse)
df |>
pivot_longer(-starts_with("outcome"),
names_to = "comparison",
values_to = "category") |>
nest(data = -comparison) |>
mutate(props = map(data, function(df) {
df |>
pivot_longer(starts_with("outcome"),
names_to = "var",
values_to = "val") |>
group_by(category, var, val) |>
tally() |>
mutate(prop = n / sum(n)) |>
select(-n) |>
pivot_wider(names_from = val, values_from = prop) |>
replace_na(list(good = 0, poor = 0, excellent = 0))
})) |>
select(-data) |>
unnest(props)
#> # A tibble: 57 × 6
#> comparison category var good poor excellent
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 sex F outcome1 0.25 0.75 0
#> 2 sex F outcome2 0.25 0.25 0.5
#> 3 sex F outcome3 0 0.75 0.25
#> 4 sex M outcome1 0.25 0.25 0.5
#> 5 sex M outcome2 0 0.5 0.5
#> 6 sex M outcome3 0.25 0.5 0.25
#> 7 age_group 40-44 outcome1 1 0 0
#> 8 age_group 40-44 outcome2 0 0 1
#> 9 age_group 40-44 outcome3 0 1 0
#> 10 age_group 45-49 outcome1 0 1 0
#> # … with 47 more rows
This creates a dataframe of all comparisons, grouped by the categories within that comparison, and presenting the proportions of participants who recorded 'good', 'poor' or 'excellent' for each of the three outcomes.
The code may be a little hard to follow though! Essentially the steps are:
- pivot comparison variables and values into columns and make into separate sub-dataframes for each of the independent variables (sex, age group, education, income, geography)
- on each dataframe, pivot again to have outcome variables in a column and values in another
- on each dataframe, do the calculations you've noted above for each of the three outcomes
- pivot sub-dataframes back to have good/poor/excellent columns showing proportions (and replace NA values)
- unnest into one long dataframe, though this last part might not be necessary if you'd rather use each of these sub-dataframes separately.
CodePudding user response:
df %>%
pivot_longer(1:3, names_to = 'outcome', values_to = 'out_vals')%>%
pivot_longer(-c(outcome, out_vals))%>%
group_by(across(everything()))%>%
summarise(prop = n() / 8 * 100, .groups = 'drop') %>%
arrange(outcome, name, out_vals)
# A tibble: 90 × 5
outcome out_vals name value prop
<chr> <chr> <chr> <chr> <dbl>
1 outcome1 excellent age_group 65-69 12.5
2 outcome1 excellent age_group 80 12.5
3 outcome1 good age_group 40-44 12.5
4 outcome1 good age_group 60-64 12.5
5 outcome1 poor age_group 45-49 12.5
6 outcome1 poor age_group 50-54 12.5
7 outcome1 poor age_group 60-64 12.5
8 outcome1 poor age_group 70-74 12.5
9 outcome1 excellent education Bachelors 12.5
10 outcome1 excellent education Masters 12.5
CodePudding user response:
Please try the below code
# split the dataframe by geography which create a list df2 with 2 lists
df2 <- df %>% group_split(geography)
# get the outcome names into a vector
nam <- names(df)[which(str_detect(names(df),'outcome') )]
# create a dummy blank list which we can use later
dummy_list = list()
# pass the data into for loop
for (i in 1:length(df2)){
for (j in nam){
df3 <- df2[[i]] %>% group_by(across(j), age_group, geography) %>%
summarise(count = n(), total = 8, proportion = count/total *100) %>%
mutate(group=j) %>%
rename(outcome=j)
vec <- paste0(i,j)
print(vec)
dummy_list[[vec]] <- df3
}
}
# use the rbind to combind all the rows into a dataframe
fin <- do.call(rbind, dummy_list)
Created on 2023-02-02 with reprex v2.0.2
# A tibble: 23 × 7
# Groups: outcome, age_group [14]
outcome age_group geography count total proportion group
<chr> <chr> <chr> <int> <dbl> <dbl> <chr>
1 excellent 65-69 area1 1 8 12.5 outcome1
2 excellent 80 area1 1 8 12.5 outcome1
3 poor 45-49 area1 1 8 12.5 outcome1
4 poor 50-54 area1 1 8 12.5 outcome1
5 excellent 80 area1 1 8 12.5 outcome2
6 good 50-54 area1 1 8 12.5 outcome2
7 poor 45-49 area1 1 8 12.5 outcome2
8 poor 65-69 area1 1 8 12.5 outcome2
9 excellent 80 area1 1 8 12.5 outcome3
10 good 65-69 area1 1 8 12.5 outcome3
# … with 13 more rows
# ℹ Use `print(n = ...)` to see more rows