Function or loop to repeat task-CodePudding

I'm trying to create a loop or a function for the dataset below.

For outcome1, outcome2, and outcome3 - I am trying to calculate the proportion by sex, age_group, income and education. In addition, I would like to create two separate outputs by the geography (area1 and area2).

Thanks in advance for the help!

library(tidyverse)
library(dplyr)

df <- data.frame (outcome1  = c("poor", "good", "excellent", "poor", "good", "poor", "poor", "excellent"),
                  outcome2 = c("good", "excellent", "excellent", "poor", "excellent", "poor", "excellent", "poor"),
                  outcome3 = c("poor", "poor", "excellent", "poor", "poor", "poor", "excellent", "good"),
                  sex = c("F", "M", "M", "F", "F", "M", "F", "M"),
                  age_group = c("50-54", "60-64", "80 ", "70-74", "40-44", "45-49", "60-64", "65-69"),
                  income = c("$<40,000", "$50,000-79,000", "$80,000-110,000", "$111,000 ", "$<40,000", "$<40,000", "$50,000-79,000", "$80,000-110,000"),
                  education = c("HS", "College", "Bachelors", "Masters", "HS", "College", "Bachelors", "Masters"),
                  geography= c("area1", "area2", "area1", "area2", "area2", "area1", "area2", "area1"))

                                
df_new <- df %>%
  group_by(outcome1, age_group) %>%
  summarise(count = n(),
            total = 8,
            proportion = count/total *100)

CodePudding user response：

It's possible without loops, perhaps most straightforwardly done through reshaping data, nesting each comparison and then unnesting out to the wider dataframe of all values:

library(tidyverse)

df |>
  pivot_longer(-starts_with("outcome"),
               names_to = "comparison",
               values_to = "category") |>
  nest(data = -comparison) |>
  mutate(props = map(data, function(df) {
    df |>
      pivot_longer(starts_with("outcome"),
                   names_to = "var",
                   values_to = "val") |>
      group_by(category, var, val) |>
      tally() |>
      mutate(prop = n / sum(n)) |>
      select(-n) |>
      pivot_wider(names_from = val, values_from = prop) |> 
      replace_na(list(good = 0, poor = 0, excellent = 0))
  })) |>
  select(-data) |> 
  unnest(props)

#> # A tibble: 57 × 6
#>    comparison category var       good  poor excellent
#>    <chr>      <chr>    <chr>    <dbl> <dbl>     <dbl>
#>  1 sex        F        outcome1  0.25  0.75      0   
#>  2 sex        F        outcome2  0.25  0.25      0.5 
#>  3 sex        F        outcome3  0     0.75      0.25
#>  4 sex        M        outcome1  0.25  0.25      0.5 
#>  5 sex        M        outcome2  0     0.5       0.5 
#>  6 sex        M        outcome3  0.25  0.5       0.25
#>  7 age_group  40-44    outcome1  1     0         0   
#>  8 age_group  40-44    outcome2  0     0         1   
#>  9 age_group  40-44    outcome3  0     1         0   
#> 10 age_group  45-49    outcome1  0     1         0   
#> # … with 47 more rows

This creates a dataframe of all comparisons, grouped by the categories within that comparison, and presenting the proportions of participants who recorded 'good', 'poor' or 'excellent' for each of the three outcomes.

The code may be a little hard to follow though! Essentially the steps are:

pivot comparison variables and values into columns and make into separate sub-dataframes for each of the independent variables (sex, age group, education, income, geography)
on each dataframe, pivot again to have outcome variables in a column and values in another
on each dataframe, do the calculations you've noted above for each of the three outcomes
pivot sub-dataframes back to have good/poor/excellent columns showing proportions (and replace NA values)
unnest into one long dataframe, though this last part might not be necessary if you'd rather use each of these sub-dataframes separately.

CodePudding user response：

df %>%
  pivot_longer(1:3, names_to = 'outcome', values_to = 'out_vals')%>%
  pivot_longer(-c(outcome, out_vals))%>%
  group_by(across(everything()))%>%
  summarise(prop = n() / 8 * 100, .groups = 'drop') %>%
  arrange(outcome, name, out_vals)

# A tibble: 90 × 5
   outcome  out_vals  name      value      prop
   <chr>    <chr>     <chr>     <chr>     <dbl>
 1 outcome1 excellent age_group 65-69      12.5
 2 outcome1 excellent age_group 80         12.5
 3 outcome1 good      age_group 40-44      12.5
 4 outcome1 good      age_group 60-64      12.5
 5 outcome1 poor      age_group 45-49      12.5
 6 outcome1 poor      age_group 50-54      12.5
 7 outcome1 poor      age_group 60-64      12.5
 8 outcome1 poor      age_group 70-74      12.5
 9 outcome1 excellent education Bachelors  12.5
10 outcome1 excellent education Masters    12.5

CodePudding user response：

Please try the below code

# split the dataframe by geography which create a list df2 with 2 lists
df2 <- df %>% group_split(geography) 

# get the outcome names into a vector
nam <- names(df)[which(str_detect(names(df),'outcome') )]

# create a dummy blank list which we can use later
dummy_list = list()

# pass the data into for loop 
for (i in 1:length(df2)){
  for (j in nam){
  df3 <- df2[[i]] %>% group_by(across(j), age_group, geography) %>% 
         summarise(count = n(), total = 8, proportion = count/total *100) %>% 
         mutate(group=j) %>% 
         rename(outcome=j)
  vec <- paste0(i,j)
  print(vec)
  dummy_list[[vec]] <- df3
}
}

# use the rbind to combind all the rows into a dataframe
fin <- do.call(rbind, dummy_list)

^{Created on 2023-02-02 with reprex v2.0.2}

# A tibble: 23 × 7
# Groups:   outcome, age_group [14]
   outcome   age_group geography count total proportion group   
   <chr>     <chr>     <chr>     <int> <dbl>      <dbl> <chr>   
 1 excellent 65-69     area1         1     8       12.5 outcome1
 2 excellent 80        area1         1     8       12.5 outcome1
 3 poor      45-49     area1         1     8       12.5 outcome1
 4 poor      50-54     area1         1     8       12.5 outcome1
 5 excellent 80        area1         1     8       12.5 outcome2
 6 good      50-54     area1         1     8       12.5 outcome2
 7 poor      45-49     area1         1     8       12.5 outcome2
 8 poor      65-69     area1         1     8       12.5 outcome2
 9 excellent 80        area1         1     8       12.5 outcome3
10 good      65-69     area1         1     8       12.5 outcome3
# … with 13 more rows
# ℹ Use `print(n = ...)` to see more rows