Using ddply in combination with weighted.mean in a for loop with dynamic variables

My dataset looks like this:

structure(list(GEOLEV2 = structure(c("768001001", "768001001", 
"768001002", "768001002", "768001006", "768001006", "768001002", 
"768001002", "768001002", "768001002", "768002016", "768002016"
), format.stata = "%9s"), DHSYEAR = structure(c(1988, 1988, 1988, 
1988, 1998, 1998, 1998, 1998, 2013, 2013, 2013, 2013), format.stata = "%9.0g"), 
    v005 = structure(c(1e+06, 1e+06, 1e+06, 1e+06, 1815025, 1815025, 
    1517492, 1517492, 1350366, 1350366, 617033, 617033), format.stata = "%9.0g"), 
    age = structure(c(37, 22, 18, 46, 15, 29, 18, 42, 19, 15, 
    35, 16), format.stata = "%9.0g"), highest_year_edu = structure(c(2, 
    6, NA, NA, 5, NA, 2, 3, 2, NA, 5, 3), format.stata = "%9.0g")), row.names = c(NA, 
-12L), class = c("tbl_df", "tbl", "data.frame"), label = "Written by R")

I want to collapse it by df1$GEOLEV2 and df1$DHSYEAR, with weighted.mean as the collapsing function. Each variable should keep its original name.

I chose the function ddply and when I try it on a single variable, it works:

ddply(df1, ~ df1$GEOLEV2 + df1$DHSYEAR, summarise, age = weighted.mean(age, v005, na.rm = TRUE))

However, when I build the loop, the function returns an error. My attempt was:

df1_collapsed <- ddply(df1, ~ df1$GEOLEV2 + df1$DHSYEAR, summarise, age = weighted.mean(age, v005, na.rm = TRUE))

for (i in names(df1[4:5])) {
  variable <- ddply(df1, ~ df1$GEOLEV2 + df1$DHSYEAR, summarise, i = weighted.mean(i, v005, na.rm = TRUE))
  df1_collapsed <- left_join(df1_collapsed, variable, by = c("df1$GEOLEV2", "df1$DHSYEAR"))
}

and the error is

Error in weighted.mean.default(i, v005, na.rm = TRUE) : 
  'x' and 'w' must have the same length

How can I build the for loop, embedding the variable name in the loop?

User response:

In general, you don't need loops in R for grouping and summarising (which you would call collapsing in Stata). dplyr handles this type of operation directly:

library(dplyr)

# Weighted mean of every column from age to highest_year_edu, per group
df1 %>% 
    group_by(GEOLEV2, DHSYEAR) %>% 
    summarise(
        across(age:highest_year_edu, ~ weighted.mean(.x, v005, na.rm = TRUE))
    )


# A tibble: 6 x 4
# Groups:   GEOLEV2 [4]
#   GEOLEV2   DHSYEAR   age highest_year_edu
#   <chr>       <dbl> <dbl>            <dbl>
# 1 768001001    1988  29.5              4
# 2 768001002    1988  32              NaN
# 3 768001002    1998  30                2.5
# 4 768001002    2013  17                2
# 5 768001006    1998  22                5
# 6 768002016    2013  25.5              4
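
As for the loop itself: inside summarise, i is the character string "age" (length 1), not the age column, so weighted.mean gets an x of length 1 and a w as long as the group, hence the error. Writing i = ... would also name the result column literally "i". If you do want to keep a loop, or have the variable names in a character vector, something along these lines should work. This is a rough sketch, not tested beyond the sample data: it assumes a reasonably recent dplyr (for .data[[i]], the "{i}" := naming syntax and all_of()), rebuilds df1 from the question's values without the Stata attributes, and vars, df1_collapsed and df1_across are just names picked for illustration.

```r
library(dplyr)

# Same values as the dput() output in the question, minus the Stata attributes
df1 <- tibble(
  GEOLEV2 = c("768001001", "768001001", "768001002", "768001002",
              "768001006", "768001006", "768001002", "768001002",
              "768001002", "768001002", "768002016", "768002016"),
  DHSYEAR = c(1988, 1988, 1988, 1988, 1998, 1998, 1998, 1998,
              2013, 2013, 2013, 2013),
  v005 = c(1e+06, 1e+06, 1e+06, 1e+06, 1815025, 1815025,
           1517492, 1517492, 1350366, 1350366, 617033, 617033),
  age = c(37, 22, 18, 46, 15, 29, 18, 42, 19, 15, 35, 16),
  highest_year_edu = c(2, 6, NA, NA, 5, NA, 2, 3, 2, NA, 5, 3)
)

# Columns to collapse, held as strings (hypothetical helper vector)
vars <- c("age", "highest_year_edu")

# Option 1: keep the loop. .data[[i]] looks the column up by its name,
# and "{i}" := ... names the result after the variable instead of "i".
df1_collapsed <- distinct(df1, GEOLEV2, DHSYEAR)
for (i in vars) {
  variable <- df1 %>%
    group_by(GEOLEV2, DHSYEAR) %>%
    summarise("{i}" := weighted.mean(.data[[i]], v005, na.rm = TRUE),
              .groups = "drop")
  df1_collapsed <- left_join(df1_collapsed, variable,
                             by = c("GEOLEV2", "DHSYEAR"))
}

# Option 2: no loop, hand the same character vector to across()
df1_across <- df1 %>%
  group_by(GEOLEV2, DHSYEAR) %>%
  summarise(across(all_of(vars), ~ weighted.mean(.x, v005, na.rm = TRUE)),
            .groups = "drop")
```

Both objects should hold the same six group rows as the output above. The loop version keeps the groups in order of first appearance rather than sorted, so use arrange(df1_collapsed, GEOLEV2, DHSYEAR) if you want to compare them side by side.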