Referencing variable names in loops for dplyr

Time:02-02

I know this has been discussed already, but I can't find a solution that works for me. I have several binary (0/1) variables named "indic___1" to "indic___8" and one continuous variable, "measure".

I would like to compute summary statistics for "measure" within each group, so I wrote this code:

library(dplyr)
indic___1 <- c(0, 1, 0, 1, 0)
indic___2 <- c(1, 1, 0, 1, 1)
indic___3 <- c(0, 0, 1, 0, 0)
indic___4 <- c(1, 1, 0, 1, 0)
indic___5 <- c(0, 0, 0, 1, 1)
indic___6 <- c(0, 1, 1, 1, 0)
indic___7 <- c(1, 1, 0, 1, 1)
indic___8 <- c(0, 1, 1, 1, 0)
measure <- c(28, 15, 26, 42, 12)

dataset <- data.frame(indic___1, indic___2, indic___3, indic___4, indic___5, indic___6, indic___7, indic___8, measure)

for (i in 1:8) {
  variable <- paste0("indic___", i)
  print(variable)
  dataset %>% group_by(variable) %>% summarise(mean = mean(measure))
}

It returns an error:

Error in `group_by()`:
! Must group by variables found in `.data`.
x Column `variable` is not found.

Answer:
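
As for the error itself: group_by() looks for a column literally called variable, because dplyr verbs treat bare names as column names rather than as R variables that hold a name. A minimal sketch of a direct fix, assuming a dplyr version that provides the .data pronoun, and using only the first two indicators from the question to keep it short. The results list is an illustrative addition so that the summaries are kept instead of being discarded inside the loop:

```r
library(dplyr)

# First two indicators and the measure from the question.
dataset <- data.frame(
  indic___1 = c(0, 1, 0, 1, 0),
  indic___2 = c(1, 1, 0, 1, 1),
  measure   = c(28, 15, 26, 42, 12)
)

results <- list()  # hypothetical container, not part of the original code
for (i in 1:2) {
  variable <- paste0("indic___", i)
  # .data[[variable]] looks up the column whose name is stored in `variable`.
  results[[variable]] <- dataset %>%
    group_by(.data[[variable]]) %>%
    summarise(mean = mean(measure))
}

print(results)
```

Wrapping the name as across(all_of(variable)) inside group_by() is another documented way to select a column by a string.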

Putting data into long format makes this generally solvable without a loop. You didn’t specify what you wanted to do with the data inside the loop so I had to guess, but the general form of the solution would look as follows:

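library(dplyr)
library(tidyr)  # provides pivot_longer() and nest()

results = dataset |>
    # Stack the indicator columns into name/value pairs; keep only the number as the name
    pivot_longer(starts_with("indic___"), names_pattern = "indic___(.*)") |>
    group_by(name, value) |>
    summarize(mean = mean(measure), .groups = "drop")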

# # A tibble: 16 × 3
#    name  value  mean
#    <chr> <dbl> <dbl>
#  1 1         0  22
#  2 1         1  28.5
#  3 2         0  26
#  4 2         1  24.2
#  5 3         0  24.2
# …
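
To see what the long format looks like before any grouping happens, it can help to inspect the intermediate table. A small sketch, assuming only the first two indicators from the question (the name long is just an illustrative choice):

```r
library(dplyr)
library(tidyr)

# First two indicators and the measure from the question.
dataset <- data.frame(
  indic___1 = c(0, 1, 0, 1, 0),
  indic___2 = c(1, 1, 0, 1, 1),
  measure   = c(28, 15, 26, 42, 12)
)

# Each original row is repeated once per indicator column:
# `name` holds the indicator number, `value` holds its 0/1 entry.
long <- dataset |>
  pivot_longer(starts_with("indic___"), names_pattern = "indic___(.*)")

print(long)
```

Every measure value now appears once per indicator, which is why a single group_by(name, value) can replace the loop.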

If you want to separate the results from the individual names, you can use a combination of nest and pull (the .by argument of nest requires tidyr 1.3.0 or later):

results |>
    nest(data = c(value, mean), .by = name) |>
    pull(data)

# [[1]]
# # A tibble: 2 × 2
#   value  mean
#   <dbl> <dbl>
# 1     0  22
# 2     1  28.5
#
# [[2]]
# # A tibble: 2 × 2
#   value  mean
#   <dbl> <dbl>
# 1     0  26
# 2     1  24.2
# …

… but at this point I’d ask myself why I am using table manipulation in the first place. The following seems a lot easier:

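# Collect the indicator vectors from the current environment into one unnamed list
indices = unname(mget(ls(pattern = "^indic___")))
results = indices |>
    # Calls split(x = measure, f = <indicator>): partition measure by the 0/1 values
    lapply(split, x = measure) |>
    # Average each partition
    lapply(vapply, mean, numeric(1L))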

# [[1]]
#    0    1
# 22.0 28.5
#
# [[2]]
#     0     1
# 26.00 24.25
# …

Notably, in real code you shouldn’t need the first line since your data should not be in individual, numbered variables in the first place. The proper way to do this is to have the data in a joint list, as is done here. Also, note that I once again explicitly removed the unreadable indic___X names. You can of course retain them (just remove the unname call) but I don’t recommend it.
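
If the indicators already live in a data frame, as in the question's dataset, the same idea applies without mget, since a data frame is itself a list of columns. A minimal sketch, assuming only the first two indicators from the question; indicator_cols is an illustrative name, and the column names are retained here (wrapping the result in unname would drop them):

```r
# First two indicators and the measure from the question.
dataset <- data.frame(
  indic___1 = c(0, 1, 0, 1, 0),
  indic___2 = c(1, 1, 0, 1, 1),
  measure   = c(28, 15, 26, 42, 12)
)

# Select the indicator columns by name prefix.
indicator_cols <- dataset[startsWith(names(dataset), "indic___")]

# For each indicator, split measure by its 0/1 values and average each part.
results <- lapply(indicator_cols, function(f) {
  vapply(split(dataset$measure, f), mean, numeric(1L))
})

print(results)
```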
