Referring to columns by name in for loop-CodePudding

I have a loop in R that loops over columns in a dataframe las_ref for those columns where the name matches a value in a vector las_names. As the original las_ref dataframe has rows with duplicated incidentid values, I am aiming to extract the most common value in each of the columns in las_names for each incidentid, produce a new column and use a join to produce a dataframe las_ref3 with the most common values in every row.

Structure of las_ref

incidentid incident var1
001             abc   45
002             abc   NA
002              NA   78
003             def   12
004             xyz   NA
004             xyz   10
004             abc   10

Intended structure of las_ref3

incidentid incident var1 incident-new var1-new
001             abc   45          abc       45
002             abc   NA          abc       78
002              NA   78          abc       78
003             def   12          def       12 
004             xyz   NA          xyz       10
004             xyz   10          xyz       10
004             abs   10          xyz       10

Below is the loop I am trying to use.

las_names <- c("incident","var1")
for(i in las_names) {
  las_ref2 <- las_ref %>%
    group_by(across(c(incidentid, paste(i)))) %>%
    tally() %>%
    filter(!is.na(paste(i))) %>%
    arrange(incidentid, desc(n)) %>%
    summarize(paste(i,"-new") = first(paste(i)))
  las_ref3 <- las_ref %>%
    left_join(select(las_ref2, incidentid, paste(i,"-new")), by = c("incidentid"))
}

Currently there seem to be two issues that I think revolve around the use of i to refer to the columns incident and var in the examples above.

The first is that the !is.na() function is not removing the NA values in the columns referred to by i, although this does not produce an error. The second, which does produce an error, is the summarise line. I get an unexpected '=' in error when this line is run.

Neither issue occurs when I run the code outside of the loop, specifying column names individually - the result comes out as expected. As the dataset is fairly large, I was hoping to use the loop to avoid doing each column individually.

CodePudding user response：

We could write your loop in the following way to make it work:

library(tidyverse)

las_ref <- tribble(~incidentid , ~incident, ~var1,
    "001",             "abc",   45,
    "002",             "abc",   NA,
    "002",              NA,     78,
    "003",             "def",   12,
    "004",             "xyz",   NA,
    "004",             "xyz",   10,
    "004",             "abc",   10
)

las_names <- c("incident","var1")

las_ref3 <- las_ref

for(i in las_names) {
  las_ref2 <- las_ref %>%
    group_by(across(c(incidentid, paste(i)))) %>%
    tally() %>%
    filter(!is.na(!! sym(paste(i)))) %>%
    arrange(incidentid, desc(n)) %>%
    summarize("{i}-new" := first(!! sym(paste(i))))
  las_ref3 <- las_ref3 %>%
    left_join(select(las_ref2, incidentid, !! sym(paste0(i,"-new"))), by = c("incidentid"))
}

las_ref3
#> # A tibble: 7 x 5
#>   incidentid incident  var1 `incident-new` `var1-new`
#>   <chr>      <chr>    <dbl> <chr>               <dbl>
#> 1 001        abc         45 abc                    45
#> 2 002        abc         NA abc                    78
#> 3 002        <NA>        78 abc                    78
#> 4 003        def         12 def                    12
#> 5 004        xyz         NA xyz                    10
#> 6 004        xyz         10 xyz                    10
#> 7 004        abc         10 xyz                    10

We can also use purrr::map_dfc within mutate together with purrr:set_names:

las_ref %>% 
  group_by(incidentid) %>% 
  mutate(map_dfc(set_names(las_names, paste0(las_names, "-new")),
              ~ count(cur_data_all(), "{.x}" := eval(sym(.x))) %>%
                arrange(desc(n)) %>%
                slice_head() %>%
                pull(eval(sym(.x)))
  )
  )

#> # A tibble: 7 x 5
#> # Groups:   incidentid [4]
#>   incidentid incident  var1 `incident-new` `var1-new`
#>   <chr>      <chr>    <dbl> <chr>               <dbl>
#> 1 001        abc         45 abc                    45
#> 2 002        abc         NA abc                    78
#> 3 002        <NA>        78 abc                    78
#> 4 003        def         12 def                    12
#> 5 004        xyz         NA xyz                    10
#> 6 004        xyz         10 xyz                    10
#> 7 004        abc         10 xyz                    10

I have a package on github, {dplyover}, which makes the approach above a bit simpler by using over and its .names argument:

library(dplyover) # https://github.com/TimTeaFan/dplyover

las_ref %>% 
  group_by(incidentid) %>% 
  mutate(over(las_names,
              ~ count(cur_data_all(), "{.x}" := eval(sym(.x))) %>%
                arrange(desc(n)) %>%
                slice_head() %>%
                pull(eval(sym(.x))),
              .names = "{x}-new")
  )

#> # A tibble: 7 x 5
#> # Groups:   incidentid [4]
#>   incidentid incident  var1 `incident-new` `var1-new`
#>   <chr>      <chr>    <dbl> <chr>               <dbl>
#> 1 001        abc         45 abc                    45
#> 2 002        abc         NA abc                    78
#> 3 002        <NA>        78 abc                    78
#> 4 003        def         12 def                    12
#> 5 004        xyz         NA xyz                    10
#> 6 004        xyz         10 xyz                    10
#> 7 004        abc         10 xyz                    10

^{Created on 2022-02-11 by the reprex package (v2.0.1)}

CodePudding user response：

A version that avoids manual loops:

library(tidyverse)

las_ref <- tribble(~incidentid , ~incident, ~var1,
                   "001",             "abc",   45,
                   "002",             "abc",   NA,
                   "002",              NA,     78,
                   "003",             "def",   12,
                   "004",             "xyz",   NA,
                   "004",             "xyz",   10,
                   "004",             "abc",   10
)

las_ref3 <- las_ref %>% 
  group_by(incidentid) %>% 
  mutate(
    across(c(incident, var1), ~names(sort(table(.x, useNA = 'no'), decreasing = T))[1], .names = '{.col}-new' )
  )

  incidentid incident  var1 incident-new var1-new
  <chr>      <chr>    <dbl> <chr>        <chr>   
1 001        abc         45 abc          45      
2 002        abc         NA abc          78      
3 002        NA          78 abc          78      
4 003        def         12 def          12      
5 004        xyz         NA xyz          10      
6 004        xyz         10 xyz          10      
7 004        abc         10 xyz          10