I have a loop in R that loops over columns in a dataframe las_ref
for those columns where the name matches a value in a vector las_names
.
As the original las_ref
dataframe has rows with duplicated incidentid
values, I am aiming to extract the most common value in each of the columns in las_names
for each incidentid
, produce a new column and use a join to produce a dataframe las_ref3
with the most common values in every row.
Structure of las_ref
incidentid incident var1
001 abc 45
002 abc NA
002 NA 78
003 def 12
004 xyz NA
004 xyz 10
004 abc 10
Intended structure of las_ref3
incidentid incident var1 incident-new var1-new
001 abc 45 abc 45
002 abc NA abc 78
002 NA 78 abc 78
003 def 12 def 12
004 xyz NA xyz 10
004 xyz 10 xyz 10
004 abs 10 xyz 10
Below is the loop I am trying to use.
las_names <- c("incident","var1")
for(i in las_names) {
las_ref2 <- las_ref %>%
group_by(across(c(incidentid, paste(i)))) %>%
tally() %>%
filter(!is.na(paste(i))) %>%
arrange(incidentid, desc(n)) %>%
summarize(paste(i,"-new") = first(paste(i)))
las_ref3 <- las_ref %>%
left_join(select(las_ref2, incidentid, paste(i,"-new")), by = c("incidentid"))
}
Currently there seem to be two issues that I think revolve around the use of i
to refer to the columns incident
and var
in the examples above.
The first is that the !is.na()
function is not removing the NA values in the columns referred to by i
, although this does not produce an error.
The second, which does produce an error, is the summarise line. I get an unexpected '=' in
error when this line is run.
Neither issue occurs when I run the code outside of the loop, specifying column names individually - the result comes out as expected. As the dataset is fairly large, I was hoping to use the loop to avoid doing each column individually.
CodePudding user response:
We could write your loop in the following way to make it work:
library(tidyverse)
las_ref <- tribble(~incidentid , ~incident, ~var1,
"001", "abc", 45,
"002", "abc", NA,
"002", NA, 78,
"003", "def", 12,
"004", "xyz", NA,
"004", "xyz", 10,
"004", "abc", 10
)
las_names <- c("incident","var1")
las_ref3 <- las_ref
for(i in las_names) {
las_ref2 <- las_ref %>%
group_by(across(c(incidentid, paste(i)))) %>%
tally() %>%
filter(!is.na(!! sym(paste(i)))) %>%
arrange(incidentid, desc(n)) %>%
summarize("{i}-new" := first(!! sym(paste(i))))
las_ref3 <- las_ref3 %>%
left_join(select(las_ref2, incidentid, !! sym(paste0(i,"-new"))), by = c("incidentid"))
}
las_ref3
#> # A tibble: 7 x 5
#> incidentid incident var1 `incident-new` `var1-new`
#> <chr> <chr> <dbl> <chr> <dbl>
#> 1 001 abc 45 abc 45
#> 2 002 abc NA abc 78
#> 3 002 <NA> 78 abc 78
#> 4 003 def 12 def 12
#> 5 004 xyz NA xyz 10
#> 6 004 xyz 10 xyz 10
#> 7 004 abc 10 xyz 10
We can also use purrr::map_dfc
within mutate
together with purrr:set_names
:
las_ref %>%
group_by(incidentid) %>%
mutate(map_dfc(set_names(las_names, paste0(las_names, "-new")),
~ count(cur_data_all(), "{.x}" := eval(sym(.x))) %>%
arrange(desc(n)) %>%
slice_head() %>%
pull(eval(sym(.x)))
)
)
#> # A tibble: 7 x 5
#> # Groups: incidentid [4]
#> incidentid incident var1 `incident-new` `var1-new`
#> <chr> <chr> <dbl> <chr> <dbl>
#> 1 001 abc 45 abc 45
#> 2 002 abc NA abc 78
#> 3 002 <NA> 78 abc 78
#> 4 003 def 12 def 12
#> 5 004 xyz NA xyz 10
#> 6 004 xyz 10 xyz 10
#> 7 004 abc 10 xyz 10
I have a package on github, {dplyover}, which makes the approach above a bit simpler by using over
and its .names
argument:
library(dplyover) # https://github.com/TimTeaFan/dplyover
las_ref %>%
group_by(incidentid) %>%
mutate(over(las_names,
~ count(cur_data_all(), "{.x}" := eval(sym(.x))) %>%
arrange(desc(n)) %>%
slice_head() %>%
pull(eval(sym(.x))),
.names = "{x}-new")
)
#> # A tibble: 7 x 5
#> # Groups: incidentid [4]
#> incidentid incident var1 `incident-new` `var1-new`
#> <chr> <chr> <dbl> <chr> <dbl>
#> 1 001 abc 45 abc 45
#> 2 002 abc NA abc 78
#> 3 002 <NA> 78 abc 78
#> 4 003 def 12 def 12
#> 5 004 xyz NA xyz 10
#> 6 004 xyz 10 xyz 10
#> 7 004 abc 10 xyz 10
Created on 2022-02-11 by the reprex package (v2.0.1)
CodePudding user response:
A version that avoids manual loops:
library(tidyverse)
las_ref <- tribble(~incidentid , ~incident, ~var1,
"001", "abc", 45,
"002", "abc", NA,
"002", NA, 78,
"003", "def", 12,
"004", "xyz", NA,
"004", "xyz", 10,
"004", "abc", 10
)
las_ref3 <- las_ref %>%
group_by(incidentid) %>%
mutate(
across(c(incident, var1), ~names(sort(table(.x, useNA = 'no'), decreasing = T))[1], .names = '{.col}-new' )
)
incidentid incident var1 incident-new var1-new
<chr> <chr> <dbl> <chr> <chr>
1 001 abc 45 abc 45
2 002 abc NA abc 78
3 002 NA 78 abc 78
4 003 def 12 def 12
5 004 xyz NA xyz 10
6 004 xyz 10 xyz 10
7 004 abc 10 xyz 10