I am working with a large dataset where much of the data was entered twice. This means that many of the variables are represented by pairs of columns: column.1
with the data entered by one person, and column.2
where the same data was entered by a different person. I want to create a "master" column called simply column
that first draws from column.1
and then, if column.1
is NA
, draws from column.2
.
Here is an example of what I am trying to do with made-up data:
mydata <- data.frame(name = c("Sarah","Ella","Carmen","Dinah","Billie"),
cheese.1 = c(1,4,NA,6,NA),
cheese.2 = c(1,4,3,5,NA),
milk.1 = c(NA,2,0,4,NA),
milk.2 = c(1,2,1,4,2),
tofu.1 = c("yum","yum",NA,"gross", NA),
tofu.2 = c("gross", "yum", "yum", NA, "gross"))
For example, the code below shows an example of what I want to do for a single pair of columns.
mydata %>% mutate(cheese = ifelse(is.na(cheese.1), cheese.2, cheese.1))
#OUTPUT:
name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 cheese
1 Sarah 1 1 NA 1 yum gross 1
2 Ella 4 4 2 2 yum yum 4
3 Carmen NA 3 0 1 <NA> yum 3
4 Dinah 6 5 4 4 gross <NA> 6
5 Billie NA NA NA 2 <NA> gross NA
However, I want to automate the process rather than doing each manually. Below is my attempt at automating the process, using a list (col.list
) of the column pairs for which I want to create new "master" columns:
col.list = c("cheese","milk","tofu")
lapply(col.list, FUN = function(x) {
v <- as.name({{x}})
v.1 <- as.name(paste0({{x}}, ".1"))
v.2 <- as.name(paste0(({{x}}), ".2"))
mydata %>% mutate(v = ifelse(is.na({{v.1}}), {{v.2}}, {{v.1}}))
})
#OUTPUT:
[[1]]
name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 v
1 Sarah 1 1 NA 1 yum gross 1
2 Ella 4 4 2 2 yum yum 4
3 Carmen NA 3 0 1 <NA> yum 3
4 Dinah 6 5 4 4 gross <NA> 6
5 Billie NA NA NA 2 <NA> gross NA
[[2]]
name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 v
1 Sarah 1 1 NA 1 yum gross 1
2 Ella 4 4 2 2 yum yum 2
3 Carmen NA 3 0 1 <NA> yum 0
4 Dinah 6 5 4 4 gross <NA> 4
5 Billie NA NA NA 2 <NA> gross 2
[[3]]
name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 v
1 Sarah 1 1 NA 1 yum gross yum
2 Ella 4 4 2 2 yum yum yum
3 Carmen NA 3 0 1 <NA> yum yum
4 Dinah 6 5 4 4 gross <NA> gross
5 Billie NA NA NA 2 <NA> gross gross
The problems with this attempt are:
- the new columns are not correctly named (they should be named
cheese
,milk
andtofu
rather than all be calledv
) - the new columns are not added to the original data frame. What I want is for the program to add a series of new "master" columns to my dataframe (one new column for each pair of columns identified in
col.list
).
Thanks so much for any help!
CodePudding user response:
(1) You have to wrap v
into the curly-curly operator and use :=
:
library(dplyr)
col.list <- c("cheese","milk","tofu")
lapply(col.list, FUN = function(x) {
v <- as.name({{x}})
v.1 <- as.name(paste0({{x}}, ".1"))
v.2 <- as.name(paste0(({{x}}), ".2"))
mydata %>% mutate({{ v }} = ifelse(is.na({{v.1}}), {{v.2}}, {{v.1}}))
})
returns
[[1]]
name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 cheese
1 Sarah 1 1 NA 1 yum gross 1
2 Ella 4 4 2 2 yum yum 4
3 Carmen NA 3 0 1 <NA> yum 3
4 Dinah 6 5 4 4 gross <NA> 6
5 Billie NA NA NA 2 <NA> gross NA
[...]
which is one step closer to your desired output.
(2) But to get your desired output, I suggest using purrr
:
library(purrr)
library(dplyr)
col.list %>%
map(~mydata %>%
select(name, starts_with(.x)) %>%
mutate({{ .x }} := ifelse(
is.na(!!sym(paste0(.x, ".1"))),
!!sym(paste0(.x, ".2")),
!!sym(paste0(.x, ".1"))
)
)
) %>%
reduce(left_join, by = "name")
This returns
name cheese.1 cheese.2 cheese milk.1 milk.2 milk tofu.1 tofu.2 tofu
1 Sarah 1 1 1 NA 1 1 yum gross yum
2 Ella 4 4 4 2 2 2 yum yum yum
3 Carmen NA 3 3 0 1 0 <NA> yum yum
4 Dinah 6 5 6 4 4 4 gross <NA> gross
5 Billie NA NA NA NA 2 2 <NA> gross gross
CodePudding user response:
Here is one way I would do it. First convert to long format then reshape back to wide format but having only 2 value columns 1
and 2
library(dplyr)
library(tidyr)
mydata <- data.frame(name = c("Sarah","Ella","Carmen","Dinah","Billie"),
cheese.1 = c(1,4,NA,6,NA),
cheese.2 = c(1,4,3,5,NA),
milk.1 = c(NA,2,0,4,NA),
milk.2 = c(1,2,1,4,2),
tofu.1 = c("yum","yum",NA,"gross", NA),
tofu.2 = c("gross", "yum", "yum", NA, "gross"))
mydata_long <- mydata %>%
mutate(across(where(is.numeric), as.character)) %>%
pivot_longer(-name,
names_to = c("food", "nr"),
names_sep = "\\.")
mydata_long
#> # A tibble: 30 x 4
#> name food nr value
#> <chr> <chr> <chr> <chr>
#> 1 Sarah cheese 1 1
#> 2 Sarah cheese 2 1
#> 3 Sarah milk 1 <NA>
#> 4 Sarah milk 2 1
#> 5 Sarah tofu 1 yum
#> 6 Sarah tofu 2 gross
#> 7 Ella cheese 1 4
#> 8 Ella cheese 2 4
#> 9 Ella milk 1 2
#> 10 Ella milk 2 2
#> # ... with 20 more rows
Apply ifelse()
function after transforming back to different wide format
mydata_wide <- mydata_long %>%
pivot_wider(names_from = nr,
values_from = value) %>%
mutate(final_val = ifelse(is.na(`1`), `2`, `1`)) %>%
arrange(food)
mydata_wide
#> # A tibble: 15 x 5
#> name food `1` `2` final_val
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Sarah cheese 1 1 1
#> 2 Ella cheese 4 4 4
#> 3 Carmen cheese <NA> 3 3
#> 4 Dinah cheese 6 5 6
#> 5 Billie cheese <NA> <NA> <NA>
#> 6 Sarah milk <NA> 1 1
#> 7 Ella milk 2 2 2
#> 8 Carmen milk 0 1 0
#> 9 Dinah milk 4 4 4
#> 10 Billie milk <NA> 2 2
#> 11 Sarah tofu yum gross yum
#> 12 Ella tofu yum yum yum
#> 13 Carmen tofu <NA> yum yum
#> 14 Dinah tofu gross <NA> gross
#> 15 Billie tofu <NA> gross gross
mydata_wide2 <- mydata_wide %>%
pivot_wider(-c(`1`, `2`),
names_from = food,
values_from = final_val)
mydata_wide2
#> # A tibble: 5 x 4
#> name cheese milk tofu
#> <chr> <chr> <chr> <chr>
#> 1 Sarah 1 1 yum
#> 2 Ella 4 2 yum
#> 3 Carmen 3 0 yum
#> 4 Dinah 6 4 gross
#> 5 Billie <NA> 2 gross
Created on 2021-10-29 by the reprex package (v2.0.1)
CodePudding user response:
I would use purrr::map_dfc
and coalesce
here. Looks pretty straightforward.
library(purrr)
library(dplyr)
library(stringr)
mydata %>% mutate(map2_dfc(select(., ends_with('1')),
select(., ends_with('2')),
~coalesce(.x, .y)))%>%
select(-ends_with('2'))%>%
rename_with(~str_remove(.x, '\\.\\d $'))
name cheese milk tofu
1 Sarah 1 1 yum
2 Ella 4 2 yum
3 Carmen 3 0 yum
4 Dinah 6 4 gross
5 Billie NA 2 gross
CodePudding user response:
Another tidyverse option. Advantage here is that it keeps the original data type and doesn‘t convert everything to character values.
library(tidyverse)
mydata %>%
pivot_longer(cols = -name,
names_pattern = '(.*)(\\..)',
names_to = c('.value', 'number')) %>%
group_by(name) %>%
mutate(across(-number, ~if_else(is.na(.[1]), .[2], .[1]))) %>%
ungroup() %>%
filter(number == '.1') %>%
select(-number)
Which gives
# A tibble: 5 x 4
name cheese milk tofu
<chr> <dbl> <dbl> <chr>
1 Sarah 1 1 yum
2 Ella 4 2 yum
3 Carmen 3 0 yum
4 Dinah 6 4 gross
5 Billie NA 2 gross
Alternative solution with coalesce
:
mydata %>%
pivot_longer(cols = -name,
names_pattern = '(.*)(\\..)',
names_to = c('.value', 'number')) %>%
group_by(name) %>%
mutate(across(-number, ~coalesce(.[1], .[2]))) %>%
ungroup() %>%
filter(number == '.1') %>%
select(-number)
CodePudding user response:
Here is how you can achieve your task:
- define your pairs (in case you have hundreds of columns, this could be automated.
- use
imap_dfc
to applycoalesce
do the defined pairs - bind to original dataframe
library(dplyr)
library(purrr)
pairs <- list(cheese = c(2, 3), milk = c(4, 5), tofu = c(6, 7))
imap_dfc(pairs, ~mydata[, .x] %>% transmute(!!.y := coalesce(!!!syms(names(mydata)[.x])))) %>%
bind_cols(mydata)
cheese milk tofu name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2
1 1 1 yum Sarah 1 1 NA 1 yum gross
2 4 2 yum Ella 4 4 2 2 yum yum
3 3 0 yum Carmen NA 3 0 1 <NA> yum
4 6 4 gross Dinah 6 5 4 4 gross <NA>
5 NA 2 gross Billie NA NA NA 2 <NA> gross
CodePudding user response:
Here is a pretty simple and dynamic option. Since it uses tidyselect, if there are more than just two columns (eg cheese.1
, cheese.2
, and cheese.3
) this will still work. This will also work if the columns are unbalanced (eg 3 cheese columns, but only 2 milk columns):
library(purrr)
library(stringr)
library(rlang)
str_exp <- map(set_names(col.list), ~ str_glue("exec(coalesce, !!!across(starts_with(\"{.x}\")))"))
express <- map(str_exp, parse_expr)
mydata %>%
mutate(!!! express,
.keep = "unused")
Output
The other columns were removed by .keep = "unused"
. If you want to keep all the columns then delete that argument.
name cheese milk tofu
1 Sarah 1 1 yum
2 Ella 4 2 yum
3 Carmen 3 0 yum
4 Dinah 6 4 gross
5 Billie NA 2 gross
How it works
- The use of
map
andset_names
is important because this creates a named list, which is important for the big-bang!!!
operator later. The firstmap
creates a named list of string representations of expressions. - The use of
exec
andcoalesce
allows the dynamic tidy-selection of columns. - The second
map
turns each of these elements into an expression. As mentioned in the comments by @MartinGal you could combine this into onemap
with a pipe, but I broke it out so it was more readable. - The
!!!
operator force-splices the list of objects and the names for the columns are from the list names set up usingmap
andset_names
.