Home > Mobile >  R: How can I apply a custom function that adds new columns to a dataframe to a subset of existing co
R: How can I apply a custom function that adds new columns to a dataframe to a subset of existing co

Time:10-30

I am working with a large dataset where much of the data was entered twice. This means that many of the variables are represented by pairs of columns: column.1 with the data entered by one person, and column.2 where the same data was entered by a different person. I want to create a "master" column called simply column that first draws from column.1 and then, if column.1 is NA, draws from column.2.

Here is an example of what I am trying to do with made-up data:

mydata <- data.frame(name = c("Sarah","Ella","Carmen","Dinah","Billie"),
                     cheese.1 = c(1,4,NA,6,NA),
                     cheese.2 = c(1,4,3,5,NA),
                     milk.1 = c(NA,2,0,4,NA),
                     milk.2 = c(1,2,1,4,2),
                     tofu.1 = c("yum","yum",NA,"gross", NA),
                     tofu.2 = c("gross", "yum", "yum", NA, "gross"))

For example, the code below shows an example of what I want to do for a single pair of columns.

mydata %>% mutate(cheese = ifelse(is.na(cheese.1), cheese.2, cheese.1))

#OUTPUT:

    name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 cheese
1  Sarah        1        1     NA      1    yum  gross      1
2   Ella        4        4      2      2    yum    yum      4
3 Carmen       NA        3      0      1   <NA>    yum      3
4  Dinah        6        5      4      4  gross   <NA>      6
5 Billie       NA       NA     NA      2   <NA>  gross     NA

However, I want to automate the process rather than doing each manually. Below is my attempt at automating the process, using a list (col.list) of the column pairs for which I want to create new "master" columns:

col.list = c("cheese","milk","tofu")

lapply(col.list, FUN = function(x) {
  v <- as.name({{x}})
  v.1 <- as.name(paste0({{x}}, ".1"))
  v.2 <- as.name(paste0(({{x}}), ".2"))
  mydata %>% mutate(v = ifelse(is.na({{v.1}}), {{v.2}}, {{v.1}}))
})

#OUTPUT:

[[1]]
    name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2  v
1  Sarah        1        1     NA      1    yum  gross  1
2   Ella        4        4      2      2    yum    yum  4
3 Carmen       NA        3      0      1   <NA>    yum  3
4  Dinah        6        5      4      4  gross   <NA>  6
5 Billie       NA       NA     NA      2   <NA>  gross NA

[[2]]
    name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 v
1  Sarah        1        1     NA      1    yum  gross 1
2   Ella        4        4      2      2    yum    yum 2
3 Carmen       NA        3      0      1   <NA>    yum 0
4  Dinah        6        5      4      4  gross   <NA> 4
5 Billie       NA       NA     NA      2   <NA>  gross 2

[[3]]
    name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2     v
1  Sarah        1        1     NA      1    yum  gross   yum
2   Ella        4        4      2      2    yum    yum   yum
3 Carmen       NA        3      0      1   <NA>    yum   yum
4  Dinah        6        5      4      4  gross   <NA> gross
5 Billie       NA       NA     NA      2   <NA>  gross gross

The problems with this attempt are:

  1. the new columns are not correctly named (they should be named cheese, milk and tofu rather than all be called v)
  2. the new columns are not added to the original data frame. What I want is for the program to add a series of new "master" columns to my dataframe (one new column for each pair of columns identified in col.list).

Thanks so much for any help!

CodePudding user response:

(1) You have to wrap v into the curly-curly operator and use :=:

library(dplyr)

col.list <- c("cheese","milk","tofu")

lapply(col.list, FUN = function(x) {
  v <- as.name({{x}})
  v.1 <- as.name(paste0({{x}}, ".1"))
  v.2 <- as.name(paste0(({{x}}), ".2"))
  mydata %>% mutate({{ v }} = ifelse(is.na({{v.1}}), {{v.2}}, {{v.1}}))
})

returns

[[1]]
    name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 cheese
1  Sarah        1        1     NA      1    yum  gross      1
2   Ella        4        4      2      2    yum    yum      4
3 Carmen       NA        3      0      1   <NA>    yum      3
4  Dinah        6        5      4      4  gross   <NA>      6
5 Billie       NA       NA     NA      2   <NA>  gross     NA

[...]

which is one step closer to your desired output.

(2) But to get your desired output, I suggest using purrr:

library(purrr)
library(dplyr)

col.list %>% 
  map(~mydata %>% 
        select(name, starts_with(.x)) %>% 
        mutate({{ .x }} := ifelse(
          is.na(!!sym(paste0(.x, ".1"))), 
          !!sym(paste0(.x, ".2")), 
          !!sym(paste0(.x, ".1"))
          )
        )
  ) %>% 
  reduce(left_join, by = "name")

This returns

    name cheese.1 cheese.2 cheese milk.1 milk.2 milk tofu.1 tofu.2  tofu
1  Sarah        1        1      1     NA      1    1    yum  gross   yum
2   Ella        4        4      4      2      2    2    yum    yum   yum
3 Carmen       NA        3      3      0      1    0   <NA>    yum   yum
4  Dinah        6        5      6      4      4    4  gross   <NA> gross
5 Billie       NA       NA     NA     NA      2    2   <NA>  gross gross

CodePudding user response:

Here is one way I would do it. First convert to long format then reshape back to wide format but having only 2 value columns 1 and 2

library(dplyr)
library(tidyr)

mydata <- data.frame(name = c("Sarah","Ella","Carmen","Dinah","Billie"),
                     cheese.1 = c(1,4,NA,6,NA),
                     cheese.2 = c(1,4,3,5,NA),
                     milk.1 = c(NA,2,0,4,NA),
                     milk.2 = c(1,2,1,4,2),
                     tofu.1 = c("yum","yum",NA,"gross", NA),
                     tofu.2 = c("gross", "yum", "yum", NA, "gross"))
mydata_long <- mydata %>% 
  mutate(across(where(is.numeric), as.character)) %>% 
  pivot_longer(-name,
               names_to = c("food", "nr"),
               names_sep = "\\.")
mydata_long
#> # A tibble: 30 x 4
#>    name  food   nr    value
#>    <chr> <chr>  <chr> <chr>
#>  1 Sarah cheese 1     1    
#>  2 Sarah cheese 2     1    
#>  3 Sarah milk   1     <NA> 
#>  4 Sarah milk   2     1    
#>  5 Sarah tofu   1     yum  
#>  6 Sarah tofu   2     gross
#>  7 Ella  cheese 1     4    
#>  8 Ella  cheese 2     4    
#>  9 Ella  milk   1     2    
#> 10 Ella  milk   2     2    
#> # ... with 20 more rows

Apply ifelse() function after transforming back to different wide format

mydata_wide <- mydata_long %>% 
  pivot_wider(names_from = nr,
              values_from = value) %>% 
  mutate(final_val = ifelse(is.na(`1`), `2`, `1`)) %>% 
  arrange(food)
mydata_wide
#> # A tibble: 15 x 5
#>    name   food   `1`   `2`   final_val
#>    <chr>  <chr>  <chr> <chr> <chr>    
#>  1 Sarah  cheese 1     1     1        
#>  2 Ella   cheese 4     4     4        
#>  3 Carmen cheese <NA>  3     3        
#>  4 Dinah  cheese 6     5     6        
#>  5 Billie cheese <NA>  <NA>  <NA>     
#>  6 Sarah  milk   <NA>  1     1        
#>  7 Ella   milk   2     2     2        
#>  8 Carmen milk   0     1     0        
#>  9 Dinah  milk   4     4     4        
#> 10 Billie milk   <NA>  2     2        
#> 11 Sarah  tofu   yum   gross yum      
#> 12 Ella   tofu   yum   yum   yum      
#> 13 Carmen tofu   <NA>  yum   yum      
#> 14 Dinah  tofu   gross <NA>  gross    
#> 15 Billie tofu   <NA>  gross gross
mydata_wide2 <- mydata_wide %>% 
  pivot_wider(-c(`1`, `2`),
              names_from = food,
              values_from = final_val) 
mydata_wide2
#> # A tibble: 5 x 4
#>   name   cheese milk  tofu 
#>   <chr>  <chr>  <chr> <chr>
#> 1 Sarah  1      1     yum  
#> 2 Ella   4      2     yum  
#> 3 Carmen 3      0     yum  
#> 4 Dinah  6      4     gross
#> 5 Billie <NA>   2     gross

Created on 2021-10-29 by the reprex package (v2.0.1)

CodePudding user response:

I would use purrr::map_dfc and coalesce here. Looks pretty straightforward.

library(purrr)
library(dplyr)
library(stringr)

mydata %>% mutate(map2_dfc(select(., ends_with('1')),
                           select(., ends_with('2')),
                           ~coalesce(.x, .y)))%>%
  select(-ends_with('2'))%>%
  rename_with(~str_remove(.x, '\\.\\d $'))

    name cheese milk  tofu
1  Sarah      1    1   yum
2   Ella      4    2   yum
3 Carmen      3    0   yum
4  Dinah      6    4 gross
5 Billie     NA    2 gross

CodePudding user response:

Another tidyverse option. Advantage here is that it keeps the original data type and doesn‘t convert everything to character values.

library(tidyverse)
mydata %>%
  pivot_longer(cols = -name,
               names_pattern = '(.*)(\\..)',
               names_to = c('.value', 'number')) %>%
  group_by(name) %>%
  mutate(across(-number, ~if_else(is.na(.[1]), .[2], .[1]))) %>%
  ungroup() %>%
  filter(number == '.1') %>%
  select(-number)

Which gives

# A tibble: 5 x 4
  name   cheese  milk tofu 
  <chr>   <dbl> <dbl> <chr>
1 Sarah       1     1 yum  
2 Ella        4     2 yum  
3 Carmen      3     0 yum  
4 Dinah       6     4 gross
5 Billie     NA     2 gross

Alternative solution with coalesce:

mydata %>%
  pivot_longer(cols = -name,
               names_pattern = '(.*)(\\..)',
               names_to = c('.value', 'number')) %>%
  group_by(name) %>%
  mutate(across(-number, ~coalesce(.[1], .[2]))) %>%
  ungroup() %>%
  filter(number == '.1') %>%
  select(-number)

CodePudding user response:

Here is how you can achieve your task:

  1. define your pairs (in case you have hundreds of columns, this could be automated.
  2. use imap_dfc to apply coalesce do the defined pairs
  3. bind to original dataframe
library(dplyr)
library(purrr)

pairs <- list(cheese = c(2, 3), milk = c(4, 5), tofu = c(6, 7))

imap_dfc(pairs, ~mydata[, .x] %>% transmute(!!.y := coalesce(!!!syms(names(mydata)[.x])))) %>% 
  bind_cols(mydata)
  cheese milk  tofu   name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2
1      1    1   yum  Sarah        1        1     NA      1    yum  gross
2      4    2   yum   Ella        4        4      2      2    yum    yum
3      3    0   yum Carmen       NA        3      0      1   <NA>    yum
4      6    4 gross  Dinah        6        5      4      4  gross   <NA>
5     NA    2 gross Billie       NA       NA     NA      2   <NA>  gross

CodePudding user response:

Here is a pretty simple and dynamic option. Since it uses tidyselect, if there are more than just two columns (eg cheese.1, cheese.2, and cheese.3) this will still work. This will also work if the columns are unbalanced (eg 3 cheese columns, but only 2 milk columns):

library(purrr)
library(stringr)
library(rlang)

str_exp <- map(set_names(col.list), ~ str_glue("exec(coalesce, !!!across(starts_with(\"{.x}\")))"))
express <- map(str_exp, parse_expr)

mydata %>%
  mutate(!!! express,
         .keep = "unused")

Output

The other columns were removed by .keep = "unused". If you want to keep all the columns then delete that argument.

    name cheese milk  tofu
1  Sarah      1    1   yum
2   Ella      4    2   yum
3 Carmen      3    0   yum
4  Dinah      6    4 gross
5 Billie     NA    2 gross

How it works

  1. The use of map and set_names is important because this creates a named list, which is important for the big-bang !!! operator later. The first map creates a named list of string representations of expressions.
  2. The use of exec and coalesce allows the dynamic tidy-selection of columns.
  3. The second map turns each of these elements into an expression. As mentioned in the comments by @MartinGal you could combine this into one map with a pipe, but I broke it out so it was more readable.
  4. The !!! operator force-splices the list of objects and the names for the columns are from the list names set up using map and set_names.
  • Related