How to shorten this long dplyr syntax?

Time:10-25

In a tibble, I would like to be able to correct certain values taken by the variables nbeta_dep01, nbeta_dep02 ...

Below is a reproducible example of what I'm doing.

I would like to know if there is a way to shorten the syntax. In my example I copy and paste the correction instruction once for every nbeta_depXX variable.

suppressMessages(library(dplyr))

test <- tribble(
  ~ent, ~dep_impl, ~nbeta_dep01, ~nbeta_dep02, ~nbeta_dep03, ~nbeta_dep04, ~nbeta_dep05,
  "a",  "01",  0,   0,   0,   0,   0,  
  "b",  "03",  2,   0,   3,   0,   1,
  "c",  "05",  0,   0,   0,   1,   0,
  "d",  "02",  0,   0,   0,   0,   0
)

test %>% 
  rowwise() %>% 
  mutate(
    nbeta_dep01 = ifelse(
      nbeta_dep01==0 & nbeta_dep02==0 & nbeta_dep03==0 & nbeta_dep04==0 & nbeta_dep05==0 & dep_impl=="01",
      1,
      nbeta_dep01),
    nbeta_dep02 = ifelse(
      nbeta_dep01==0 & nbeta_dep02==0 & nbeta_dep03==0 & nbeta_dep04==0 & nbeta_dep05==0 & dep_impl=="02",
      1,
      nbeta_dep02),
    nbeta_dep03 = ifelse(
      nbeta_dep01==0 & nbeta_dep02==0 & nbeta_dep03==0 & nbeta_dep04==0 & nbeta_dep05==0 & dep_impl=="03",
      1,
      nbeta_dep03),
    nbeta_dep04 = ifelse(
      nbeta_dep01==0 & nbeta_dep02==0 & nbeta_dep03==0 & nbeta_dep04==0 & nbeta_dep05==0 & dep_impl=="04",
      1,
      nbeta_dep04),
  )
#> # A tibble: 4 x 7
#> # Rowwise: 
#>   ent   dep_impl nbeta_dep01 nbeta_dep02 nbeta_dep03 nbeta_dep04 nbeta_dep05
#>   <chr> <chr>          <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
#> 1 a     01                 1           0           0           0           0
#> 2 b     03                 2           0           3           0           1
#> 3 c     05                 0           0           0           1           0
#> 4 d     02                 0           1           0           0           0
Created on 2021-10-25 by the reprex package (v2.0.1)

CodePudding user response:

You could use

library(dplyr)
library(stringr)

test %>% 
  mutate(across(matches("dep\\d+$"), 
       ~ifelse(rowSums(across(nbeta_dep01:nbeta_dep05)) == 0 & dep_impl == str_extract(cur_column(), "\\d+$"),
               1,
               .x)))

This returns

# A tibble: 4 x 7
  ent   dep_impl nbeta_dep01 nbeta_dep02 nbeta_dep03 nbeta_dep04 nbeta_dep05
  <chr> <chr>          <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
1 a     01                 1           0           0           0           0
2 b     03                 2           0           3           0           1
3 c     05                 0           0           0           1           0
4 d     02                 0           1           0           0           0
  • We identify the columns to be changed with a regular expression: "dep\\d+$" matches all columns whose names end with "dep" followed by one or more digits (two digits in this example). Those columns are passed to across().
  • The condition is simplified: since all nbeta_dep columns need to be 0, we check that their row sum is 0, using rowSums() on the columns selected by an inner across(). This shortcut assumes the values are never negative. We also check whether the digits at the end of the current column name match the value in dep_impl.
  • If both conditions are met, we return 1; otherwise we return .x, the value already in the current column and row.
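
To see what each of those steps computes without the across() machinery, here is a minimal base R sketch of the same logic. It rebuilds the question's data as a plain data frame; the names dep_cols, all_zero, suffix and hit are made up for this illustration, and the row-sum shortcut again assumes the counts are never negative or NA.

```r
# Same data as in the question, as a base R data frame
test <- data.frame(
  ent         = c("a", "b", "c", "d"),
  dep_impl    = c("01", "03", "05", "02"),
  nbeta_dep01 = c(0, 2, 0, 0),
  nbeta_dep02 = c(0, 0, 0, 0),
  nbeta_dep03 = c(0, 3, 0, 0),
  nbeta_dep04 = c(0, 0, 1, 0),
  nbeta_dep05 = c(0, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Step 1: columns whose names end in "dep" followed by digits
dep_cols <- grep("dep\\d+$", names(test), value = TRUE)

# Step 2: rows where every nbeta_dep column is 0
# (assumption: values are non-negative and not NA)
all_zero <- rowSums(test[dep_cols]) == 0

# Step 3: trailing digits of each selected column name
suffix <- regmatches(dep_cols, regexpr("\\d+$", dep_cols))

# Step 4: for each column, set the cell to 1 where the row is all zero
# and dep_impl equals that column's suffix
for (i in seq_along(dep_cols)) {
  hit <- all_zero & test$dep_impl == suffix[i]
  test[[dep_cols[i]]][hit] <- 1
}

test
```

Because all_zero is computed once before the loop, each column is checked against the original values, which is also how the across() call behaves.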

CodePudding user response:

You can select all the columns whose names share a common prefix with the function starts_with():

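test %>% 
  mutate(across(starts_with("nbeta"),
         ~ifelse(
      nbeta_dep01==0 & nbeta_dep02==0 & nbeta_dep03==0 & nbeta_dep04==0 & nbeta_dep05==0 &
        # compare dep_impl with the digits of the column being processed
        dep_impl==sub("nbeta_dep", "", cur_column()),
      1,
      # otherwise keep the current column's own value
      .x)))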
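
The long chain of == 0 comparisons could probably be collapsed as well. The following is only a sketch, assuming a dplyr version that provides if_all() (1.0.4 or later, as far as I know). The helper column all_zero is a name invented for this example and is dropped at the end. The digits of the column being processed are taken from cur_column() with sub().

```r
library(dplyr)

test <- tribble(
  ~ent, ~dep_impl, ~nbeta_dep01, ~nbeta_dep02, ~nbeta_dep03, ~nbeta_dep04, ~nbeta_dep05,
  "a",  "01",  0,   0,   0,   0,   0,
  "b",  "03",  2,   0,   3,   0,   1,
  "c",  "05",  0,   0,   0,   1,   0,
  "d",  "02",  0,   0,   0,   0,   0
)

result <- test %>%
  # all_zero (hypothetical helper): TRUE where every nbeta_dep column is exactly 0
  mutate(all_zero = if_all(starts_with("nbeta_dep"), ~ .x == 0)) %>%
  mutate(across(
    starts_with("nbeta_dep"),
    # set to 1 only in the column whose digits equal dep_impl
    ~ ifelse(all_zero & dep_impl == sub("nbeta_dep", "", cur_column()), 1, .x)
  )) %>%
  # remove the helper column again
  select(-all_zero)

result
```

Unlike a row sum, the if_all() test checks each column for equality with 0, so it does not rely on the values being non-negative.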