Loop on several variables with the same suffix in R

I have a data frame which looks like this, but with many more rows and columns.

Several variables (x, y, z) are measured at different times (1, 2, 3).

library(dplyr) # provides tibble(), mutate() and %>%

df <-
  tibble(
    x1 = rnorm(10),
    x2 = rnorm(10),
    x3 = rnorm(10),
    y1 = rnorm(10),
    y2 = rnorm(10),
    y3 = rnorm(10),
    z1 = rnorm(10),
    z2 = rnorm(10),
    z3 = rnorm(10),
  )

I am trying to create dummy variables from the variables with the same suffix (measured at the same time) like this:

df <- df %>% 
  mutate(var1= ifelse(x1>0 & (y1<0.5 |z1<0.5),0,1)) %>% 
  mutate(var2= ifelse(x2>0 & (y2<0.5 |z2<0.5),0,1)) %>%
  mutate(var3= ifelse(x3>0 & (y1<0.5 |z3<0.5),0,1))

I am used to coding in SAS or Stata, so I would like to use a function or a loop because I have many more variables in my database. But I think I don't have the right approach in R to deal with this.

Thank you very much for your help !

CodePudding user response：

{dplyover} makes this kind of operation easy (disclaimer: I'm the maintainer), assuming that your desired output contains a typo (y1 instead of y3 in the var3 line).

I think you want to use all variables with the same digit (1, 2, 3 and so on) in each calculation:

df <- df %>% 
  mutate(var1= ifelse(x1>0 & (y1<0.5 |z1<0.5),0,1)) %>% 
  mutate(var2= ifelse(x2>0 & (y2<0.5 |z2<0.5),0,1)) %>%
  mutate(var3= ifelse(x3>0 & (y3<0.5 |z3<0.5),0,1))

If that is the case, we can use dplyover::over() to apply the same function over a vector. Here we construct the vector with cut_names("^[a-z]{1}"), which removes the leading letter from each column name and returns the unique remainders, here c("1", "2", "3"). (extract_names("[0-9]{1}$"), which extracts the trailing digit instead, should give the same vector.) We can then construct the variable names using a special syntax: .("x{.x}"). Here .x evaluates to the current element of the vector, so for the first element it returns the object x1 (not a string!), which we can use inside the function argument of over().

library(dplyr)
library(dplyover) # Only on GitHub: https://github.com/TimTeaFan/dplyover

df %>% 
  mutate(over(cut_names("^[a-z]{1}"),
              ~ ifelse(.("x{.x}") > 0 & (.("y{.x}") < 0.5 | .("z{.x}") < 0.5), 0, 1),
              .names = "var{x}"
              ))

#> # A tibble: 10 x 12
#>        x1      x2      x3      y1     y2     y3     z1     z2       z3  var1
#>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>    <dbl> <dbl>
#>  1  0.690  0.550   0.911   0.203  -0.111  0.530 -2.09   0.189  0.147       0
#>  2 -0.238  1.32   -0.145   0.744   1.05  -0.448  2.05  -1.04   1.50        1
#>  3  0.888  0.898  -1.46   -1.87   -1.14   1.59   1.91  -0.155  1.46        0
#>  4 -2.78  -1.34   -0.486  -0.0674  0.246  0.141  0.154  1.08  -0.319       1
#>  5 -1.20   0.835   1.28   -1.32   -0.674  0.115  0.362  1.06   0.515       1
#>  6  0.622 -0.713   0.0525  1.79   -0.427  0.819 -1.53  -0.885  0.00237     0
#>  7 -2.54   0.0197  0.942   0.230  -1.37  -1.02  -1.55  -0.721 -1.06        1
#>  8 -0.434  1.97   -0.274   0.848  -0.482 -0.422  0.197  0.497 -0.600       1
#>  9 -0.316 -0.219   0.467  -1.97   -0.718 -0.442 -1.39  -0.877  1.52        1
#> 10 -1.03   0.226   2.04    0.432  -1.02  -0.535  0.954 -1.11   0.804       1
#> # ... with 2 more variables: var2 <dbl>, var3 <dbl>
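
For readers without {dplyover}, here is a minimal base R sketch of the two ideas described above: deriving the vector of suffixes from the column names, and building the matching x/y/z names for each suffix. It only illustrates the concept and is not how over() is implemented; the objects nms, suffixes and cols_per_time are names chosen for this example.

```r
# Column names as in the question
nms <- c("x1", "x2", "x3", "y1", "y2", "y3", "z1", "z2", "z3")

# Remove the leading letter and keep the unique remainders
suffixes <- unique(sub("^[a-z]{1}", "", nms))
suffixes

# For every suffix, build the names of the x, y and z columns
cols_per_time <- lapply(suffixes, function(s) paste0(c("x", "y", "z"), s))
cols_per_time
```

Note that the suffixes come back as character strings, which is what you want when pasting them into column names.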

Alternatively we can use dplyr::across and use cur_column(), get() and gsub() to alter the name of the column on the fly. To name the new variables correctly we use gsub() in the .names argument of across and wrap it in curly braces {} to evaluate the expression.

library(dplyr)

df %>% 
  mutate(across(starts_with("x"),
                ~ {
                  cur_c <- dplyr::cur_column()
                  ifelse(.x > 0 & (get(gsub("x","y", cur_c)) < 0.5 | get(gsub("x","z", cur_c)) < 0.5), 0, 1)
                },
                .names = '{gsub("x", "var", .col)}'
                ))

#> # A tibble: 10 x 12
#>         x1      x2     x3     y1      y2     y3      z1      z2      z3  var1
#>      <dbl>   <dbl>  <dbl>  <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl> <dbl>
#>  1 -0.423  -1.42   -1.15  -1.54   1.92   -0.511 -0.739   0.501   0.451      1
#>  2 -0.358   0.164   0.971 -1.61   1.96   -0.675 -0.0188 -1.88    1.63       1
#>  3 -0.453  -0.758  -0.258 -0.449 -0.795  -0.362 -1.81   -0.780  -1.90       1
#>  4  0.855   0.335  -1.36   0.796 -0.674  -1.37  -1.42   -1.03   -0.560      0
#>  5  0.436  -0.0487 -0.639  0.352 -0.325  -0.893 -0.746   0.0548 -0.394      0
#>  6 -0.228  -0.240  -0.854 -0.197  0.884   0.118 -0.0713  1.09   -0.0289     1
#>  7 -0.949  -0.231   0.428  0.290 -0.803   2.15  -1.11   -0.202  -1.21       1
#>  8  1.88   -0.0980 -2.60  -1.86  -0.0258 -0.965 -1.52   -0.539   0.108      0
#>  9  0.221   1.58   -1.46  -0.806  0.749   0.506  1.09    0.523   1.86       0
#> 10  0.0238 -0.389  -0.474  0.512 -0.448   0.178  0.529   1.56   -1.12       1
#> # ... with 2 more variables: var2 <dbl>, var3 <dbl>

Created on 2022-06-08 by the reprex package (v2.0.1)

CodePudding user response：

You could restructure your data along the principles of tidy data (see e.g. https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html).

Here is a conversion to a long format using the tidyverse:

library(tidyverse)

df <-
  df |>
  pivot_longer(everything()) |>
  separate(name, c("var", "time"), sep = "(?=[0-9])") |>
  pivot_wider(id_cols = "time",
              names_from = "var",
              names_prefix = "var_",
              values_from = "value",
              values_fn = list) |>
  unnest(-time) |>
  mutate(new_var = ifelse(var_x > 0 & (var_y < 0.5 | var_z < 0.5), 0, 1))

  df

You would probably want to keep the data in a long format, but if you want, you can pivot_wider and get back to the format you started with. E.g.

df |>
  pivot_wider(values_from = c(starts_with("var_"), "new_var"),
              names_from = "time",
              values_fn = list) |> 
  unnest(everything())
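
A possibly shorter route to the same long format is the special ".value" entry in the names_to argument of pivot_longer(), which splits each column name into a variable part and a time part in one step. This is a sketch under some assumptions: the id column and the names long and var are added for this example, and the pattern "([a-z])([0-9]+)" assumes one-letter variable names followed by a numeric time.

```r
library(dplyr)
library(tidyr)

set.seed(1)
df <- tibble(
  x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10),
  y1 = rnorm(10), y2 = rnorm(10), y3 = rnorm(10),
  z1 = rnorm(10), z2 = rnorm(10), z3 = rnorm(10)
)

long <- df |>
  # keep track of which observation each row came from
  mutate(id = row_number()) |>
  # "x1" becomes column x with time "1", "y2" becomes column y with time "2", ...
  pivot_longer(-id,
               names_to = c(".value", "time"),
               names_pattern = "([a-z])([0-9]+)") |>
  mutate(var = ifelse(x > 0 & (y < 0.5 | z < 0.5), 0, 1))

long
```

Each observation now occupies one row per time, so the dummy is computed by a single mutate() no matter how many time points there are.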

CodePudding user response：

As you suggested, a solution using a loop is definitely possible.

# times as unique non-alphabetical parts of column names
times <- unique(gsub('[[:alpha:]]', '', names(df)))
for (time in times) {
  
  # column names for current time
  xyz <- paste0(c('x', 'y', 'z'), time)
  df[[paste0('var', time)]] <- 
    ifelse(df[[xyz[1]]]>0 & (df[[xyz[2]]]<.5 | df[[xyz[3]]]<.5), 0, 1)
}
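
If the loop is needed in several places, it could be wrapped in a function. The sketch below is one way to do that; add_time_dummies and its vars argument are hypothetical names for this example. Unlike the loop above, it derives the times only from columns that look like the first variable followed by digits, so unrelated columns (such as an id) should not confuse it.

```r
add_time_dummies <- function(data, vars = c("x", "y", "z")) {
  # times = numeric suffixes of the columns belonging to the first variable
  first_cols <- grep(paste0("^", vars[1], "[0-9]+$"), names(data), value = TRUE)
  times <- sub(paste0("^", vars[1]), "", first_cols)

  for (time in times) {
    cols <- paste0(vars, time)
    # fail early if a variable is missing for this time
    stopifnot(all(cols %in% names(data)))
    data[[paste0("var", time)]] <-
      ifelse(data[[cols[1]]] > 0 &
               (data[[cols[2]]] < 0.5 | data[[cols[3]]] < 0.5), 0, 1)
  }
  data
}

# Usage with made-up data that also contains an unrelated id column
set.seed(1)
dat <- data.frame(id = 1:10,
                  x1 = rnorm(10), x2 = rnorm(10),
                  y1 = rnorm(10), y2 = rnorm(10),
                  z1 = rnorm(10), z2 = rnorm(10))
res <- add_time_dummies(dat)
names(res)
```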

Another way I can think of is transforming the data into a 3D array (observation × time × variable) so that you can actually do the computation for all times at once.

times <- unique(gsub('[[:alpha:]]', '', names(df)))
df.arr <- sapply(c('x', 'y', 'z'), 
                 function(var) as.matrix(df[, paste0(var, times)]), 
                 simplify='array')
new.vars <- ifelse(df.arr[, , 1]>0 & (df.arr[, , 2]<0.5 | df.arr[, , 3]<0.5), 0, 1)
colnames(new.vars) <- paste0('var', times)
cbind(df, new.vars)

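Here, sapply creates one matrix per variable (observations in rows, times in columns) and stacks these matrices into a 3D array, so df.arr[, , 1] holds all x measurements, df.arr[, , 2] all y measurements and df.arr[, , 3] all z measurements.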
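
To see the layout of that array, it can help to inspect its dimensions on a small example. The sketch below uses made-up data in a plain data.frame (an assumption for this example; the question uses a tibble) and otherwise repeats the sapply call from above.

```r
set.seed(1)
# 10 observations of x, y, z at times 1, 2, 3 (made-up data)
df <- as.data.frame(matrix(
  rnorm(90), nrow = 10,
  dimnames = list(NULL, paste0(rep(c("x", "y", "z"), each = 3), 1:3))
))

times <- unique(gsub("[[:alpha:]]", "", names(df)))
df.arr <- sapply(c("x", "y", "z"),
                 function(var) as.matrix(df[, paste0(var, times)]),
                 simplify = "array")

dim(df.arr)             # observations, times, variables
dimnames(df.arr)[[3]]   # the third dimension is indexed by variable

# The slice for y is the matrix of the y columns
all.equal(unname(df.arr[, , "y"]), unname(as.matrix(df[, c("y1", "y2", "y3")])))
```

The column names of the second dimension are inherited from the first matrix (x1, x2, x3), so they are best read as time labels rather than as variable names.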

If you trust (or ensure) the correct ordering of columns in the data frame (x1, x2, x3, y1, ..., z3, with no other columns), you can create the array just by modifying the object's dimensions instead of using sapply. I didn't do any benchmarking, but I guess this could be the most computationally efficient solution (if it matters).

df.arr <- as.matrix(df)
dim(df.arr) <- c(dim(df.arr) / c(1, 3), 3)
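
Put together with the computation from the previous snippet, the reshaping approach could look like the sketch below. It uses made-up data and assumes the data frame contains exactly the nine columns in the order x1, x2, x3, y1, ..., z3; the final lines compare one of the new columns against the direct column-wise formula as a sanity check.

```r
set.seed(42)
df <- as.data.frame(matrix(
  rnorm(90), nrow = 10,
  dimnames = list(NULL, paste0(rep(c("x", "y", "z"), each = 3), 1:3))
))

# Reinterpret the 10 x 9 matrix as a 10 x 3 x 3 array:
# columns 1-3 (x) become slice 1, columns 4-6 (y) slice 2, columns 7-9 (z) slice 3
df.arr <- as.matrix(df)
dim(df.arr) <- c(dim(df.arr) / c(1, 3), 3)

new.vars <- ifelse(df.arr[, , 1] > 0 & (df.arr[, , 2] < 0.5 | df.arr[, , 3] < 0.5), 0, 1)
colnames(new.vars) <- paste0("var", 1:3)
result <- cbind(df, new.vars)

# Sanity check against the direct formula for time 2
direct2 <- ifelse(df$x2 > 0 & (df$y2 < 0.5 | df$z2 < 0.5), 0, 1)
all.equal(result$var2, direct2)
```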