R improve loop efficiency: Operating on columns that correspond to rows in a second dataframe


I have two data frames:

dat <- data.frame(Digits_Lower = 1:5,
                  Digits_Upper = 6:10,
                  random = 20:24)
dat
#>   Digits_Lower Digits_Upper random
#> 1            1            6     20
#> 2            2            7     21
#> 3            3            8     22
#> 4            4            9     23
#> 5            5           10     24

cb <- data.frame(Digits = c("Digits_Lower", "Digits_Upper"),
                 x = 1:2, 
                 y = 3:4)
cb                 
#>         Digits x y
#> 1 Digits_Lower 1 3
#> 2 Digits_Upper 2 4

I am trying to perform some operation on multiple columns in dat, similar to these examples: In data.table: iterating over the rows of another data.table and R multiply columns by values in second dataframe. However, I want to operate on these columns with an extended expression for every value in its corresponding row in cb, and the solution should scale to a large dataset. I have written this for-loop so far:

dat.loop <- dat
for (i in seq_len(nrow(cb))) {
  # name the new column after the corresponding Digits entry in `cb`
  new_col <- paste("disp", cb$Digits[i], sep = ".")
  # operate on every value in a column of `dat` with its corresponding row in `cb`
  dat.loop[[new_col]] <- (dat.loop[[cb$Digits[i]]] - cb$y[i]) * cb$x[i]
}
dat.loop 
#>   Digits_Lower Digits_Upper random disp.Digits_Lower disp.Digits_Upper
#> 1            1            6     20                -2                 4
#> 2            2            7     21                -1                 6
#> 3            3            8     22                 0                 8
#> 4            4            9     23                 1                10
#> 5            5           10     24                 2                12
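As a point of comparison, the same result can be computed without growing dat.loop column-by-column. A minimal base-R sketch (same data as above; `Map` walks the rows of cb in lockstep, so each column of cb is indexed only once):

```r
dat <- data.frame(Digits_Lower = 1:5,
                  Digits_Upper = 6:10,
                  random = 20:24)
cb <- data.frame(Digits = c("Digits_Lower", "Digits_Upper"),
                 x = 1:2,
                 y = 3:4)

# Build all new columns in one pass: Map() iterates over the rows of `cb`,
# handing each function call one column name, one x, and one y.
new_cols <- Map(function(col, x, y) (dat[[col]] - y) * x,
                cb$Digits, cb$x, cb$y)
names(new_cols) <- paste("disp", cb$Digits, sep = ".")
dat.map <- cbind(dat, new_cols)
```

This produces the same five columns as the loop, without repeatedly re-indexing cb inside the loop body.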

I will then perform operations on the data appended to dat.loop using a similar for-loop, and then yet another operation on those values. My dataset is very large, and I imagine my use of for-loops will become cumbersome. I am wondering:

  1. Would another method, such as data.table or tidyverse, improve efficiency?

  2. How would I go about using another method, or improving my for-loop? My main confusion is how to write concise code that operates on columns in dat using the corresponding rows in cb. Ideally, I would split my for-loop into multiple functions that, for example, avoid indexing into cb for the same values over and over or appending unnecessary data to my dataframe, but I'm not sure how to do this.
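One way to factor the repeated indexing out of the loop is to do the whole computation on a matrix. A hedged sketch (the helper name `apply_cb` is made up for illustration; it assumes every column named in cb$Digits exists in dat):

```r
# Hypothetical helper: applies (col - y) * x for every row of a `cb`-style
# lookup table to the matching columns of a `dat`-style data frame.
apply_cb <- function(dat, cb) {
  m <- as.matrix(dat[cb$Digits])  # index the needed columns exactly once
  m <- sweep(m, 2, cb$y, "-")     # subtract each column's y
  m <- sweep(m, 2, cb$x, "*")     # multiply by each column's x
  colnames(m) <- paste("disp", cb$Digits, sep = ".")
  cbind(dat, m)
}

dat <- data.frame(Digits_Lower = 1:5, Digits_Upper = 6:10, random = 20:24)
cb <- data.frame(Digits = c("Digits_Lower", "Digits_Upper"),
                 x = 1:2, y = 3:4)
res <- apply_cb(dat, cb)
```

Wrapping the step as a function also makes it easy to reuse for the follow-up operations mentioned above.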

Any help is appreciated!

EDIT:

I've modified the code @Desmond provided to be more generic, since dat and cb will come from user-inputted files and dat can have a varying number of columns/column names that I will be operating on (columns in dat will always start with "Digits_" and will be specified in the "Digits" column of cb).

library(tidyverse)

results <- dat %>% 
  crossing(cb) %>% 
  rowwise() %>%
  mutate(disp = (get(Digits) - y) * x) %>%
  ungroup() %>% 
  pivot_wider(names_from = Digits,
              values_from = disp,
              names_prefix = "disp_")

results2 <- results %>% 
  group_by(random) %>% 
  fill(starts_with("disp"), .direction = "downup") %>% 
  ungroup() %>% 
  select(-c(x,y)) %>% 
  unique()
              
results2
#>   Digits_Lower Digits_Upper random disp_Digits_Lower disp_Digits_Upper
#> 1            1            6     20                -2                 4
#> 2            2            7     21                -1                 6
#> 3            3            8     22                 0                 8
#> 4            4            9     23                 1                10
#> 5            5           10     24                 2                12
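On question 1: a data.table version of the same computation, sketched under the assumption that the columns to operate on are exactly those listed in cb$Digits (melt to long form, join the per-column parameters from cb, compute, then cast back to wide):

```r
library(data.table)

dat <- data.frame(Digits_Lower = 1:5, Digits_Upper = 6:10, random = 20:24)
cb <- data.frame(Digits = c("Digits_Lower", "Digits_Upper"),
                 x = 1:2, y = 3:4)

DT <- as.data.table(dat)
CB <- as.data.table(cb)

# long form: one row per (random, Digits) pair
long <- melt(DT, id.vars = "random", measure.vars = CB$Digits,
             variable.name = "Digits", variable.factor = FALSE)
# join each column's x/y parameters from `cb`, then compute disp by reference
long <- CB[long, on = "Digits"]
long[, disp := (value - y) * x]
# back to wide: one disp column per entry of cb$Digits
wide <- dcast(long, random ~ Digits, value.var = "disp")
setnames(wide, CB$Digits, paste0("disp_", CB$Digits))
```

`wide` can then be joined back onto dat by the random column if the original columns are needed alongside the disp columns.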

CodePudding user response:

Here's a tidyverse solution:

  • crossing generates combinations from both datasets
  • case_when to apply your logic
  • pivot_wider, filter and bind_cols to clean up the output

To scale this to a large dataset, I suggest using the tidytable package. After loading it, simply replace crossing() with crossing.(), pivot_wider() with pivot_wider.(), etc.

library(tidyverse)

dat <- data.frame(
  Digits_Lower = 1:5,
  Digits_Upper = 6:10,
  random = 20:24
)

cb <- data.frame(
  Digits = c("Digits_Lower", "Digits_Upper"),
  x = 1:2,
  y = 3:4
)

results <- dat |>
  crossing(cb) |>
  mutate(disp = case_when(
    Digits == "Digits_Lower" ~ (Digits_Lower - y) * x,
    Digits == "Digits_Upper" ~ (Digits_Upper - y) * x
  )) |>
  pivot_wider(names_from = Digits,
              values_from = disp,
              names_prefix = "disp_")

results |>
  filter(!is.na(disp_Digits_Lower)) |>
  select(-c(x, y, disp_Digits_Upper)) |> 
  bind_cols(results |> 
              filter(!is.na(disp_Digits_Upper)) |> 
              select(disp_Digits_Upper))
#> # A tibble: 5 × 5
#>   Digits_Lower Digits_Upper random disp_Digits_Lower disp_Digits_Upper
#>          <int>        <int>  <int>             <int>             <int>
#> 1            1            6     20                -2                 4
#> 2            2            7     21                -1                 6
#> 3            3            8     22                 0                 8
#> 4            4            9     23                 1                10
#> 5            5           10     24                 2                12

Created on 2022-08-20 by the reprex package (v2.0.1)
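For reference, the tidytable swap described above could look roughly like this. A sketch only: current tidytable releases export the un-dotted names shown here (the dot-suffixed verbs mentioned above belong to older releases), and the logic simply mirrors the tidyverse pipeline from this answer:

```r
library(tidytable)

# same data as above
dat <- data.frame(Digits_Lower = 1:5, Digits_Upper = 6:10, random = 20:24)
cb <- data.frame(Digits = c("Digits_Lower", "Digits_Upper"),
                 x = 1:2, y = 3:4)

# identical pipeline, but backed by data.table via tidytable
results <- dat |>
  crossing(cb) |>
  mutate(disp = case_when(
    Digits == "Digits_Lower" ~ (Digits_Lower - y) * x,
    Digits == "Digits_Upper" ~ (Digits_Upper - y) * x
  )) |>
  pivot_wider(names_from = Digits, values_from = disp,
              names_prefix = "disp_")
```

The clean-up steps (filtering the NA halves and binding the columns back together) carry over unchanged.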
