I have two data frames:
dat <- data.frame(Digits_Lower = 1:5,
Digits_Upper = 6:10,
random = 20:24)
dat
#> Digits_Lower Digits_Upper random
#> 1 1 6 20
#> 2 2 7 21
#> 3 3 8 22
#> 4 4 9 23
#> 5 5 10 24
cb <- data.frame(Digits = c("Digits_Lower", "Digits_Upper"),
x = 1:2,
y = 3:4)
cb
#> Digits x y
#> 1 Digits_Lower 1 3
#> 2 Digits_Upper 2 4
I am trying to perform some operation on multiple columns in dat
similar to these examples: In data.table: iterating over the rows of another data.table and R multiply columns by values in second dataframe. However, I
am hoping to operate on these columns with an extended expression for every value in its corresponding row in cb
. The solution should be applicable
for a large dataset. I have created this for-loop
so far.
dat.loop <- dat
for(i in seq_len(nrow(cb)))
{
#create new columns from the Digits column of `cb`
dat.loop[paste0("disp", sep = '.', cb$Digits[i])] <-
#some operation using every value in a column in `dat` with its corresponding row in `cb`
(dat.loop[, cb$Digits[i]]- cb$y[i]) * cb$x[i]
}
dat.loop
#> Digits_Lower Digits_Upper random disp.Digits_Lower disp.Digits_Upper
#> 1 1 6 20 -2 4
#> 2 2 7 21 -1 6
#> 3 3 8 22 0 8
#> 4 4 9 23 1 10
#> 5 5 10 24 2 12
I will then perform operations on the data that I appended to dat
in dat.loop
applying a similar
for-loop
, and then perform yet another operation on those values. My dataset is very large, and I imagine
my use of for-loops will become cumbersome. I am wondering:
Would another method improve efficiency such as using
data.table
ortidyverse
?How would I go about using another method, or improving my for-loop? My main confusion is how to write concise code to perform operations on columns in
dat
with corresponding rows incb
. Ideally, I would split my for-loop into multiple functions that would for example, avoid indexing intocb
for the same values over and over again or appending unnecessary data to my dataframe, but I'm not really sure how to do this.
Any help is appreciated!
EDIT:
I've modified the code @Desmond provided allowing for more generic code since dat
and cb
will be from user-inputted files,
and dat
can have a varying number of columns/ column names that I will be operating on (columns in dat
will always start with
"Digits_" and will be specified in the "Digits" column of cb
.
library(tidyverse)
results <- dat %>%
crossing(cb) %>%
rowwise() %>%
mutate(disp = (get(`Digits`)-y) *x ) %>%
ungroup() %>%
pivot_wider(names_from = Digits,
values_from = disp,
names_prefix = "disp_")
results2 <- results %>%
group_by(random) %>%
fill(starts_with("disp"), .direction = c("downup")) %>%
ungroup() %>%
select(-c(x,y)) %>%
unique()
results2
#> Digits_Lower Digits_Upper random disp_Digits_Lower disp_Digits_Upper
#> 1 1 6 20 -2 4
#> 2 2 7 21 -1 6
#> 3 3 8 22 0 8
#> 4 4 9 23 1 10
#> 5 5 10 24 2 12
CodePudding user response:
Here's a tidyverse solution:
crossing
generates combinations from both datasetscase_when
to apply your logicpivot_wider
,filter
andbind_cols
to clean up the output
To scale this to a large dataset, I suggest using the tidytable package. After loading it, simply replace crossing()
with crossing.()
, pivot_wider()
with pivot_wider.()
, etc
library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 4.2.1
#> Warning: package 'tibble' was built under R version 4.2.1
dat <- data.frame(
Digits_Lower = 1:5,
Digits_Upper = 6:10,
random = 20:24
)
cb <- data.frame(
Digits = c("Digits_Lower", "Digits_Upper"),
x = 1:2,
y = 3:4
)
results <- dat |>
crossing(cb) |>
mutate(disp = case_when(
Digits == "Digits_Lower" ~ (Digits_Lower - y) * x,
Digits == "Digits_Upper" ~ (Digits_Upper - y) * x
)) |>
pivot_wider(names_from = Digits,
values_from = disp,
names_prefix = "disp_")
results |>
filter(!is.na(disp_Digits_Lower)) |>
select(-c(x, y, disp_Digits_Upper)) |>
bind_cols(results |>
filter(!is.na(disp_Digits_Upper)) |>
select(disp_Digits_Upper))
#> # A tibble: 5 × 5
#> Digits_Lower Digits_Upper random disp_Digits_Lower disp_Digits_Upper
#> <int> <int> <int> <int> <int>
#> 1 1 6 20 -2 4
#> 2 2 7 21 -1 6
#> 3 3 8 22 0 8
#> 4 4 9 23 1 10
#> 5 5 10 24 2 12
Created on 2022-08-20 by the reprex package (v2.0.1)