I have a function f that needs to be applied to a single column of length n in segments of length m, where m divides n. (For example, for a column of 1000 values, apply f to the first 250 values, then to values 251-500, and so on.)
An explicit loop seems impractical, since the column has over 16 million values. I was thinking the efficient way would be to separate the column of length n into q vectors of length m, where mq = n. Then I could apply f simultaneously to all these vectors using some lapply-like functionality. Then I could join the q vectors to obtain the transformed version of the column.
Is that the efficient way to go here? If so, what function could decompose a column into q vectors of equal length, and what function should I use to broadcast f across the q vectors?
Lastly, although less importantly, what if we wanted to do this to several columns and not just one?
CodePudding user response:
One way to do it is to create an auxiliary variable that labels each segment, so you can apply your function per segment. Depending on your function, you can use group_by and/or summarize. An example:
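library(dplyr)

df <- data.frame(
  x = rnorm(15),
  y = rnorm(15),
  z = rnorm(15)
)

df %>%
  mutate(
    # segment id: rows 1-5 -> 1, rows 6-10 -> 2, rows 11-15 -> 3
    aux = rep(1:3, each = nrow(df) / 3),
    # example transformation: shift each column by 2 * segment id
    across(.cols = c(x, y, z), .fns = ~ . + 2 * aux)
  )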
          x        y        z aux
1  2.164841 2.882465 2.139098   1
2  2.364115 2.205598 2.410275   1
3  2.552158 1.383564 1.441543   1
4  1.398107 1.265201 2.605371   1
5  1.006301 1.868197 1.493666   1
6  5.026785 4.310017 2.579434   2
7  4.751061 2.960320 4.127993   2
8  2.490833 3.815691 5.945851   2
9  3.904853 4.967267 4.800914   2
10 3.104052 3.891720 5.165253   2
11 3.929249 5.301579 6.358856   3
12 6.150120 5.724055 5.391443   3
13 5.920788 7.114649 5.797759   3
14 5.902631 6.550044 5.726752   3
15 6.216153 7.236676 5.531300   3
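If f needs to see a whole segment at once rather than work row by row, the same auxiliary column can drive group_by. The following is only a sketch: the function f (centering each segment on its mean) and the segment length m = 5 are placeholders chosen for illustration, not taken from the question.

```r
library(dplyr)

# Hypothetical f: subtract the segment mean (stand-in for the real function)
f <- function(v) v - mean(v)

m <- 5  # assumed segment length; must divide nrow(df)
set.seed(1)
df <- data.frame(x = rnorm(15), y = rnorm(15), z = rnorm(15))

res <- df %>%
  mutate(aux = rep(seq_len(n() / m), each = m)) %>%  # segment id per row
  group_by(aux) %>%
  mutate(across(c(x, y, z), f)) %>%                  # f applied within each segment
  ungroup()
```

Because f here returns a vector as long as its input, mutate is the right verb; summarize would fit an f that reduces each segment to a single value.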
CodePudding user response:
Given the size of your data set, you might be in SQL territory. If, however, you're intent on solving this in R, I would recommend the data.table package, which uses multiple threads for many of its operations out of the box. In data.table this would be as simple as:
# dataframe must be a data.table (e.g. convert it with setDT(dataframe))
# for a single col
dataframe[, new_column := f(column)]
# for multiple cols
col_names <- c("a", "b", "c")
dataframe[, (col_names) := lapply(.SD, f), .SDcols = col_names]
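Note that the calls above apply f to each column as a whole. To apply it in segments of length m, as the question asks, one option is to group on a segment id computed in by. This is a sketch under assumptions: f (a cumulative sum), m = 4, and the toy table dt are all illustrative.

```r
library(data.table)

# Hypothetical f: cumulative sum within a segment (stand-in for the real function)
f <- function(v) cumsum(v)

m <- 4  # assumed segment length; must divide nrow(dt)
dt <- data.table(a = 1:12, b = 12:1, c = rep(1, 12))

# Segment id computed on the fly: rows 1..m -> 0, rows m+1..2m -> 1, ...
col_names <- c("a", "b", "c")
dt[, (col_names) := lapply(.SD, f),
   by = .(segment = (seq_len(nrow(dt)) - 1) %/% m),
   .SDcols = col_names]
```

The columns are updated by reference, so no copy of the table is made. If f changes a column's type (for example integer to double), it is safer to convert the column first.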
Otherwise, if you want to go base R, then you're probably looking for split() and lapply().
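A rough base R sketch of that idea follows. The function f (scaling each segment by its maximum), the vector x, and m = 4 are assumptions made for illustration. The second option reshapes the column into a matrix, which works here because all segments have the same length.

```r
# Hypothetical f: scale each segment so its maximum is 1 (stand-in for the real function)
f <- function(v) v / max(v)

x <- as.numeric(1:12)  # example column, n = 12
m <- 4                 # assumed segment length; must divide length(x)

# Option 1: split() + lapply(), then join the pieces back in order
seg <- rep(seq_len(length(x) / m), each = m)
out1 <- unlist(lapply(split(x, seg), f), use.names = FALSE)

# Option 2: reshape into an m-row matrix so each column is one segment
out2 <- as.vector(apply(matrix(x, nrow = m), 2, f))
```

For several columns, either option can be wrapped in a helper function and applied to each column with lapply.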