Apply function to column by segments in R

Time:12-12

I have a function f that needs to be applied to a single column of length n in segments of length m, where m divides n. (For example, for a column of 1000 values, apply f to the first 250 values, then to values 251-500, and so on.)

An explicit loop seems too slow, since the column has over 16 million values. I was thinking the efficient way would be to split the column of length n into q vectors of length m, where mq = n. Then I could apply f to all these vectors at once using some lapply-like functionality, and join the q results to obtain the transformed version of the column.

Is that the efficient way to go here? If so, what function could decompose a column into q vectors of equal length and what function should I use to broadcast f across the q vectors?

Lastly, although less importantly, what if we wanted to do this to several columns and not just one?

CodePudding user response:

One way to do it is to create an auxiliary variable that labels the segments, so the function can be applied per segment to each column. Depending on your function, you can use group_by and/or summarize. An example:

df <- data.frame(
  x = rnorm(15),
  y = rnorm(15),
  z = rnorm(15)
)

library(dplyr)

df %>%
  mutate(
    aux = rep(1:3, each = nrow(df) / 3),
    across(.cols = c(x, y, z), .fns = ~ . + 2 * aux)
  )

          x        y        z aux
1  2.164841 2.882465 2.139098   1
2  2.364115 2.205598 2.410275   1
3  2.552158 1.383564 1.441543   1
4  1.398107 1.265201 2.605371   1
5  1.006301 1.868197 1.493666   1
6  5.026785 4.310017 2.579434   2
7  4.751061 2.960320 4.127993   2
8  2.490833 3.815691 5.945851   2
9  3.904853 4.967267 4.800914   2
10 3.104052 3.891720 5.165253   2
11 3.929249 5.301579 6.358856   3
12 6.150120 5.724055 5.391443   3
13 5.920788 7.114649 5.797759   3
14 5.902631 6.550044 5.726752   3
15 6.216153 7.236676 5.531300   3
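
The example above uses aux inside a vectorised expression. For a function that has to see one whole segment at a time, grouping by aux first is one option. The following is only a sketch: f here is a hypothetical stand-in (it centres each segment on its own mean), and the segment count of 3 is taken from the example above.

```r
library(dplyr)

set.seed(1)
df <- data.frame(x = rnorm(15), y = rnorm(15), z = rnorm(15))

# Hypothetical per-segment function: subtract the segment mean.
f <- function(v) v - mean(v)

out <- df %>%
  mutate(aux = rep(1:3, each = n() / 3)) %>%  # segment id, 5 rows per segment
  group_by(aux) %>%
  mutate(across(c(x, y, z), f)) %>%           # f sees one segment at a time
  ungroup()
```

Because f is called once per group and per column, it may use segment-level quantities such as mean(v), as long as it returns either one value or one value per row of the segment.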

CodePudding user response:

Given the size of your data set, you might be in SQL territory. If, however, you are intent on solving this in R, I would recommend the data.table package, which is fast on large tables and uses multiple threads internally for some operations. In data.table this would be as simple as

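# dataframe must be a data.table, e.g. after data.table::setDT(dataframe)

# for a single col (add by = <segment id> to apply f within each segment)
dataframe[, new_column := f(column)]

# for multiple cols
col_names <- c("a", "b", "c")
dataframe[, (col_names) := lapply(.SD, f), .SDcols = col_names]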
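
To get the segment-wise behaviour asked about in the question, the same calls can take a by argument. This is a minimal sketch under assumptions: the table dt, the segment id column seg, and the function f are all hypothetical and only serve as illustration.

```r
library(data.table)

set.seed(1)
n <- 1000L  # column length
m <- 250L   # segment length; must divide n
dt <- data.table(a = rnorm(n), b = rnorm(n), c = rnorm(n))

# Hypothetical per-segment function: subtract the segment mean.
f <- function(v) v - mean(v)

# Segment id: rows 1..m get 1, rows m+1..2m get 2, and so on.
dt[, seg := rep(seq_len(n / m), each = m)]

# Single column: f is applied within each segment.
dt[, a_new := f(a), by = seg]

# Several columns at once, overwritten in place.
col_names <- c("a", "b", "c")
dt[, (col_names) := lapply(.SD, f), by = seg, .SDcols = col_names]
```

Note that data.table calls f once per segment, so any speed-up comes from low grouping overhead and in-place assignment, not from running an arbitrary R function in parallel.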

Otherwise, if you want to go base R, then you're probably looking for split() and lapply.
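
A base R sketch of that idea follows. The vector x stands in for the column and f is a hypothetical function, so both are assumptions. It shows split() with lapply() and unsplit(), then ave() as a one-call equivalent, then a matrix reshape that only applies when f returns m values per segment.

```r
set.seed(1)
n <- 1000L  # column length
m <- 250L   # segment length; must divide n
x <- rnorm(n)

# Hypothetical per-segment function: subtract the segment mean.
f <- function(v) v - mean(v)

# Segment id for every element: 1 for the first m values, 2 for the next m, ...
seg <- rep(seq_len(n / m), each = m)

# Option 1: split into q = n / m vectors, apply f, reassemble in original order.
pieces <- lapply(split(x, seg), f)
out_split <- unsplit(pieces, seg)

# Option 2: ave() performs the split / apply / reassemble steps in one call.
out_ave <- ave(x, seg, FUN = f)

# Option 3: reshape to an m-by-q matrix, one segment per column.
out_mat <- as.vector(apply(matrix(x, nrow = m), 2, f))

# Several columns: repeat the per-column call over a data frame's columns.
df <- data.frame(a = x, b = rev(x))
df[c("a", "b")] <- lapply(df[c("a", "b")], function(col) ave(col, seg, FUN = f))
```

All three options still call f once per segment. With 16 million values the number of calls is n / m, so the cost depends mostly on how large m is and how expensive f is.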
