Home > OS >  Vectorize dataframe in a for loop
Vectorize dataframe in a for loop

Time:07-25

How would you vectorize this for loop in R? This is just a small example but i would need to do this for 11 columns.

for(i in 2:nrow(df1)){
  if(df1[i, 'sku2'] == ''){
    df1[i,'sku2'] <- df1[i - 1, 'sku2']
  }
}
return(df1)

Data

df1 <- data.frame(sku2 = rep(c(1:2, ""), times = 5),
                  a    = rep(c(3:4, ""), times = 5))

CodePudding user response:

The task here, if I understand it, is to replace blanks in each column with the most recent non-blank of that column.

Here's a vectorized tidyverse approach:

library(tidyverse)
df1 %>%
 mutate_all(na_if,"") %>%
 fill(names(df1), .direction = "down)

This takes the df1 data frame, converts blanks to NAs, then uses tidyr::fill on every column to use the last non-blank value.

I expect this will be much faster [edit: it is 100-1000x faster] than your loop, but if you want "the fastest possible" approach, it may be worth looking into approaches using the data.frame or collapse packages.

Performance

For an example with 100,000 rows and 3 columns, this approach was 100x faster: 0.156 seconds instead of 15.5 seconds.

For 200k rows, the loop took 150x as much time: 31 seconds vs. 0.2 seconds.

For 400k rows, the loop was 600x slower: 123 seconds vs. still 0.2 sec for the vectorized version.

I ran out of patience for testing larger data with the loop, but the vectorized version is still only 0.4 seconds with 5 million rows...

set.seed(42)
n = 1E5
df1 <- data.frame(sku2 = sample(c(1:4, ""), n, replace = TRUE),
                  sku3 = sample(c(1:8, ""), n, replace = TRUE),
                  sku4 = sample(c(1:20, ""), n, replace = TRUE))

CodePudding user response:

You were on the right path with vectorization. This for loop can be replaced entirely by using vectorised operations. My approach was using the indexing operator [] .

df1[df1[1] == "", 1] <- df1[which(df1[1] == "") - 1, 1]

df1[1] == "" returns a boolean result. I needed to use which to convert it into a row number, in order to do -1 row.

To make it work across 11 columns and to avoid copy paste, here is a function:

rm.blank <- function(x){
  df1[df1[x] == "", x]  <- df1[which(df1[x] == "") - 1, x]
  return(df1[x])
}

I just did not figure out yet how to write a second function that applies the first function to all columns. Sooo... I created another for loop.

for (i in 1:ncol(df1)) {
  df1[i] <- rm.blank(i)
}
  • Related