Optimization: splitting a column into a thousand columns in R or SQLite


I need to analyse data from a very large dataset. For that, I need to separate a character variable into more than a thousand columns.

The structure of this variable is:

number$number$number$ and so on for a thousand numbers

My data is stored in an SQLite .db file, which I then imported into R using the "RSQLite" package.
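Roughly, the import looks like this (the file and table names below are placeholders, not the real ones):

library(DBI)

# "data.db" and "my_table" are placeholder names for my actual file and table
con <- dbConnect(RSQLite::SQLite(), "data.db")
d <- dbGetQuery(con, "SELECT * FROM my_table")
dbDisconnect(con)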

I tried splitting this column into multiple columns using tidyr's separate():

library(dplyr)
library(tidyr)

# d is a data.table with my data
d2 <- d %>%
  separate(column_to_separate, into = paste0("S", 1:number_of_final_columns))

It works, but it takes forever. Does anyone have a solution to split this column faster (either in R or in SQLite)?

Thanks.

CodePudding user response:

You may use the tidyfast package (see here), which builds on data.table. In this test, it is approximately three times faster:

test <- data.frame(
  long.var = rep(paste0("V", 1:1000, "$", collapse = ""), 1000)
)
system.time({
  test |> 
    tidyr::separate(long.var, into = paste0("N", 1:1001), sep="\\$")
})
#>    user  system elapsed 
#>   0.352   0.012   0.365

system.time({
  test |> 
    tidyfast::dt_separate(long.var, into = paste0("N", 1:1001), sep="\\$")
})
#>    user  system elapsed 
#>   0.117   0.000   0.118

Created on 2023-02-03 with reprex v2.0.2
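Applied to the data from your question, it would look something like this (d and column_to_separate are the names from your post, and I'm assuming 1000 output columns; adjust to your data):

library(tidyfast)

# d and column_to_separate come from the question; 1000 output columns assumed.
# dt_separate() treats sep literally by default; if your version interprets it
# as a regex, use sep = "\\$" as in the benchmark above.
d2 <- dt_separate(d, column_to_separate,
                  into = paste0("S", 1:1000),
                  sep = "$")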

CodePudding user response:

You can try writing the data to a file as-is and then loading it back with fread, which is generally quite fast.

library(data.table)
library(dplyr)
library(tidyr)

# Prepare example
x <- matrix(rnorm(1000*10000), ncol = 1000)
dta <- data.frame(value = apply(x, 1, function(x) paste0(x, collapse = "$")))

# Run benchmark: tidyr::separate vs. an fwrite/fread round trip
microbenchmark::microbenchmark({
    dta_2 <- dta %>%
      separate(col = value, sep = "\\$", into = paste0("col_", 1:1000))
  },
  {
    tmp_file <- tempfile()
    fwrite(dta, tmp_file, col.names = FALSE)  # no header row, matching header = FALSE below
    dta_3 <- fread(tmp_file, sep = "$", header = FALSE)
  }, times = 3
)

Edit: I tested the speed and it seems to be faster than dt_separate from tidyfast, but it depends on the size of your dataset.
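If you want to apply this directly to your data, a sketch could look like this (d and column_to_separate are the names from your question, and I'm assuming that column holds only the "$"-delimited string):

library(data.table)

# Write only the delimited column without a header, then let fread split it on "$"
tmp_file <- tempfile()
fwrite(list(d$column_to_separate), tmp_file, col.names = FALSE, quote = FALSE)
d_wide <- fread(tmp_file, sep = "$", header = FALSE)
setnames(d_wide, paste0("S", seq_len(ncol(d_wide))))  # name the new columns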
