Home > database >  R: Using STRSPLIT and GREP on vector elements on large dataset takes too long
R: Using STRSPLIT and GREP on vector elements on large dataset takes too long

Time:02-16

(My first StackFlow question)

My goal is to improve the ETL process for identifying which NetApp file shares are related to which AD permission distribution groups. Currently an application named 'TreeSize' scans a number of volumes and outputs a number of large .CSV files (35mb ). I want to merge this data and remove all permission information where each group (or named user) doesn't start with a capital G or D ('^[GD]'). With over 700,000 rows to process, it's currently taking me over 24hr to run. I hope there is a better way to process this data more efficiently to drastically cut that time down.

Here is test data which resembles actual data once all files have been merged. Use rownum to adjust size of data. (Real data 700000 )

Test Data

set.seed(42)
rownum <- 2000  #Real number over 700000
i <- 1
datalist <- list()

while (i <= rownum) {
  
  randomStr1 <- paste(sample(c(0:9, letters, LETTERS[4:7], "-"),10, replace=TRUE),collapse="")
  randomStr2 <- paste(sample(c(0:9, letters, LETTERS[4:7], " "),10, replace=TRUE),collapse="")
  randomStr3 <- paste(sample(c(0:9, letters, LETTERS[4:7], " & "),10, replace=TRUE),collapse="")
  randomStr4 <- sample(c("full", " r w x", " r x"),3)
  
  datalist$volume[i] <- rep(sample(LETTERS[1:6]))[1]
  datalist$permissions[i] <- paste(c(randomStr1,randomStr2,randomStr3),randomStr4,sep = ': ',collapse = ' | ')
  
  i = i 1
}
dat <- data.frame(datalist)
View(dat)

I have create a WHILE loop that cycles through my merged data. I first use STRSPLIT to create a vector containing each vector element between “ | “ each pipe. Then I pass each vector element in a GREP command searching for a RegExp of (‘^[GD]’). If found it keeps the vector element and if more than one was found it will collapse the data back together between a semicolon and space (“; “)

Here is my current way of doing this.

  i <- 1
while (i <= length(dat$permissions)) {
  df <- strsplit(dat$permissions, " \\| |: ")[[i]]              #create a vector containing each vector element      
  dat$permissions[i] <- paste(df[grep('^[GD]', df)], collapse = "; ") #Only keep where starts with G or D then Paste together
  print(paste(i, " of ", length(dat$permissions), " ", dat$permissions[i]))
i = i   1 }
View(dat)

After complete, I export to one .CSV file to complete my Transformation.

What would be a better way to process this data to drastically reduce the time it takes to process?

CodePudding user response:

One option to speed things up is to avoid splitting the string and to extract the matches directly using stringr::str_extract_all():

library(stringr)
library(purrr)

map_chr(str_extract_all(dat$permissions, "(?<=^|\\| )[GD].*?(?=:)"), str_c, collapse =  "; ")

This offers a further improvement on thelatemail's already fast alternative:

microbenchmark::microbenchmark(
  extract = map_chr(str_extract_all(dat$permissions, "(?<=^|\\| )[GD].*?(?=:)"), str_c, collapse =  "; "),
  splitmatch = sapply(strsplit(dat$permissions, " \\| |: "), \(x) paste(x[grepl("^[GD]", x)], collapse = "; ")),
  check = "equal"
)

Unit: milliseconds
       expr     min       lq      mean  median       uq     max neval
    extract  5.8841  6.14375  6.681234  6.3134  6.41655 15.1264   100
 splitmatch 23.6010 24.00005 25.501808 24.2499 24.78320 40.4140   100

CodePudding user response:

I think the key to speeding this up is to avoid looping over each row, when it can be done in a single vectorised operation for the strsplit and final paste operations.

paste(
  seq_along(dat$permissions), "of", nrow(dat), 
  lapply(strsplit(dat$permissions, " \\| |: "), 
         \(x) paste(x[grepl("^[GD]", x)], collapse="; "))
)

Should result in a ~250 times speedup:

system.time({
paste(
  seq_along(dat$permissions), "of", nrow(dat), 
  lapply(strsplit(dat$permissions, " \\| |: "), \(x) paste(x[grepl("^[GD]", x)], collapse="; "))
)
})
##   user  system elapsed 
##   0.03    0.00    0.03 

system.time({
  i <- 1
while (i <= length(dat$permissions)) {
  df <- strsplit(dat$permissions, " \\| |: ")[[i]]              #create a vector containing each vector element      
  dat$permissions[i] <- paste(df[grep('^[GD]', df)], collapse = "; ") #Only keep where starts with G or D then Paste together
  print(paste(i, " of ", length(dat$permissions), " ", dat$permissions[i]))
i = i   1 }
})
##   user  system elapsed 
##   8.11    0.06    8.17
  • Related