Is there way to optimize the speed for changing values in a >2 million row data frame?-CodePudding

I have got a tibble of more than 2 million rows. One of the columns size is a value using M to represent million, k to represent thousand; it also has some <NA> values. The column type is character, like the following:

size
1.3M
5k
302
8.6M
<NA>
4.4k
21

...and so on.

I tried the following code:

for (i in 1:length(example$size)) {
  if (!is.na(example$size[i])) {
    if (str_sub(example$size[i],-1,-1) == "M") {
      example$size[i] = as.numeric(str_sub(example$size[i], 1,-2)) * 1000000
    } else if (str_sub(example$size[i],-1,-1) == "k") {
      example$size[i] = as.numeric(str_sub(example$size[i], 1,-2)) * 1000
    }
  }
}

But it took more than half hour and still running, so I interrupted that as I was not sure if my code was wrong and it's in a infinite loop. Is there anything wrong or any way of coding to improve the efficiency?

CodePudding user response：

Try this instead:

size <- c("1.3M","5k",NA,21,"4.4k")

size <- ifelse(!is.na(size) & grepl("M",size),as.numeric(sub("M.*", "", size))*1000000,size)
size <- ifelse(!is.na(size) & grepl("k",size),as.numeric(sub("k.*", "", size))*1000,size)

output:

> size
[1] "1300000" "5000"    NA        "21"      "4400"

CodePudding user response：

tl;dr vectorizing speeds things up by a factor of 5, trying to be clever about avoiding replicate processing gets a 30-fold speed gain. Still takes about 1.5 seconds for a vector of length 50,000 (so expecting about 1 minute for 2 million entries ...)

both the original method and @KacZdr's suggestion produce character vectors, because replacing values within a character vector by numeric values coerces them back to character (you could always use as.numeric() at the end); @KacZdr's solution gives warnings.

size <- c("1.3M","5k",NA,21,"4.4k")
bigsize <- c(replicate(1e4, size))  # big(ish) example for benchmarking

## process outside of function to avoid repetition

prefixes <- c("M"=1e6, "k"=1e3)
re <- sprintf("[%s]", paste(names(prefixes), collapse =""))

rep1 <- function(size) {
    rx <- regexpr(re, size)         ## find matches
    w <- which(!is.na(rx) & rx > 0) ## indices for replacement
    sw <- size[w]
    vals <- prefixes[substr(sw, rx[w], rx[w])]      ## find letter values
    result <- numeric(length(size))                 ## allocate result vector
    result[-w] <- as.numeric(size[-w])              ## assign non-suffixed values
    result[w] <- as.numeric(sub(re, "", sw))*vals   ## assign suffixed values
    result
}

Wrap the other two approaches in functions for benchmarking:

rep2 <- function(size) {
    size <- ifelse(!is.na(size) & grepl("M",size),as.numeric(sub("M.*", "", size))*1000000,size)
    size <- ifelse(!is.na(size) & grepl("k",size),as.numeric(sub("k.*", "", size))*1000,size)
    return(size)
}

Original:

library(stringr)
rep3 <- function(size) {
    for (i in 1:length(size)) {
        if (!is.na(size[i])) {
            if (str_sub(size[i],-1,-1) == "M") {
                size[i] = as.numeric(str_sub(size[i], 1,-2)) * 1000000
            } else if (str_sub(size[i],-1,-1) == "k") {
                size[i] = as.numeric(str_sub(size[i], 1,-2)) * 1000
            }
        }
    }
    size
}

library(rbenchmark)
benchmark(rep1(bigsize), rep2(bigsize), rep3(bigsize))[,1:5]

           test replications elapsed relative user.self
1 rep1(bigsize)          100   1.451    1.000     1.452
2 rep2(bigsize)          100   7.812    5.384     7.807
3 rep3(bigsize)          100  41.489   28.593    41.485