I have a tibble of more than 2 million rows. One of the columns, size, holds values that use M to represent million and k to represent thousand; it also has some <NA> values. The column type is character, like the following:
size
1.3M
5k
302
8.6M
<NA>
4.4k
21
...and so on.
I tried the following code:
for (i in 1:length(example$size)) {
  if (!is.na(example$size[i])) {
    if (str_sub(example$size[i], -1, -1) == "M") {
      example$size[i] = as.numeric(str_sub(example$size[i], 1, -2)) * 1000000
    } else if (str_sub(example$size[i], -1, -1) == "k") {
      example$size[i] = as.numeric(str_sub(example$size[i], 1, -2)) * 1000
    }
  }
}
But it ran for more than half an hour without finishing, so I interrupted it, as I was not sure whether my code was wrong and stuck in an infinite loop. Is there anything wrong with it, or is there a more efficient way to write it?
CodePudding user response:
Try this instead:
size <- c("1.3M","5k",NA,21,"4.4k")
size <- ifelse(!is.na(size) & grepl("M",size),as.numeric(sub("M.*", "", size))*1000000,size)
size <- ifelse(!is.na(size) & grepl("k",size),as.numeric(sub("k.*", "", size))*1000,size)
output:
> size
[1] "1300000" "5000" NA "21" "4400"
CodePudding user response:
tl;dr vectorizing speeds things up by a factor of 5; trying to be clever about avoiding repeated processing gets a 30-fold speed gain. Still takes about 1.5 seconds for a vector of length 50,000 (so expecting about 1 minute for 2 million entries ...)
- both the original method and @KacZdr's suggestion produce character vectors, because replacing values within a character vector by numeric values coerces them back to character (you could always use as.numeric() at the end); @KacZdr's solution gives warnings.
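The coercion described in the bullet above can be seen in a minimal sketch. The vector `x` and the name `x_num` are made up for illustration and are not part of either answer:

```r
x <- c("5k", "21", NA)      # character vector, as in the question
x[1] <- 5 * 1000            # the numeric 5000 is stored as the string "5000"
class(x)                    # still "character"

x_num <- as.numeric(x)      # convert once no suffixes are left
class(x_num)                # "numeric"
```

This is why a final as.numeric() is needed whenever the replacement is done element by element inside the original character vector.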
size <- c("1.3M","5k",NA,21,"4.4k")
bigsize <- c(replicate(1e4, size)) # big(ish) example for benchmarking
## process outside of function to avoid repetition
prefixes <- c("M"=1e6, "k"=1e3)
re <- sprintf("[%s]", paste(names(prefixes), collapse =""))
rep1 <- function(size) {
  rx <- regexpr(re, size) ## find matches (NA for NA inputs)
  w <- which(!is.na(rx) & rx > 0) ## indices for replacement
  sw <- size[w]
  vals <- prefixes[substr(sw, rx[w], rx[w])] ## find letter values
  result <- numeric(length(size)) ## allocate result vector
  ## note: if w is empty (no suffixed values at all), -w selects nothing
  result[-w] <- as.numeric(size[-w]) ## assign non-suffixed values
  result[w] <- as.numeric(sub(re, "", sw))*vals ## assign suffixed values
  result
}
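One caveat with negative indexing: when no element carries a suffix, `w` is empty, `result[-w]` selects nothing, and the function returns all zeros. A sketch of a variant that uses a logical mask instead is shown here. The name `rep1_safe` is hypothetical, and this variant was not part of the benchmark in this answer:

```r
prefixes <- c("M" = 1e6, "k" = 1e3)
re <- sprintf("[%s]", paste(names(prefixes), collapse = ""))

rep1_safe <- function(size) {
  rx <- regexpr(re, size)                 # position of the suffix, -1 if none, NA for NA
  has <- !is.na(rx) & rx > 0              # logical mask: TRUE where a suffix was found
  result <- numeric(length(size))         # allocate result vector
  result[!has] <- as.numeric(size[!has])  # plain numbers and NA values
  sw <- size[has]
  vals <- prefixes[substr(sw, rx[has], rx[has])]     # multiplier for each suffix
  result[has] <- as.numeric(sub(re, "", sw)) * vals  # strip suffix, then scale
  result
}

rep1_safe(c("1.3M", "5k", NA, "21", "4.4k"))
rep1_safe(c("302", "21", NA))   # no suffixes at all
```

A logical mask behaves the same whether or not anything matches, which is the reason for preferring it over `-w` here.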
Wrap the other two approaches in functions for benchmarking:
rep2 <- function(size) {
  size <- ifelse(!is.na(size) & grepl("M",size),as.numeric(sub("M.*", "", size))*1000000,size)
  size <- ifelse(!is.na(size) & grepl("k",size),as.numeric(sub("k.*", "", size))*1000,size)
  return(size)
}
Original:
library(stringr)
rep3 <- function(size) {
  for (i in 1:length(size)) {
    if (!is.na(size[i])) {
      if (str_sub(size[i],-1,-1) == "M") {
        size[i] = as.numeric(str_sub(size[i], 1,-2)) * 1000000
      } else if (str_sub(size[i],-1,-1) == "k") {
        size[i] = as.numeric(str_sub(size[i], 1,-2)) * 1000
      }
    }
  }
  size
}
library(rbenchmark)
benchmark(rep1(bigsize), rep2(bigsize), rep3(bigsize))[,1:5]
           test replications elapsed relative user.self
1 rep1(bigsize)          100   1.451    1.000     1.452
2 rep2(bigsize)          100   7.812    5.384     7.807
3 rep3(bigsize)          100  41.489   28.593    41.485
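For completeness, here is a minimal sketch of applying a vectorized conversion to a whole column at once. It assumes every suffix sits at the end of the string, as in the question's sample. It uses a plain data.frame as a stand-in for the tibble, and the column name `size_num` is made up for illustration. This alternative was not benchmarked above.

```r
# Stand-in for the question's tibble (assumption: a data.frame behaves the same here)
example <- data.frame(size = c("1.3M", "5k", "302", NA), stringsAsFactors = FALSE)

# Multiplier per row: 1e6 for "M", 1e3 for "k", 1 otherwise (NA stays NA)
mult <- ifelse(endsWith(example$size, "M"), 1e6,
               ifelse(endsWith(example$size, "k"), 1e3, 1))

# Strip a trailing M or k, convert once, and scale
example$size_num <- as.numeric(sub("[Mk]$", "", example$size)) * mult
example
```

Because the suffix is removed before as.numeric() is called, this sketch yields a numeric column directly and should not trigger coercion warnings on input shaped like the sample.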