I've got some code looking like this:
library(stringi)
df_values <- data.frame(value = stri_rand_strings(n = 500,
length = 30))
df_keys <- tibble(key = sample(x = 1:500,
size = 25000,
replace = TRUE))
# start timer
start_time <- Sys.time()
df_keys |>
rowwise() |>
mutate(value = df_values$value[key])
# end timer
end_time <- Sys.time()
end_time - start_time
Which requires very much time to run, but I can't figure out why. The code above only requires 0.3003931 seconds. For my real code I subsetted the tibble with head(n)
and got following times:
n | time in secs |
---|---|
50 | 1.993536 |
100 | 3.731 |
200 | 6.550074 |
300 | 9.500864 |
500 | 15.68515 |
1,000 | 32.19306 |
... | seems to be linear |
20,000 | maybe 10 minutes |
Does someone have an idea what could be wrong with my code? I guess it's the indexing-part df_values$value[key]
? But my original df_values
also is a data.frame with 500 obs.
CodePudding user response:
A possible solution, in base R
. As we can see, the execution time takes only 1% of the time, compared to your dplyr
approach. Even removing rowwise
, the execution time is extremely faster with a base R
approach.
library(tidyverse)
library(stringi)
# start timer
start_time <- Sys.time()
df_keys |>
rowwise() |>
mutate(value = df_values$value[key])
#> # A tibble: 25,000 × 2
#> # Rowwise:
#> key value
#> <int> <chr>
#> 1 287 BeFLZsuRxlKJAJLgOnH1SO2f6kjpPH
#> 2 292 yG1JoxKRzSDnBlk4fJKDcKwzAUGwOy
#> 3 334 38pJ1h3RaTTSDgcf7gyCuW2NqFyncZ
#> 4 120 LqqCmTiMQV50hV0c0yYzk94AtpV7I6
#> 5 233 62BsX6NAEQqYx5wjm5ienCYgDmvJDb
#> 6 413 OB2MqTt1SOTb3irKlLEBtr4MfvuWW5
#> 7 123 4IKKUTli7c1l8GwU8TTpWHLHirGCy8
#> 8 400 aDnB9PwIKQkdfAW5kwzM215vU9aCNk
#> 9 214 aOsJkVENbncaHESiU2rwmfXqY5yVsK
#> 10 332 v4DfYVOr9kedtIwnWFlefDfFhHJ25R
#> # … with 24,990 more rows
# end timer
end_time <- Sys.time()
end_time - start_time
#> Time difference of 0.1876147 secs
start_time <- Sys.time()
df_keys$value <- df_values$value[df_keys$key]
end_time <- Sys.time()
end_time - start_time
#> Time difference of 0.002212286 secs