I'm working on a data frame consisting of 528 columns and 2,643,246 rows. Eight of these columns are character variables; the rest are integers. In total this adds up to 11.35 GiB of data, with 164 GiB of RAM available.
I now want to run `pivot_longer` on this data frame, producing one row per column per combination of two ID variables (year and institution). There are a total of 671,370 institutions over 76 years.
At the moment the data are structured like this:
Institution | Year | X | Y | Z |
---|---|---|---|---|
A | 1 | 2 | 1 | 3 |
A | 2 | 3 | 4 | 4 |
B | 1 | 3 | 4 | 2 |
B | 2 | 5 | 3 | 2 |
Where I would like to change it so the structure becomes:
Institution | Year | G | N |
---|---|---|---|
A | 1 | X | 2 |
A | 1 | Y | 1 |
A | 1 | Z | 3 |
A | 2 | X | 3 |
A | 2 | Y | 1 |
A | 2 | Z | 4 |
B | 1 | X | 3 |
B | 1 | Y | 4 |
B | 1 | Z | 2 |
B | 2 | X | 5 |
B | 2 | Y | 3 |
B | 2 | Z | 2 |
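As a minimal reproducible example, the toy data above and the reshape I'm after look like this:

```r
library(tidyr)

# toy version of the wide table from the question
df <- data.frame(Institution = c("A", "A", "B", "B"),
                 Year = c(1, 2, 1, 2),
                 X = c(2, 3, 3, 5),
                 Y = c(1, 4, 4, 3),
                 Z = c(3, 4, 2, 2))

# one row per (Institution, Year, variable) combination
long <- pivot_longer(df, cols = c(X, Y, Z),
                     names_to = "G", values_to = "N")
```

On data of this size the reshape works exactly as expected; the problem only appears at full scale.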
To achieve this I attempted the following code:
library(tidyverse)

Df <- Df %>%
  pivot_longer(17:527,
               names_to = "G",
               values_to = "N")
When running this on a small data sample I get the expected result; however, when attempting the same on the whole dataset I quickly run out of memory. From the object's 11 GiB, memory usage quickly climbed above 150 GiB before returning a "cannot allocate vector of size x Gb" error.
Since I haven't added any data, I can't quite see where the extra memory usage comes from. What causes this increase, and is there a more memory-efficient way to solve this problem with some other code? Thanks in advance for any help!
CodePudding user response:
I have no way to test the code on your data, but here is one idea.
The idea is to perform the wide-to-long transformation on one chunk of rows at a time, storing each result in a list. At the end, combine the list into the final data frame. Hopefully this reduces peak memory usage.
If that does not work, see whether `melt` from `data.table` can convert the data more efficiently.
One other idea that could be helpful: subset `Df` by removing columns 1 to 16 before the wide-to-long transformation, keeping only an ID column. You can join columns 1 to 16 back onto the converted data frame later.
library(tidyverse)

Df_list <- list()
Institution <- unique(Df$Institution)

# reshape one institution's rows at a time to limit peak memory
for (i in Institution) {
  Df_temp <- Df %>%
    filter(Institution %in% i) %>%
    pivot_longer(17:527, names_to = "G", values_to = "N")
  Df_list[[i]] <- Df_temp
}

Df <- bind_rows(Df_list)
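The column-subsetting idea above could be sketched like this (using a toy stand-in for `Df`, since the real column names aren't known; `Region` stands in for the extra columns you would set aside):

```r
library(dplyr)
library(tidyr)

# toy stand-in for the real Df: two ID columns, extra metadata, value columns
Df <- tibble(
  Institution = c("A", "A", "B", "B"),
  Year        = c(1, 2, 1, 2),
  Region      = c("N", "N", "S", "S"),  # stands in for columns set aside
  X = c(2, 3, 3, 5),
  Y = c(1, 4, 4, 3),
  Z = c(3, 4, 2, 2)
)

# pivot only the ID and value columns, leaving the metadata out
long <- Df %>%
  select(Institution, Year, X, Y, Z) %>%
  pivot_longer(c(X, Y, Z), names_to = "G", values_to = "N")

# join the remaining columns back by the ID variables afterwards
result <- long %>%
  left_join(select(Df, Institution, Year, Region),
            by = c("Institution", "Year"))
```

The intermediate `long` object then only carries the ID and value columns through the expensive reshape, and the metadata is attached once at the end.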
CodePudding user response:
For data of this size, `reshape2` or `data.table`'s pivoting will likely be more memory-efficient than `tidyr`'s. On a smaller 800 MB sample, `reshape2::melt` requires around 2.6x as much memory as the original data and `data.table::melt` around 3.6x, while the `tidyr::pivot_longer` approach in this case needs about 12x as much memory and is about 20x slower.
EDIT: Upon reviewing the warnings, I realized that my earlier draft was actually using `reshape2` under the hood when I called `data.table::melt`, since I was feeding it a tibble. I have added an explicit `data.table::melt` solution that first converts the data to a data.table. For this data, `reshape2::melt` seems to be both faster and more memory-efficient.
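For the asker's data, the `data.table` route might look like the following sketch (shown on toy data; the real call would use the question's column positions, e.g. `id.vars = 1:16`). `setDT` converts by reference, so no extra copy of the wide data is made:

```r
library(data.table)

# toy wide table standing in for the real Df
Df <- data.frame(Institution = c("A", "A", "B", "B"),
                 Year = c(1, 2, 1, 2),
                 X = c(2, 3, 3, 5),
                 Y = c(1, 4, 4, 3),
                 Z = c(3, 4, 2, 2))

setDT(Df)  # convert to data.table in place, avoiding a full copy
long <- melt(Df,
             id.vars = c("Institution", "Year"),
             variable.name = "G",
             value.name = "N")
```

Note that `data.table::melt` stacks column by column (all `X` rows, then all `Y` rows, and so on) and returns `G` as a factor, whereas `pivot_longer` interleaves the variables within each row; sort afterwards if the row order matters.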
Sample data
n1 <- as.integer(2E8 / 26)
set.seed(42)
data_long <- data.frame(Institution = 1:n1,
                        G = rep(letters, each = n1),
                        Z = rpois(26 * n1, 10))
data_wide <- data_long %>% pivot_wider(names_from = G, values_from = Z)
Benchmarks
bench::mark(
tidyr::pivot_longer(data_wide, -1),
reshape2::melt(data_wide, id.vars = 1),
data.table::melt(data.table::as.data.table(data_wide), id.vars = 1),
check = FALSE,
iterations = 10
)
# A tibble: 3 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 tidyr::pivot_longer(data_wide, -1) 26.77s 37.38s 0.0269 10.52GB 0.0538 10 20 6.2m
2 reshape2::melt(data_wide, id.vars = 1) 1.25s 1.73s 0.519 2.23GB 0.156 10 3 19.3s
3 data.table::melt(data.table::as.data.table(data_wide), id.vars = 1) 1.82s 2.41s 0.332 3.01GB 0.232 10 7 30.1s
# … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>