I'm working on a data frame consisting of 528 columns and 2,643,246 rows. Eight of these columns are character variables; the rest are integers. In total this adds up to 11.35 GiB of data, with 164 GiB of RAM available.
I now want to run `pivot_longer` on this data frame, producing one row per column per combination of two ID variables (year and institution). There are a total of 671,370 institutions over 76 years.
At the moment the data are structured like this:
Institution | Year | X | Y | Z |
---|---|---|---|---|
A | 1 | 2 | 1 | 3 |
A | 2 | 3 | 4 | 4 |
B | 1 | 3 | 4 | 2 |
B | 2 | 5 | 3 | 2 |
Where I would like to change it so the structure becomes:
Institution | Year | G | N |
---|---|---|---|
A | 1 | X | 2 |
A | 1 | Y | 1 |
A | 1 | Z | 3 |
A | 2 | X | 3 |
A | 2 | Y | 1 |
A | 2 | Z | 4 |
B | 1 | X | 3 |
B | 1 | Y | 4 |
B | 1 | Z | 2 |
B | 2 | X | 5 |
B | 2 | Y | 3 |
B | 2 | Z | 2 |
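As a minimal reproducible example, the toy data above and the reshape I'm after look like this:

```r
library(tidyr)

# toy version of the wide table from the question
df <- data.frame(Institution = c("A", "A", "B", "B"),
                 Year = c(1, 2, 1, 2),
                 X = c(2, 3, 3, 5),
                 Y = c(1, 4, 4, 3),
                 Z = c(3, 4, 2, 2))

# one row per (Institution, Year, variable) combination
long <- pivot_longer(df, cols = c(X, Y, Z),
                     names_to = "G", values_to = "N")
```

On data of this size the reshape works exactly as expected; the problem only appears at full scale.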
To achieve this I attempted the following code:
library(tidyverse)

Df <- Df %>%
  pivot_longer(17:527,
               names_to = "G",
               values_to = "N")
When running this on a small data sample I get the expected result; however, when attempting the same on the whole dataset I quickly run out of memory. From the object's 11 GiB, memory usage quickly climbed above 150 GiB before returning a "cannot allocate vector of size x Gb" error.
Since I haven't added any data, I can't quite see where the extra memory usage comes from. What causes this increase, and is there a more memory-efficient way to solve this problem with some other code? Thanks in advance for any help!
CodePudding user response:
I have no way to test the code on your data, but here is one idea.
The idea is to perform the wide-to-long transformation on one chunk of rows at a time, storing each result in a list. At the end, combine the list into the final data frame. Hopefully this reduces peak memory usage.
If that does not work, see whether `melt` from `data.table` can convert the data more efficiently.
One other idea that could be helpful: subset `Df` by removing columns 1 to 16 before the wide-to-long transformation, keeping only an ID column. You can join columns 1 to 16 back onto the converted data frame later.
library(tidyverse)

Df_list <- list()
Institution <- unique(Df$Institution)

# reshape one institution's rows at a time to limit peak memory
for (i in Institution) {
  Df_temp <- Df %>%
    filter(Institution %in% i) %>%
    pivot_longer(17:527, names_to = "G", values_to = "N")
  Df_list[[i]] <- Df_temp
}

Df <- bind_rows(Df_list)
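The column-subsetting idea above could be sketched like this (using a toy stand-in for `Df`, since the real column names aren't known; `Region` stands in for the extra columns you would set aside):

```r
library(dplyr)
library(tidyr)

# toy stand-in for the real Df: two ID columns, extra metadata, value columns
Df <- tibble(
  Institution = c("A", "A", "B", "B"),
  Year        = c(1, 2, 1, 2),
  Region      = c("N", "N", "S", "S"),  # stands in for columns set aside
  X = c(2, 3, 3, 5),
  Y = c(1, 4, 4, 3),
  Z = c(3, 4, 2, 2)
)

# pivot only the ID and value columns, leaving the metadata out
long <- Df %>%
  select(Institution, Year, X, Y, Z) %>%
  pivot_longer(c(X, Y, Z), names_to = "G", values_to = "N")

# join the remaining columns back by the ID variables afterwards
result <- long %>%
  left_join(select(Df, Institution, Year, Region),
            by = c("Institution", "Year"))
```

The intermediate `long` object then only carries the ID and value columns through the expensive reshape, and the metadata is attached once at the end.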
CodePudding user response:
For data of this size, `reshape2` or `data.table`'s pivoting will likely be more memory-efficient than `tidyr`'s. On a smaller 800 MB sample, `reshape2::melt` requires around 2.6x as much memory as the original data and `data.table::melt` around 3.6x, while the `tidyr::pivot_longer` approach in this case needs about 12x as much memory and is about 20x slower.
EDIT: Upon reviewing the warnings, I realized that my earlier draft was actually using `reshape2` under the hood when I called `data.table::melt`, since I was feeding it a tibble. I have added an explicit `data.table::melt` solution that first converts the data to a data.table. For this data, `reshape2::melt` seems to be both faster and more memory-efficient.
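For the asker's data, the `data.table` route might look like the following sketch (shown on toy data; the real call would use the question's column positions, e.g. `id.vars = 1:16`). `setDT` converts by reference, so no extra copy of the wide data is made:

```r
library(data.table)

# toy wide table standing in for the real Df
Df <- data.frame(Institution = c("A", "A", "B", "B"),
                 Year = c(1, 2, 1, 2),
                 X = c(2, 3, 3, 5),
                 Y = c(1, 4, 4, 3),
                 Z = c(3, 4, 2, 2))

setDT(Df)  # convert to data.table in place, avoiding a full copy
long <- melt(Df,
             id.vars = c("Institution", "Year"),
             variable.name = "G",
             value.name = "N")
```

Note that `data.table::melt` stacks column by column (all `X` rows, then all `Y` rows, and so on) and returns `G` as a factor, whereas `pivot_longer` interleaves the variables within each row; sort afterwards if the row order matters.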
Sample data
n1 <- as.integer(2E8 / 26)
set.seed(42)
data_long <- data.frame(Institution = 1:n1,
                        G = rep(letters, each = n1),
                        Z = rpois(26 * n1, 10))
data_wide <- data_long %>% pivot_wider(names_from = G, values_from = Z)
Benchmarks
bench::mark(
tidyr::pivot_longer(data_wide, -1),
reshape2::melt(data_wide, id.vars = 1),
data.table::melt(data.table::as.data.table(data_wide), id.vars = 1),
check = FALSE,
iterations = 10
)
# A tibble: 3 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 tidyr::pivot_longer(data_wide, -1) 26.77s 37.38s 0.0269 10.52GB 0.0538 10 20 6.2m
2 reshape2::melt(data_wide, id.vars = 1) 1.25s 1.73s 0.519 2.23GB 0.156 10 3 19.3s
3 data.table::melt(data.table::as.data.table(data_wide), id.vars = 1) 1.82s 2.41s 0.332 3.01GB 0.232 10 7 30.1s
# … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>