Looping through Dask array made of npy memmap files increases RAM without ever freeing it


Context

I am trying to load multiple .npy files containing 2D arrays into one big 2D array so I can process it chunk by chunk later.
The full dataset is bigger than my RAM, so I am using memmap-backed loading:

import glob
import os

import numpy as np
import dask.array as da

pattern = os.path.join(FROM_DIR, '*.npy')
paths = sorted(glob.glob(pattern))
arrays = [np.load(path, mmap_mode='r') for path in paths]
array = da.concatenate(arrays, axis=0)

No problem so far; RAM usage is very low.

Problem

Now that I have my big 2D array, I loop through it to process the data chunk by chunk:

chunk_size = 100_000
for i in range(0, 1_000_000, chunk_size):
    subset = np.array(array[i:i + chunk_size])
    # Process data [...]
    del subset

But even if I execute this block without any processing at all, the subsets seem to stay in RAM indefinitely.
It is as if Dask were loading or copying the memmapped arrays into real np.arrays behind the scenes. Deleting the variable and calling gc.collect() did not help.

CodePudding user response:

You've opened the files read-only, so presumably your changes aren't getting flushed to disk. It's impossible to say exactly what's happening without knowing what's in # Process data, but at first glance it seems like you should be using mmap_mode='r+' for starters.

See the memmap docs for a description of the read modes.
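A minimal sketch of the difference between the two read modes (the file path and values here are made up for illustration):

```python
import os
import tempfile

import numpy as np

# Hypothetical demo file, just for this sketch.
path = os.path.join(tempfile.gettempdir(), 'memmap_demo.npy')
np.save(path, np.zeros((4, 4)))

# mmap_mode='r'  -> read-only: assignments raise an error
# mmap_mode='r+' -> read-write: changes can be flushed back to disk
arr = np.load(path, mmap_mode='r+')
arr[0, 0] = 1.0
arr.flush()  # write the dirty pages back to the .npy file

reloaded = np.load(path)
print(reloaded[0, 0])  # 1.0
```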

In general, though, dask can't help you save memory if you're going to loop through the chunks and manually manipulate the arrays. Dask saves you from memory bottlenecks by managing the computation for you, scheduling operations across the whole array to fit within your system's constraints. So the ideal would be to drop the for loop and write your # process data block in terms of dask.array operations rather than in numpy. If the operation needs to be written in numpy (or other non-dask modules), you can map it across the array with dask.array.map_blocks. As your code is currently written, dask isn't doing anything except adding the overhead of a scheduler.
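A rough sketch of what that could look like, using a synthetic dask array in place of the memmap-backed one (the shapes and the normalize function are invented for illustration):

```python
import numpy as np
import dask.array as da

# Synthetic stand-in for the memmap-backed array from the question.
array = da.ones((100_000, 4), chunks=(10_000, 4))

# 1) Pure dask.array operations: nothing is loaded yet, and compute()
#    streams chunks through the scheduler rather than holding the
#    whole array in memory at once.
result = (array * 2).mean(axis=1)

# 2) A numpy-only function mapped across the chunks; each call
#    receives a single chunk as a plain ndarray.
def normalize(block):
    return (block - block.mean()) / (block.std() + 1e-9)

mapped = array.map_blocks(normalize)

out = result.compute()
print(out.shape)  # (100000,)
```

Either way, the scheduler decides when each chunk is loaded and when it can be released, which is exactly the bookkeeping the manual for loop takes away from dask.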
