(Python) Memory-efficient processing of hdf5 stacks

I'm working in Python with a large dataset of images, each 3600x1800. There are ~360000 images in total. Each image is added to the stack one by one, since an initial processing step is run on each image. h5py has proven effective at building the stack image by image without filling the entire memory.

The analysis I am running is calculated on the grid cells, i.e. on 1x1x360000 slices of the stack. Since the analysis of each slice depends on the max and min values within that slice, I think it is necessary to hold the full 360000-long array in memory. I have a fair bit of RAM to work with (~100GB), but not enough to hold the entire 3600x1800x360000 stack in memory at once.

This means I need (or I think I need) a time-efficient way of accessing the 360000-long arrays. While h5py is efficient at adding each image to the stack, it seems that slicing perpendicular to the images is much, much slower (hours or more).

Am I missing an obvious method to slice the data perpendicular to the images?

A quick timing check yields the following:

file = "file/path/to/large/stack.h5"

t0 = time.time()
with h5py.File(file, 'r') as f:
    dat = f['Merged_liqprec'][:,:,1]
    
print('Time = '   str(time.time()- t0))

t1 = time.time()
with h5py.File(file, 'r') as f:   
    dat = f['Merged_liqprec'][500,500,:]
    
print('Time = '   str(time.time()- t1))

Time = 0.0701

Time = multiple hours, server went offline for maintenance while running

CodePudding user response:

You have the right approach. Using NumPy slicing notation to read only the slice of interest keeps the memory footprint small.

However, with a dataset this large, I suspect I/O performance is going to depend on chunking: 1) did you define chunks= when you created the dataset, and 2) if so, what chunk shape did you use? A chunk is the block of data that HDF5 reads or writes as a unit: when any element in a chunk is accessed, the entire chunk is read from disk. As a result, you will not be able to optimize the shape for both writing images (3600x1800x1) and reading stacked slices (1x1x360000). The optimal chunk shape for writing an image is (shape[0], shape[1], 1), and for reading a stacked slice it is (1, 1, shape[2]).
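To make that concrete, here is a minimal sketch of how the chunk shape is declared when the dataset is created. The dataset name and dimensions are taken from your question; the file name is a placeholder and the chunk shapes are only starting points you would need to tune:

import h5py

ny, nx, n = 3600, 1800, 360000   # image size and number of images, from the question

with h5py.File('stack.h5', 'w') as h5f:
    # Write-optimized: one chunk per image; fast to build, slow to read (1,1,n) series.
    # ds = h5f.create_dataset('Merged_liqprec', shape=(ny, nx, n), dtype='f4',
    #                         chunks=(ny, nx, 1))
    # Read-optimized: each (1,1,n) series lives in a single chunk;
    # fast to read a series, slow to write image by image.
    ds = h5f.create_dataset('Merged_liqprec', shape=(ny, nx, n), dtype='f4',
                            chunks=(1, 1, n))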

Tuning the chunk shape is not trivial. The h5py docs recommend a chunk size between 10 KiB and 1 MiB, larger for larger datasets. You could start with chunks=True (h5py picks a shape for you) and see if that helps. Assuming you will create this file once and read it many times, I suggest optimizing for reading; however, creating the file may then take a long time. I wrote a simple example that you can use to "tinker" with the chunk shape on a small file and observe the behavior before you work on the large file. The tables below show the effect of different chunk shapes on two different file sizes. The code follows the tables.

Dataset shape=(36, 18, 360)

chunks         Write (s)   Read (s)
None           0.349       0.102
(36, 18, 1)    0.187       0.436
(1, 1, 360)    2.347       0.095

Dataset shape=(144, 72, 1440)  (4x4x4 larger)

chunks                      Write (s)   Read (s)
None                        59.963      1.248
(9, 9, 180) [chunks=True]   11.334      1.588
(144, 72, 100)              79.844      2.637
(10, 10, 1440)              56.945      1.464

Code below:

import time
import h5py
import numpy as np

f = 4                            # scale factor: 1 for the small test, 4 for the larger one
a0, a1, n = f*36, f*18, f*360    # dataset shape
c0, c1, cn = a0, a1, 100         # user-defined chunk shape (image-oriented)
#c0, c1, cn = 10, 10, n          # alternative chunk shape (series-oriented)

arr = np.random.randint(256, size=(a0, a1))

print(f'Writing dataset shape=({a0},{a1},{n})')
start = time.time()
with h5py.File('SO_73791464.h5', 'w') as h5f:
    # Use chunks=True for the h5py default, or chunks=(c0, c1, cn) for a user-defined shape
    ds = h5f.create_dataset('images', dtype=int, shape=(a0, a1, n), chunks=True)
    print(f'chunks={ds.chunks}')
    for i in range(n):
        ds[:, :, i] = arr
print(f"Time to create file:{(time.time()-start): .3f}")

start = time.time()
with h5py.File('SO_73791464.h5', 'r') as h5f:
    ds = h5f['images']
    # Read every (1, 1, n) series, one [i, j] pair at a time
    for i in range(ds.shape[0]):
        for j in range(ds.shape[1]):
            dat = ds[i, j, :]
print(f"Time to read file:{(time.time()-start): .3f}")

CodePudding user response:

HDF5 chunked storage improves I/O performance, but reading small slices (e.g., [1,1,360000]) can still be very slow. To further improve performance, you need to read larger slices into an array, as described by @Jérôme Richard in the comments below your question. You can then quickly access a single slice from that array (because it is in memory).

This answer combines the 2 techniques: 1) HDF5 chunked storage (from my first answer), and 2) reading large slices into an array and then pulling each [i,j] slice from that array. The code to create the file is very similar to the first answer. It is set up to create a dataset of shape [3600, 1800, 100] with the default chunk shape ([113, 57, 7] on my system). You can increase n to test with larger datasets.

When reading the file, the sizes of the large slice along axes 0 and 1 are set equal to the corresponding chunk dimensions (so each chunk is only accessed once). As a result, the process to read the file is "slightly more complicated" (but worth it). There are two sets of loops: the outer (i, j) loops read a large slice into the array 'dat_chunk', and the inner (ic, jc) loops read a [1,1,:] slice from 'dat_chunk' into a second array 'dat'.

The difference in timing for 100 images is dramatic. It takes 8 seconds to read all of the data with the method below, versus 74 minutes with the first answer (reading each [i,j] pair directly from the file), which is clearly too slow. Just for fun, I increased the dataset size to 1000 images (shape=[3600, 1800, 1000]) in my test file and reran: it takes 4:33 (m:ss) to read all slices with this method. I didn't even try the previous method (for obvious reasons). Note: my computer is pretty old and slow, with an HDD, so your timing should be faster.

Code to create the file:

import time
import h5py
import numpy as np

a0, a1, n = 3600, 1800, 100
print(f'Writing dataset shape=({a0},{a1},{n})')
start = time.time()
with h5py.File('SO_73791464.h5', 'w') as h5f:
    ds = h5f.create_dataset('images', dtype=int, shape=(a0, a1, n), chunks=True)
    print(f'chunks={ds.chunks}')
    for i in range(n):
        arr = np.random.randint(256, size=(a0, a1))
        ds[:, :, i] = arr
print(f"Time to create file:{(time.time()-start): .3f}")

Code to read the file using large array slices:

import time
import h5py

start = time.time()
with h5py.File('SO_73791464.h5', 'r') as h5f:
    ds = h5f['images']
    print(f'shape={ds.shape}')
    print(f'chunks={ds.chunks}')
    ds_i_max, ds_j_max, ds_k_max = ds.shape
    ch_i_max, ch_j_max, ch_k_max = ds.chunks

    i = 0
    while i < ds_i_max:
        i_stop = min(i + ch_i_max, ds_i_max)
        print(f'i_range: {i}:{i_stop}')
        j = 0
        while j < ds_j_max:
            j_stop = min(j + ch_j_max, ds_j_max)
            print(f'  j_range: {j}:{j_stop}')
            # Read one chunk-aligned block (all images for this [i:i_stop, j:j_stop] tile)
            dat_chunk = ds[i:i_stop, j:j_stop, :]
            # Pull each (1, 1, n) series out of the in-memory block
            for ic in range(dat_chunk.shape[0]):
                for jc in range(dat_chunk.shape[1]):
                    dat = dat_chunk[ic, jc, :]
            j = j_stop
        i = i_stop

print(f"Time to read file:{(time.time()-start): .3f}")