I'm reading a file from my S3 bucket in a notebook in SageMaker Studio (same account) using the following code:
import h5py
import s3fs

dataset_path_in_h5 = "/Mode1/SingleFault/SimulationCompleted/IDV2/Mode1_IDVInfo_2_100/Run1/processdata"
s3 = s3fs.S3FileSystem()
# s3url holds the s3:// URL of the .h5 file
h5_file = h5py.File(s3.open(s3url, 'rb'), 'r')
data = h5_file.get(dataset_path_in_h5)
But I don't know what actually happens behind the scenes. Is the whole h5 file transferred? That seems unlikely, as the code executes quite fast while the whole file is 20 GB. Or is only the dataset at dataset_path_in_h5 transferred? I suppose that if the whole file were transferred on each call, it could cost me a lot.
CodePudding user response:
When you open the file, a file object is created. It has a tiny memory footprint, and the dataset values aren't read into memory until you access them. The whole 20 GB file is not downloaded when you open it: s3fs presents the S3 object through a file-like interface and fetches byte ranges as h5py requests them, so only the file metadata plus the data you actually read gets transferred.
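For example, opening the file and listing what's at the root touches only HDF5 metadata, not the dataset values. A small sketch, reusing s3 and s3url from above:

with h5py.File(s3.open(s3url, 'rb'), 'r') as h5_file:
    print(list(h5_file.keys()))   # root group names -- reads metadata only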
Your h5_file.get(dataset_path_in_h5) call returns an h5py dataset object, not a NumPy array, so it doesn't load the dataset's values by itself. (Note: the .get() method isn't needed here; dictionary-style access is the preferred syntax, and it is shown in the example.) To pull the entire dataset into memory as a NumPy array, read it with [()]; that is the step that loads everything.
Frequently the dataset object (which also has a small memory footprint) is all you need: when you work with one, the data is read into memory only as you access it. Dataset objects behave much like NumPy arrays. (Whether to use a dataset object or an array depends on downstream usage; frequently you don't need an array, but sometimes one is required.) Also, if chunked storage was enabled when the dataset was created, the data is read chunk by chunk.
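You can check whether your dataset is chunked by inspecting the dataset object's .chunks attribute. A minimal sketch; the shape in the comment is purely illustrative:

with h5py.File(s3.open(s3url, 'rb'), 'r') as h5_file:
    ds = h5_file[dataset_path_in_h5]
    print(ds.chunks)   # a tuple such as (512, 64) if chunked, None if contiguous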
The differences are shown below. Note that I used Python's file context manager to open the file; it avoids problems if the file isn't closed properly (because you forgot, or the program exited prematurely).
dataset_path_in_h5 = "/Mode1/SingleFault/SimulationCompleted/IDV2/Mode1_IDVInfo_2_100/Run1/processdata"
s3 = s3fs.S3FileSystem()
with h5py.File(s3.open(s3url, 'rb'), 'r') as h5_file:
    # your way -- this returns a dataset object, not an array:
    data = h5_file.get(dataset_path_in_h5)
    # preferred syntax to load the whole dataset as a NumPy array:
    data_arr = h5_file[dataset_path_in_h5][()]
    # preferred syntax for an h5py dataset object (no [()], no data read yet):
    data_ds = h5_file[dataset_path_in_h5]
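If your downstream code only needs part of the data, slicing the dataset object is the cheapest option: h5py reads, and s3fs transfers, only the bytes needed for the requested slice. A short sketch; the assumption that the dataset is 2-D and the slice bounds are mine, purely for illustration:

with h5py.File(s3.open(s3url, 'rb'), 'r') as h5_file:
    data_ds = h5_file[dataset_path_in_h5]
    print(data_ds.shape, data_ds.dtype)   # metadata only, nothing transferred yet
    first_rows = data_ds[:100]            # reads only the first 100 rows
    some_cols = data_ds[:, :5]            # reads only the first 5 columns (assumes 2-D)

This is why your pattern is safe cost-wise: each read transfers roughly the data it asks for, not the whole 20 GB file.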