How do I optimize a for loop for faster results in Python

I've written a piece of code that extracts data from an HDF5 file and saves it into a dataframe that I can export as .csv later. The final dataframe has about 2.5 million rows, and the code takes a long time to execute. Is there any way I can optimize this code so that it runs faster?

Current runtime is 7.98 minutes!

Ideally, I would want to run this program on 48 files like this one, so I need a faster run time.

Link to source file: https://drive.google.com/file/d/1g2fpJHZmD5FflfB4s3BlAoiB5sGISKmg/view

import h5py
import numpy as np
import pandas as pd
#import geopandas as gpd


#%%
f = h5py.File('mer.h5', 'r')

for key in f.keys():
    #print(key) #Names of the root level object names in HDF5 file - can be groups or datasets.
    #print(type(f[key])) # get the object type: usually group or dataset
    ls = list(f.keys())
   



#Get the HDF5 group; key needs to be a group name from above
key ='DHI'
#group = f['OBSERVATION_TIME']

#print("Group")
#print(group)

#for key in ls:
 #data = f.get(key)   
 #dataset1 = np.array(data)

#length=len(dataset1)


masterdf=pd.DataFrame()


data = f.get(key)   
dataset1 = np.array(data)
#masterdf[key]=dataset1


X = f.get('X')
X_1 = pd.DataFrame(X)

Y = f.get('Y')
Y_1 = pd.DataFrame(Y)





#%%


data_df = pd.DataFrame(index=range(len(Y_1)),columns=range(len(X_1)))

for i in data_df.index:
    data_df.iloc[i] = dataset1[0][i]
    
    
#data_df.to_csv("test.csv")


#%%





final = pd.DataFrame(index=range(1616*1616),columns=['X', 'Y','GHI'])


k=0

for y in range(len(Y_1)):
    
    for x in range(len(X_1[:-2])):   #X and Y ranges are not same
        
        final.loc[k,'X'] = X_1[0][x]
        final.loc[k,'Y'] = Y_1[0][y]
        final.loc[k,'GHI'] = data_df.iloc[y,x]
        k=k+1
        # print(k)

CodePudding user response：

We can optimize loops by vectorizing operations. Vectorized code is often one to two orders of magnitude faster than the pure-Python equivalent, especially in numerical computations. Vectorization is something we get with NumPy, a library with efficient data structures designed to hold matrix data.
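As a rough illustration of what vectorizing the nested loop from the question could look like, here is a minimal sketch. The small arrays x, y and values are made-up stand-ins for the real X, Y and DHI data, not values from the actual file. It builds the same (X, Y, value) rows once with a Python loop and once with whole-array NumPy calls.

```python
import numpy as np

# Hypothetical small grid standing in for the real data (assumption).
x = np.array([10.0, 20.0, 30.0])          # column coordinates
y = np.array([1.0, 2.0])                  # row coordinates
values = np.arange(6.0).reshape(2, 3)     # values[row, col]

# Loop version: one Python-level iteration per cell.
rows_loop = []
for yi in range(len(y)):
    for xi in range(len(x)):
        rows_loop.append((x[xi], y[yi], values[yi, xi]))
rows_loop = np.array(rows_loop)

# Vectorized version: build each full column in a single NumPy call.
rows_vec = np.column_stack([
    np.tile(x, len(y)),      # x cycles fastest: 10, 20, 30, 10, 20, 30
    np.repeat(y, len(x)),    # y repeats per row: 1, 1, 1, 2, 2, 2
    values.ravel(),          # row-major flattening matches the loop order
])
```

Both versions produce the same 6 x 3 array. The difference is that the second one does the per-cell work inside NumPy instead of in the Python interpreter, which is where the speedup on a 1616 x 1616 grid would come from.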

CodePudding user response：

Could you please try the following (replace file.h5 with your file):

import pandas as pd
import h5py

with h5py.File("file.h5", "r") as file:
    df_X = pd.DataFrame(file.get("X")[:-2], columns=["X"])
    df_Y = pd.DataFrame(file.get("Y"), columns=["Y"])
    DHI = file.get("DHI")[0][:, :-2].reshape(-1)

final = df_Y.merge(df_X, how="cross").assign(DHI=DHI)[["X", "Y", "DHI"]]

Some explanations:

First, read the data with key X into a dataframe df_X with one column X, leaving out the last 2 data points.
Then read the full data with key Y into a dataframe df_Y with one column Y.
Then get the data with key DHI and take the first element [0] (there are no more). The result is a NumPy array with 2 dimensions, a matrix. Now remove the last two columns ([:, :-2]) and reshape the matrix into a 1-dimensional array, in the order you are looking for (order="C" is the default). The result is the column DHI of your final dataframe.
Finally, take the cross product of df_Y and df_X (y is the outer dimension in your loop) via .merge with how="cross", add the DHI column, and rearrange the columns in the order you want.
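To check that the row order of the cross merge really lines up with the flattened matrix, the steps above can be tried on a tiny in-memory example. This is only a sketch: the arrays X, Y and DHI below are made-up stand-ins for the HDF5 datasets (assumed shapes: X and Y are 1-dimensional, DHI is (1, len(Y), len(X))), and `how="cross"` needs pandas 1.2 or newer.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the HDF5 datasets "X", "Y" and "DHI" (assumption).
X = np.array([100, 200, 300, 400, 500])   # 5 values; the last 2 are dropped
Y = np.array([1, 2])
DHI = np.arange(10).reshape(1, 2, 5)      # shape (1, len(Y), len(X))

df_X = pd.DataFrame(X[:-2], columns=["X"])
df_Y = pd.DataFrame(Y, columns=["Y"])
dhi_flat = DHI[0][:, :-2].reshape(-1)     # drop last 2 columns, flatten row by row

final = df_Y.merge(df_X, how="cross").assign(DHI=dhi_flat)[["X", "Y", "DHI"]]

# Reference rows in the original nested-loop order (y outer, x inner).
expected = [
    (X[x], Y[y], DHI[0][y, x])
    for y in range(len(Y))
    for x in range(len(X) - 2)
]
```

Comparing final against expected row by row should show the same (X, Y, DHI) triples in the same order as the nested loop in the question.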