I am attaching a test Excel file named 'df2.xlsx' here: https://docs.google.com/spreadsheets/d/1U55lXyZSYguiQUH0AOB_v8yhKcbMGNQs/edit?usp=sharing&ouid=102781316443126205856&rtpof=true&sd=true. It has 58673 rows. I have imported it as a dataframe and, using the attached code below, calculated 'D50' via interp1d linear interpolation. D50 is the 50th-percentile value, which is why I need the interpolation. The columns I interpolate over are con13c, con12c, con2c, con3c, ..., con11c, con14c; the column indices of con13c and con14c are 17 and 29. I store the outputs in an empty list using append(). However, the code is slow.
The main Excel/text file will have 4928526 rows instead of the attached test file's 58673, and the D50 calculation on it took more than 20 minutes. Let me know if there is a way to make it faster by reading the dataframe chunk by chunk and running it on multiple processors. The main file will have hundreds of different TS values, with 58673 rows per TS value; in the test file 'df2.xlsx' all the data is for a single TS. Thanks.
import pandas as pd
import numpy as np
from scipy.interpolate import interp1d

dt = pd.read_excel('df2.xlsx', index_col=0)
# check column index
dt.columns.get_loc("con14c")
x = [0.00001, 0.00004675, 0.000088, 0.000177, 0.000354, 0.000707, 0.001414, 0.002828,
     0.005657, 0.011314, 0.022627, 0.045254, 0.6096]
x = np.array(x)
xx = np.log(x)  # interpolate in log space
dfs = []
for i in range(len(dt)):  # loop through the rows of dt
    y1 = dt.iloc[i, 17:30]
    y1 = np.array(y1, dtype=float)  # np.float is removed in recent NumPy; use the builtin float
    f = interp1d(y1, xx, kind='linear', bounds_error=False, fill_value=np.log(y1[0]))  # fill_value='extrapolate'
    x_new = np.exp(f(.5))
    print(x_new)  # x_new is already back in linear space; no second np.exp needed
    dfs.append(x_new)
dt['D50'] = dfs
CodePudding user response:
I ran a test on my PC, and two simple changes cut the run time to about 30% of the original: 9 seconds before, about 2 seconds after.
- Remove the print.
- Do not index the pd.DataFrame inside the loop, as that is extremely slow. Convert it to a numpy array first and index on that:
# outside the for loop
dt_arr = dt.values
# ... rest of the code
y1 = dt_arr[i, 17:30]
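To see how large that gap is, here is a minimal, self-contained sketch (not from the original post; the 10,000-row size and the 17:30 slice are placeholders) that times the two indexing styles on synthetic data:

import timeit
import numpy as np
import pandas as pd

# synthetic stand-in for dt: 10,000 rows x 31 columns (made-up size)
df = pd.DataFrame(np.random.rand(10000, 31))
arr = df.values

# row slicing through .iloc goes through pandas' indexing machinery ...
t_pandas = timeit.timeit(lambda: [df.iloc[i, 17:30] for i in range(len(df))], number=1)
# ... while slicing the ndarray is a cheap view operation
t_numpy = timeit.timeit(lambda: [arr[i, 17:30] for i in range(len(df))], number=1)
print(f"pandas .iloc: {t_pandas:.3f} s, numpy slice: {t_numpy:.3f} s")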
Since you are only interested in 0.5, you can also replace interp1d with a direct linear interpolation around the 0.5 crossing:
dfs1 = []
dfs2 = []
for i in range(len(dt)):  # loop through the rows of dt
    y1 = dt_arr[i, 17:30]
    y1 = np.array(y1, dtype=float)
    # original interp1d version, kept for comparison
    f = interp1d(y1, xx, kind='linear', bounds_error=False, fill_value=np.log(y1[0]))  # fill_value='extrapolate'
    x_new = np.exp(f(.5))
    dfs1.append(x_new)
    # I don't know if your data is sorted; if it is, you can skip this part
    sort_idx = np.argsort(y1)
    xx_sorted = xx[sort_idx]
    y1_sorted = y1[sort_idx]
    # I think your fill value is a bit odd, as you use the same value for both
    # ends. You might want to check that.
    if y1_sorted[-1] < 0.5 or y1_sorted[0] > 0.5:
        dfs2.append(y1[0])  # matches np.exp(np.log(y1[0]))
    else:
        idx = np.argmax(y1_sorted > 0.5)  # first index where y exceeds 0.5
        x0 = xx_sorted[idx - 1]
        x1 = xx_sorted[idx]
        z0 = y1_sorted[idx - 1]
        z1 = y1_sorted[idx]
        # interpolate in log space around the 0.5 crossing, then back to linear space
        dfs2.append(np.exp(x0 + (0.5 - z0) * (x1 - x0) / (z1 - z0)))
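Going one step further, the per-row loop itself can be removed. Here is a minimal vectorized sketch (my own, not guaranteed against every edge case) that computes all D50 values at once with numpy; it mirrors the fallback-to-y1[0] behaviour above and reuses dt_arr and xx from the code above:

def d50_vectorized(y, xx):
    # y: (n_rows, 13) array, each row sorted ascending; xx: log of the x grid
    idx = (y > 0.5).argmax(axis=1)           # first column where the row exceeds 0.5
    in_range = (y[:, 0] <= 0.5) & (y[:, -1] > 0.5)
    idx = np.clip(idx, 1, y.shape[1] - 1)    # keep the indexing below valid everywhere
    rows = np.arange(len(y))
    z0, z1 = y[rows, idx - 1], y[rows, idx]
    x0, x1 = xx[idx - 1], xx[idx]
    d50 = np.exp(x0 + (0.5 - z0) * (x1 - x0) / (z1 - z0))
    return np.where(in_range, d50, y[:, 0])  # same fallback as the fill_value above

y = np.sort(dt_arr[:, 17:30].astype(float), axis=1)  # sort rows only if needed
dt['D50'] = d50_vectorized(y, xx)

As for reading the big file chunk by chunk with multiprocessing: pd.read_excel has no chunksize option, so the practical route is to save the main sheet as CSV/text first and stream it with pd.read_csv. A hedged sketch along those lines ('df_main.csv' and the chunk size are placeholders):

import multiprocessing as mp

def process_chunk(chunk):
    # columns 17:30 hold con13c ... con14c, as in the question
    y = np.sort(chunk.values[:, 17:30].astype(float), axis=1)
    return d50_vectorized(y, xx)

if __name__ == '__main__':
    # on Windows, keep any heavy top-level code under this __main__ guard,
    # since child processes re-import the script
    reader = pd.read_csv('df_main.csv', index_col=0, chunksize=500000)
    with mp.Pool() as pool:
        d50 = np.concatenate(list(pool.imap(process_chunk, reader)))

Whether multiprocessing still pays off after vectorizing depends on the I/O share; with the Python loop gone, reading the file may well dominate the run time.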