I have a dataframe as follows (reproducible data):
import numpy as np
import pandas as pd

np.random.seed(365)
rows = 17000
data = np.random.uniform(20.25, 23.625, size=(rows, 1))
df = pd.DataFrame(data, columns=['Ta'])

# Set index
Epoch_Start = 1636757999
Epoch_End = 1636844395
epochs = np.arange(Epoch_Start, Epoch_End, 5)  # renamed from `time` to avoid shadowing the time module used below
df['Epoch'] = pd.DataFrame(epochs)
df.reset_index(drop=True, inplace=True)
df = df.set_index('Epoch')
Epoch Ta
1636757999 23.427413
1636758004 22.415409
1636758009 22.560560
1636758014 22.236397
1636758019 22.085619
...
1636842974 21.342487
1636842979 20.863043
1636842984 22.582027
1636842989 20.756926
1636842994 21.255536
[17000 rows x 1 columns]
My expected output is a column with the date converted from epoch time to datetime (the 'dates' column in the function's return value), e.g. 2021-11-12 22:59:59.
Here's the code I'm using:
import time

def obt_dat(df):
    df2 = df
    df2['date'] = df.index.values
    df2['date'] = pd.to_datetime(df2['date'], unit='s')
    df2['hour'] = ''
    df2['fecha'] = ''
    df2['dates'] = ''
    start = time.time()
    for i in range(0, len(df2)):
        df2['hour'].iloc[i] = df2['date'].iloc[i].hour
        df2['fecha'].iloc[i] = str(df2['date'].iloc[i].year) + str(df2['date'].iloc[i].month) + str(df2['date'].iloc[i].day)
    df2['dates'] = df2['fecha'].astype(str) + df2['hour'].astype(str)
    end = time.time()
    T = round((end - start) / 60, 2)
    print('Total execution time: ' + str(T) + ' minutes')
    return df2

obt_dat(df)
After that I'm using .groupby to get the mean values for specific hours.
The problem is that the code takes too long to execute. Does anyone have an idea how to shorten the elapsed time of obt_dat()?
CodePudding user response:
You can use the .dt accessor (datetime accessors) to eliminate the loop:
df2 = df.copy()
df2['date'] = df.index.values
df2['date'] = pd.to_datetime(df2['date'], unit='s')
df2['hour'] = df2['date'].dt.hour
df2['fecha'] = df2['date'].dt.strftime('%Y%m%d')
df2['dates'] = df2['date'].dt.strftime('%Y%m%d%H')
Timing with your reproducible example gives:
156 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
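Since the end goal is a groupby on hours, here is a minimal sketch (assuming the same reproducible setup as the question) that skips the intermediate string columns entirely and groups on the timestamps floored to the hour:

```python
import numpy as np
import pandas as pd

# Reproducible data from the question
np.random.seed(365)
df = pd.DataFrame(np.random.uniform(20.25, 23.625, size=(17000, 1)), columns=['Ta'])
df.index = np.arange(1636757999, 1636757999 + 5 * 17000, 5)
df.index.name = 'Epoch'

# Convert the epoch index once, then group by calendar hour
dates = pd.to_datetime(df.index, unit='s')
hourly_mean = df['Ta'].groupby(dates.floor('h')).mean()
print(hourly_mean.head())
```

The floored timestamps keep date and hour together, so there is no need to build 'fecha'/'dates' string keys at all.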
CodePudding user response:
Use plain Python - lists or dicts instead of dataframes.
If you really need a dataframe, construct it at the end, after the CPU-intensive operations.
But that's just my assumption - you might want to do some benchmarking to see how much time each part of the code really takes. "Very long" is relative, but I'm pretty sure your bottleneck is the dataframe operations you do in the for loop.
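A minimal sketch of that idea (using a shortened, hypothetical slice of the question's epoch range): do the per-row work on plain Python objects and construct the DataFrame once at the end:

```python
import pandas as pd

# Hypothetical epoch timestamps (first 100 samples of the question's range)
epochs = range(1636757999, 1636757999 + 5 * 100, 5)

# Per-element work on plain Python objects - no DataFrame indexing inside the loop
rows = []
for e in epochs:
    ts = pd.Timestamp(e, unit='s')
    rows.append({'Epoch': e, 'hour': ts.hour, 'fecha': ts.strftime('%Y%m%d')})

# Build the DataFrame once, after the loop
df2 = pd.DataFrame(rows).set_index('Epoch')
```

Repeated `df['col'].iloc[i] = ...` assignments each go through pandas indexing machinery, which is what makes the original loop slow; appending to a list is cheap by comparison.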