I have a dataframe as follows (reproducible data):
import numpy as np
import pandas as pd

np.random.seed(365)
rows = 17000
data = np.random.uniform(20.25, 23.625, size=(rows, 1))
df = pd.DataFrame(data, columns=['Ta'])

# Set index
Epoch_Start = 1636757999
Epoch_End = 1636844395
epochs = np.arange(Epoch_Start, Epoch_End, 5)  # renamed from `time` to avoid shadowing the time module used below
df['Epoch'] = pd.DataFrame(epochs)
df.reset_index(drop=True, inplace=True)
df = df.set_index('Epoch')
Epoch Ta
1636757999 23.427413
1636758004 22.415409
1636758009 22.560560
1636758014 22.236397
1636758019 22.085619
...
1636842974 21.342487
1636842979 20.863043
1636842984 22.582027
1636842989 20.756926
1636842994 21.255536
[17000 rows x 1 columns]
My expected output is a column with the date converted from epoch time to datetime (the 'dates' column in the function's return value), e.g. 2021-11-12 22:59:59.
Here's the code I'm using:
import time

def obt_dat(df):
    df2 = df
    df2['date'] = df.index.values
    df2['date'] = pd.to_datetime(df2['date'], unit='s')
    df2['hour'] = ''
    df2['fecha'] = ''
    df2['dates'] = ''
    start = time.time()
    for i in range(0, len(df2)):
        df2['hour'].iloc[i] = df2['date'].iloc[i].hour
        df2['fecha'].iloc[i] = str(df2['date'].iloc[i].year) + str(df2['date'].iloc[i].month) + str(df2['date'].iloc[i].day)
    df2['dates'] = df2['fecha'].astype(str) + df2['hour'].astype(str)
    end = time.time()
    T = round((end - start) / 60, 2)
    print('Total execution time: ' + str(T) + ' minutes')
    return df2

obt_dat(df)
After that I'm using .groupby to get the mean values for specific hours.
The problem is that the code takes too long to execute. Does anyone have an idea how to shorten the elapsed time of obt_dat()?
CodePudding user response:
You can use the .dt accessor (datetime accessors) to eliminate the loop:
df2 = df.copy()
df2['date'] = df.index.values
df2['date'] = pd.to_datetime(df2['date'], unit='s')
df2['hour'] = df2['date'].dt.hour
df2['fecha'] = df2['date'].dt.strftime('%Y%m%d')
df2['dates'] = df2['date'].dt.strftime('%Y%m%d%H')
Timing with your reproducible example gives:
156 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
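Since the end goal is a groupby on hours, here is a minimal sketch (assuming the same reproducible setup as the question) that skips the intermediate string columns entirely and groups on the timestamps floored to the hour:

```python
import numpy as np
import pandas as pd

# Reproducible data from the question
np.random.seed(365)
df = pd.DataFrame(np.random.uniform(20.25, 23.625, size=(17000, 1)), columns=['Ta'])
df.index = np.arange(1636757999, 1636757999 + 5 * 17000, 5)
df.index.name = 'Epoch'

# Convert the epoch index once, then group by calendar hour
dates = pd.to_datetime(df.index, unit='s')
hourly_mean = df['Ta'].groupby(dates.floor('h')).mean()
print(hourly_mean.head())
```

The floored timestamps keep date and hour together, so there is no need to build 'fecha'/'dates' string keys at all.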
CodePudding user response:
Use plain Python - lists or dicts instead of dataframes.
If you really need a dataframe, construct it at the end, after the CPU-intensive operations.
But that's just my assumption - you might want to do some benchmarking to see how much time each part of the code really takes. "Very long" is relative, but I'm pretty sure your bottleneck is the dataframe operations you do in the for loop.
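A minimal sketch of that idea (using a shortened, hypothetical slice of the question's epoch range): do the per-row work on plain Python objects and construct the DataFrame once at the end:

```python
import pandas as pd

# Hypothetical epoch timestamps (first 100 samples of the question's range)
epochs = range(1636757999, 1636757999 + 5 * 100, 5)

# Per-element work on plain Python objects - no DataFrame indexing inside the loop
rows = []
for e in epochs:
    ts = pd.Timestamp(e, unit='s')
    rows.append({'Epoch': e, 'hour': ts.hour, 'fecha': ts.strftime('%Y%m%d')})

# Build the DataFrame once, after the loop
df2 = pd.DataFrame(rows).set_index('Epoch')
```

Repeated `df['col'].iloc[i] = ...` assignments each go through pandas indexing machinery, which is what makes the original loop slow; appending to a list is cheap by comparison.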