I am stuck on a performance problem in pandas. The code looks like this:
import pandas as pd

df = pd.DataFrame({"IDs": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "Month": ["01", "02", "01", "01", "02", "01", "01", "02", "01"],
                   "column1": [0.9, 0.5, 0.3, 0.8, 0.5, 0.1, 0.6, 0.2, 0.8]})

df_list = []
for id in df.IDs.unique():
    temp = df[df.IDs == id]                                   # filter one ID at a time
    temp = temp.groupby("Month").mean()                       # per-month mean for that ID
    temp2 = temp['column1'].ewm(span=3, adjust=True).sum()    # EWM sum over the months
    df_list.append(temp2)
Note that df.IDs.unique() contains around 500k elements, and the original dataframe df has around 6 million rows. I used tqdm to check the estimated time, and it comes out to 14-15 hours. Even when the loop body contains only the temp = df[df.IDs == id] line, the estimated time is the same (the rest are pandas functions, so they should not be the bottleneck), so the problem lies in that line.
Is there any alternative way to do this? Thanks for any suggestion.
CodePudding user response:
You can compute the whole thing with a single groupby instead of a Python loop over IDs:

out = (df.groupby(['IDs', 'Month'])
         .mean()['column1']
         .ewm(span=3, adjust=True)
         .sum())
print(out.reset_index())
Output:
IDs Month column1
0 1 01 0.60000
1 1 02 0.80000
2 2 01 0.85000
3 2 02 0.92500
4 3 01 1.16250
5 3 02 0.78125
Note that the version above runs the EWM over the entire aggregated series, so it carries state across ID boundaries. If the EWM should restart for every ID (matching the loop in the question), group by IDs first:

out = (df.groupby('IDs')
         .apply(lambda x: x.groupby('Month')
                           .mean()['column1']
                           .ewm(span=3, adjust=True)
                           .sum())
         .stack())
print(out.reset_index(name='column1'))
Output:
IDs Month column1
0 1 01 0.600
1 1 02 0.800
2 2 01 0.450
3 2 02 0.725
4 3 01 0.700
5 3 02 0.550
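At the 500k-ID scale in the question, the per-ID variant can also be written without apply by using groupby-ewm on the already-aggregated series. This is a sketch, assuming pandas ≥ 1.4 (where ExponentialMovingWindow.sum is available):

```python
import pandas as pd

df = pd.DataFrame({"IDs": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "Month": ["01", "02", "01", "01", "02", "01", "01", "02", "01"],
                   "column1": [0.9, 0.5, 0.3, 0.8, 0.5, 0.1, 0.6, 0.2, 0.8]})

# Per-(ID, Month) means, computed once for all IDs.
means = df.groupby(['IDs', 'Month'])['column1'].mean()

# EWM sum that restarts for every ID; groupby-ewm prepends the group key
# as an extra index level, so drop the duplicate.
out = (means.groupby(level='IDs')
            .ewm(span=3, adjust=True)
            .sum()
            .droplevel(0))
print(out.reset_index())
```

This produces the same per-ID numbers as the second output above; grouping the small aggregated series is typically far cheaper than filtering the 6-million-row frame 500k times.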