Home > Enterprise >  Alternative of for loop on a column subsetting in pandas
Alternative of for loop on a column subsetting in pandas

Time:07-24

I am stuck at a time consumption related problem in pandas:

The code looks like this:

df = pd.DataFrame({"IDs": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "Month": ["01", "02", "01", "01", "02", "01", "01", "02", "01"],
                   "column1": [0.9, 0.5, 0.3, 0.8, 0.5, 0.1, 0.6, 0.2, 0.8]})

df_list = []
for id in df.IDs.unique():
    temp = df[df.IDs == id]
    temp = temp.groupby("Month").mean()
    temp2 = temp['column1'].ewm(span=3, adjust=True).sum()
    df_list.append(temp2)

Note that unique IDs contains around 500k elements, the original dataframe df contains around 6 mil records.

Now I used tqdm to check the estimated time and it's taking 14-15 hrs to complete. If I have even loop only for temp = df[df.IDs == id] line and the estimated time is same (basically those are pandas function, so it should not make any performance issue). So the problem lies in that line.

Is there any alternative way to do this? Thanks for any suggestion.

CodePudding user response:

out = (df.groupby(['IDs', 'Month'])
         .mean()['column1']
         .ewm(span=3, adjust=True)
         .sum())
print(out.reset_index())

Output:

   IDs Month  column1
0    1    01  0.60000
1    1    02  0.80000
2    2    01  0.85000
3    2    02  0.92500
4    3    01  1.16250
5    3    02  0.78125

out = (df.groupby('IDs')
         .apply(lambda x: x.groupby('Month')
                           .mean()['column1']
                           .ewm(span=3, adjust=True)
                           .sum())
         .stack())
print(out.reset_index(name='column1'))

Output:

   IDs Month  column1
0    1    01    0.600
1    1    02    0.800
2    2    01    0.450
3    2    02    0.725
4    3    01    0.700
5    3    02    0.550
  • Related