Hi, how's it going? I have a giant dataframe and am trying to do a groupby, filter, and then count the occurrences of a particular event within each group. The code I have works but doesn't scale at all; it takes forever to run. Can someone help me with a faster way to perform the same computation? Below is what I have so far, reproduced in a dummy example:
import pandas as pd

dates = ['2012-03-30','2012-03-30','2012-03-30','2012-03-30','2012-03-30','2012-03-31','2012-03-31','2012-03-31','2012-03-31','2012-03-31']
person = ['dave','mike','mike','dave','mike','dave','dave','dave','mike','mike']
weather = ['rainy','sunny','cloudy','cloudy','rainy','sunny','cloudy','sunny','cloudy','rainy']
events = ['sneeze','cough','sneeze','sneeze','cough','cough','sneeze','cough','sneeze','sneeze']
df = pd.DataFrame({'date':dates,'person':person,'weather':weather,'event':events})
def sneeze_by_weather(df):
    num_sneeze = df[df['event']=='sneeze'].shape[0]
    if num_sneeze==0:
        return 0
    else:
        return num_sneeze
df_transformed = df.groupby(['date','person','weather']).apply(lambda x: sneeze_by_weather(x)).reset_index()
Is there any way to perform this computation much faster so that it scales when I have millions of rows?
CodePudding user response:
Result of %%timeit on your provided dataframe with my code below:
The slowest run took 4.03 times longer than the fastest. This could mean that an intermediate result is being cached. 1000 loops, best of 5: 495 µs per loop
# use df.loc with both a row and a column indexer, which avoids the
# chained-assignment pattern (df['event'].loc[...] = ...) that triggers
# SettingWithCopyWarning and can silently fail to write through
df.loc[df['event'] == 'sneeze', 'event'] = 1
df.loc[df['event'] != 1, 'event'] = 0
df
| index | date | person | weather | event |
|---|---|---|---|---|
| 0 | 2012-03-30 | dave | rainy | 1 |
| 1 | 2012-03-30 | mike | sunny | 0 |
| 2 | 2012-03-30 | mike | cloudy | 1 |
| 3 | 2012-03-30 | dave | cloudy | 1 |
| 4 | 2012-03-30 | mike | rainy | 0 |
| 5 | 2012-03-31 | dave | sunny | 0 |
| 6 | 2012-03-31 | dave | cloudy | 1 |
| 7 | 2012-03-31 | dave | sunny | 0 |
| 8 | 2012-03-31 | mike | cloudy | 1 |
| 9 | 2012-03-31 | mike | rainy | 1 |
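For reference, a minimal sketch (IPython, assuming the df from the question) of how the timing above can be reproduced. flag_sneeze is an invented helper name; timing a call on a copy keeps repeated runs honest, since the assignment mutates event in place:

def flag_sneeze(frame):
    # hypothetical helper, not part of the answer above: run the replacement
    # on a copy so each timed call starts from the original string values
    out = frame.copy()
    out.loc[out['event'] == 'sneeze', 'event'] = 1
    out.loc[out['event'] != 1, 'event'] = 0
    return out

%timeit flag_sneeze(df)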
CodePudding user response:
import numpy as np
import pandas as pd
dates = ['2012-03-30','2012-03-30','2012-03-30','2012-03-30','2012-03-30','2012-03-31','2012-03-31','2012-03-31','2012-03-31','2012-03-31']
person = ['dave','mike','mike','dave','mike','dave','dave','dave','mike','mike']
weather = ['rainy','sunny','cloudy','cloudy','rainy','sunny','cloudy','sunny','cloudy','rainy']
events = ['sneeze','cough','sneeze','sneeze','cough','cough','sneeze','cough','sneeze','sneeze']
df = pd.DataFrame({'date':dates,'person':person,'weather':weather,'event':events})
# count occurrences of each (date, person, weather, event) combination
df_transformed = (
    df.groupby(['date','person','weather','event'])['event']
      .count()
      .to_frame('count')
      .reset_index()
)
# keep the count where the event is a sneeze, otherwise 0
df_transformed['0'] = np.where(
    (df_transformed['event'] == 'sneeze') & (df_transformed['count'] != 0),
    df_transformed['count'],
    0,
)
df_transformed = df_transformed.drop(labels=['event','count'], axis=1)
df_transformed
CodePudding user response:
This should be faster
idx_cols = ['date','person','weather']
idx = pd.MultiIndex.from_frame(df[idx_cols])
df_transformed = (
    df.loc[df.event == 'sneeze', idx_cols]   # keep only the sneeze rows
    .value_counts()                          # count sneezes per (date, person, weather)
    .reindex(idx, fill_value=0)              # restore the original rows, 0 where no sneezes
    .reset_index()
)
Output:
>>> df_transformed
date person weather 0
0 2012-03-30 dave rainy 1
1 2012-03-30 mike sunny 0
2 2012-03-30 mike cloudy 1
3 2012-03-30 dave cloudy 1
4 2012-03-30 mike rainy 0
5 2012-03-31 dave sunny 0
6 2012-03-31 dave cloudy 1
7 2012-03-31 dave sunny 0
8 2012-03-31 mike cloudy 1
9 2012-03-31 mike rainy 1
Another option is to use merge:
idx_cols = ['date','person','weather']
counts = (
df.loc[df.event == 'sneeze', idx_cols]
.value_counts()
.reset_index()
)
df_transformed = (
df[idx_cols].merge(counts, on=idx_cols, how='left')
.fillna(0)
.astype({0: int}) # convert the type of the new column (labeled 0) to int. It was float due to NaNs
)
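One caveat, offered as a sketch rather than a guarantee: on recent pandas (2.0+), value_counts() names its result 'count', so reset_index() may not label the new column 0 at all. Naming the Series explicitly avoids depending on the version (num_sneeze is an invented label):

idx_cols = ['date','person','weather']
counts = (
    df.loc[df.event == 'sneeze', idx_cols]
    .value_counts()
    .rename('num_sneeze')  # 'num_sneeze' is arbitrary; naming the Series fixes the column label
    .reset_index()
)
df_transformed = (
    df[idx_cols].merge(counts, on=idx_cols, how='left')
    .fillna({'num_sneeze': 0})
    .astype({'num_sneeze': int})
)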
I was overcomplicating things... actually, this way is much more straightforward:
idx_cols = ['date','person','weather']
df_transformed = (
    df.assign(is_sneeze=(df.event == 'sneeze'))   # boolean flag per row
    .groupby(idx_cols, as_index=False)['is_sneeze']
    .sum()                                        # True counts as 1, so the sum is the sneeze count
)
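To sanity-check the scaling claim, a rough benchmarking sketch (IPython, reusing idx_cols from above and sneeze_by_weather from the question; all synthetic column values are invented for illustration). It builds a million-row frame with a few thousand distinct groups, which is where groupby + apply usually hurts, since apply pays Python-level overhead per group:

import numpy as np

# synthetic million-row frame; values invented for illustration only
rng = np.random.default_rng(0)
n = 1_000_000
big = pd.DataFrame({
    'date':    rng.choice(pd.date_range('2012-01-01', periods=365).strftime('%Y-%m-%d'), n),
    'person':  rng.choice(['dave', 'mike', 'sue'], n),
    'weather': rng.choice(['rainy', 'sunny', 'cloudy'], n),
    'event':   rng.choice(['sneeze', 'cough'], n),
})

%timeit big.groupby(idx_cols).apply(sneeze_by_weather)  # original groupby + apply
%timeit big.assign(is_sneeze=(big.event == 'sneeze')).groupby(idx_cols, as_index=False)['is_sneeze'].sum()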