What's an efficient way to groupby, filter and count the occurrence of a particular value within each group?


Hi, how's it going? I have a giant dataframe and am trying to do a groupby, filter, then count the occurrence of a particular event within each group. The code I have works but doesn't scale well at all; it takes forever to run. Can someone help me with a faster way to perform the same computation? Below is what I have so far, reproduced in a dummy example:

import pandas as pd

dates = ['2012-03-30','2012-03-30','2012-03-30','2012-03-30','2012-03-30','2012-03-31','2012-03-31','2012-03-31','2012-03-31','2012-03-31']
person = ['dave','mike','mike','dave','mike','dave','dave','dave','mike','mike']
weather = ['rainy','sunny','cloudy','cloudy','rainy','sunny','cloudy','sunny','cloudy','rainy']
events = ['sneeze','cough','sneeze','sneeze','cough','cough','sneeze','cough','sneeze','sneeze']

df = pd.DataFrame({'date':dates,'person':person,'weather':weather,'event':events}) 

def sneeze_by_weather(df):
    num_sneeze = df[df['event']=='sneeze'].shape[0] 
    if num_sneeze==0:
        return 0
    else:
        return num_sneeze

df_transformed = df.groupby(['date','person','weather']).apply(lambda x: sneeze_by_weather(x)).reset_index()


Is there any way to perform this computation much faster so that it scales when I have millions of rows?

CodePudding user response:

%%timeit for your provided dataframe with my code below:

The slowest run took 4.03 times longer than the fastest. This could mean that an intermediate result is being cached. 1000 loops, best of 5: 495 µs per loop

# flag sneeze rows as 1 and everything else as 0
# (indexing the frame with .loc avoids the chained-assignment warning)
df.loc[df['event'] == 'sneeze', 'event'] = 1
df.loc[df['event'] != 1, 'event'] = 0
df
         date person weather event
0  2012-03-30   dave   rainy     1
1  2012-03-30   mike   sunny     0
2  2012-03-30   mike  cloudy     1
3  2012-03-30   dave  cloudy     1
4  2012-03-30   mike   rainy     0
5  2012-03-31   dave   sunny     0
6  2012-03-31   dave  cloudy     1
7  2012-03-31   dave   sunny     0
8  2012-03-31   mike  cloudy     1
9  2012-03-31   mike   rainy     1
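
Note that the snippet above only turns the event column into 0/1 flags; it does not aggregate per group by itself. If the per-group sneeze counts are still wanted, a follow-up aggregation along these lines could be added (this extra step is my own sketch, not part of the answer above):

# sketch (not from the original answer): the 0/1 flags still sit in an object-dtype
# column after the replacement above, so cast to int before summing per group
df['event'] = df['event'].astype(int)
df_counts = df.groupby(['date', 'person', 'weather'], as_index=False)['event'].sum()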

CodePudding user response:

import numpy as np
import pandas as pd

dates = ['2012-03-30','2012-03-30','2012-03-30','2012-03-30','2012-03-30','2012-03-31','2012-03-31','2012-03-31','2012-03-31','2012-03-31']
person = ['dave','mike','mike','dave','mike','dave','dave','dave','mike','mike']
weather = ['rainy','sunny','cloudy','cloudy','rainy','sunny','cloudy','sunny','cloudy','rainy']
events = ['sneeze','cough','sneeze','sneeze','cough','cough','sneeze','cough','sneeze','sneeze']

df = pd.DataFrame({'date':dates,'person':person,'weather':weather,'event':events}) 

df_transformed = (
    pd.DataFrame(df.groupby(['date','person','weather','event'])['event'].count())
      .rename(columns={'event': 'count'})
      .reset_index()
)
df_transformed['0'] = np.where((df_transformed['event'] == 'sneeze') & (df_transformed['count'] != 0),
                              df_transformed['count'],
                              0)

df_transformed = df_transformed.drop(labels = ['event','count'], axis = 1)

df_transformed

CodePudding user response:

This should be faster

idx_cols = ['date','person','weather']
idx = pd.MultiIndex.from_frame(df[idx_cols])

df_transformed = (
    df.loc[df.event == 'sneeze', idx_cols]
      .value_counts()
      .reindex(idx, fill_value=0)
      .reset_index()
)

Output:

>>> df_transformed

         date person weather  0
0  2012-03-30   dave   rainy  1
1  2012-03-30   mike   sunny  0
2  2012-03-30   mike  cloudy  1
3  2012-03-30   dave  cloudy  1
4  2012-03-30   mike   rainy  0
5  2012-03-31   dave   sunny  0
6  2012-03-31   dave  cloudy  1
7  2012-03-31   dave   sunny  0
8  2012-03-31   mike  cloudy  1
9  2012-03-31   mike   rainy  1
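
If a more descriptive column name than 0 is preferred, the result can be renamed afterwards (the name num_sneeze below is just an illustrative choice, not something from the question):

# optional: give the count column a readable name (the name chosen here is arbitrary)
df_transformed = df_transformed.rename(columns={0: 'num_sneeze'})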

Another option is to use merge:

idx_cols = ['date','person','weather']

counts = (
    df.loc[df.event == 'sneeze', idx_cols]
      .value_counts()
      .reset_index()
)

df_transformed = ( 
    df[idx_cols].merge(counts, on=idx_cols, how='left')
                .fillna(0)
                .astype({0: int})  # convert the type of the new column (labeled 0) to int. It was float due to NaNs 
)
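
One version-dependent detail: on pandas 2.0 and later, value_counts() names the resulting Series count, so after reset_index() the counts column is labeled 'count' rather than 0 and the astype key would need to change accordingly (check the actual column name on your installation):

# on pandas >= 2.0 the counts column from value_counts().reset_index() is named 'count'
df_transformed = (
    df[idx_cols].merge(counts, on=idx_cols, how='left')
                .fillna(0)
                .astype({'count': int})
)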

I was overcomplicating... actually, this way is much more straightforward

idx_cols = ['date','person','weather']

df_transformed = (
    df.assign(is_sneeze=(df.event == 'sneeze'))
      .groupby(idx_cols, as_index=False)['is_sneeze']
      .sum()
)
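
To check how any of these options scale, a rough benchmark on a larger synthetic frame can be run in IPython/Jupyter (the sizes and column values below are made up purely for illustration):

import numpy as np
import pandas as pd

# synthetic data roughly shaped like the question's frame (values are arbitrary)
n = 1_000_000
rng = np.random.default_rng(0)
big = pd.DataFrame({
    'date': rng.choice(pd.date_range('2012-01-01', periods=365).strftime('%Y-%m-%d').to_numpy(), n),
    'person': rng.choice(['dave', 'mike'], n),
    'weather': rng.choice(['rainy', 'sunny', 'cloudy'], n),
    'event': rng.choice(['sneeze', 'cough'], n),
})

# vectorized per-group sneeze count (same idea as the snippet above)
%timeit big.assign(is_sneeze=(big.event == 'sneeze')).groupby(['date', 'person', 'weather'], as_index=False)['is_sneeze'].sum()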