I have some code like so in pandas to get the number of times a person switches restaurants:
Data Extract:
Date        PersonId  RestaurantName
01/01/2021  12        McDonalds
01/02/2021  12        McDonalds
01/02/2021  12        Wendys
01/03/2021  12        Popeyes
01/01/2021  14        Popeyes
01/02/2021  14        McDonalds
01/02/2021  14        Wendys
01/02/2021  14        Popeyes
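(In case it helps, the extract above can be rebuilt with something like this:)
import pandas as pd

df = pd.DataFrame({
    'Date': ['01/01/2021', '01/02/2021', '01/02/2021', '01/03/2021',
             '01/01/2021', '01/02/2021', '01/02/2021', '01/02/2021'],
    'PersonId': [12, 12, 12, 12, 14, 14, 14, 14],
    'RestaurantName': ['McDonalds', 'McDonalds', 'Wendys', 'Popeyes',
                       'Popeyes', 'McDonalds', 'Wendys', 'Popeyes'],
})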
Code that is too slow:
df['SwitchCount'] = df.groupby('PersonId')['RestaurantName'].transform(lambda x: x.shift().ne(x).sum()-1)
How can I optimize this code? My understanding is that transform is computationally expensive. How can I change this code to be faster while keeping the same logic?
CodePudding user response:
Assuming df['Date'] is sorted, I did this and it took a bit less than half the time:
import numpy as np
from sklearn.preprocessing import LabelEncoder

def count(data):
    """Count the number of changes. Fastest thing I tried."""
    return np.nonzero(np.diff(data))[0].size

# encode restaurant names as integers so np.diff can be applied
df['RestaurantId'] = LabelEncoder().fit_transform(df['RestaurantName'])
df.groupby('PersonId')['RestaurantId'].apply(count)
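If you need the result back as a column like your original SwitchCount (I'm assuming that's what you want downstream), you can map the per-person counts back, something like:
switches = df.groupby('PersonId')['RestaurantId'].apply(count)
df['SwitchCount'] = df['PersonId'].map(switches)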
I think you could parallelize this operation with Dask and I daresay it would be faster still, on a large enough dataset.
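Untested, but a rough sketch of what that might look like (reusing the count function and RestaurantId column from above; npartitions is just a placeholder you would tune):
import dask.dataframe as dd

# npartitions is a guess; tune it for your data size and core count
ddf = dd.from_pandas(df, npartitions=8)

# meta tells Dask the name/dtype of the per-group result
switches = (
    ddf.groupby('PersonId')['RestaurantId']
       .apply(count, meta=('RestaurantId', 'int64'))
       .compute()
)
# caveat: the switch count depends on row order within each person,
# so the data needs to stay sorted by Date within each PersonId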
CodePudding user response:
Here is one idea. It uses vectorized comparisons, which generally perform well. I don't have a large dataset to compare against and report the performance, but if you give it a try and respond with timings, that would be appreciated.
Here, each row is compared with the same person's next row, and rows where the restaurant matches are dropped. A person's last row has no next visit to compare against, so those rows are dropped as well.
# next restaurant visited by the same person (NaN for a person's last visit)
nextRestaurant = df.groupby('PersonId')['RestaurantName'].shift(-1)
df['sameRestaurant'] = df['RestaurantName'] == nextRestaurant
# drop rows where the next visit is the same restaurant, plus each person's last visit
df2 = df.drop(df[df['sameRestaurant'] | nextRestaurant.isna()].index)
Next, the remaining rows are grouped by person and counted; after renaming the count column, the result is merged back onto the original dataframe.
df2 = df2.groupby(['PersonId'])['RestaurantName'].count().reset_index()
df2.rename(columns={'RestaurantName': 'ChangeCount'}, inplace=True)
df = df.merge(df2, on=['PersonId'])
Lastly, delete the column introduced earlier:
df = df.drop(columns=['sameRestaurant'])
Full Code
nextRestaurant = df.groupby('PersonId')['RestaurantName'].shift(-1)
df['sameRestaurant'] = df['RestaurantName'] == nextRestaurant
df2 = df.drop(df[df['sameRestaurant'] | nextRestaurant.isna()].index)
df2 = df2.groupby(['PersonId'])['RestaurantName'].count().reset_index()
df2.rename(columns={'RestaurantName': 'ChangeCount'}, inplace=True)
df = df.merge(df2, on=['PersonId'])
df = df.drop(columns=['sameRestaurant'])
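On the sample extract above (sorted by person and date as shown), this should come out to roughly 2 switches for PersonId 12 (McDonalds to Wendys to Popeyes) and 3 for PersonId 14, e.g.:
df[['PersonId', 'ChangeCount']].drop_duplicates()
#    PersonId  ChangeCount
#          12            2
#          14            3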