I have some code like so in pandas to get the number of times a person switches restaurants:
Data Extract:
Date        PersonId  RestaurantName
01/01/2021  12        McDonalds
01/02/2021  12        McDonalds
01/02/2021  12        Wendys
01/03/2021  12        Popeyes
01/01/2021  14        Popeyes
01/02/2021  14        McDonalds
01/02/2021  14        Wendys
01/02/2021  14        Popeyes
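(In case it helps, the extract above can be rebuilt with something like this:)
import pandas as pd

df = pd.DataFrame({
    'Date': ['01/01/2021', '01/02/2021', '01/02/2021', '01/03/2021',
             '01/01/2021', '01/02/2021', '01/02/2021', '01/02/2021'],
    'PersonId': [12, 12, 12, 12, 14, 14, 14, 14],
    'RestaurantName': ['McDonalds', 'McDonalds', 'Wendys', 'Popeyes',
                       'Popeyes', 'McDonalds', 'Wendys', 'Popeyes'],
})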
Code that is too slow:
df['SwitchCount'] = df.groupby('PersonId')['RestaurantName'].transform(lambda x: x.shift().ne(x).sum()-1)
How can I optimize this code? My understanding is that transform is computationally expensive. How can I change this code to be faster while keeping the same logic?
CodePudding user response:
Assuming df['Date'] is sorted, I did this and it took a bit less than half the time:
import numpy as np
from sklearn.preprocessing import LabelEncoder

def count(data):
    """Count the number of changes. Fastest thing I tried."""
    return np.nonzero(np.diff(data))[0].size

# encode restaurant names as integers so np.diff can be applied
df['RestaurantId'] = LabelEncoder().fit_transform(df['RestaurantName'])
df.groupby('PersonId')['RestaurantId'].apply(count)
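If you need the result back as a column like your original SwitchCount (I'm assuming that's what you want downstream), you can map the per-person counts back, something like:
switches = df.groupby('PersonId')['RestaurantId'].apply(count)
df['SwitchCount'] = df['PersonId'].map(switches)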
I think you could parallelize this operation with Dask and I daresay it would be faster still, on a large enough dataset.
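Untested, but a rough sketch of what that might look like (reusing the count function and RestaurantId column from above; npartitions is just a placeholder you would tune):
import dask.dataframe as dd

# npartitions is a guess; tune it for your data size and core count
ddf = dd.from_pandas(df, npartitions=8)

# meta tells Dask the name/dtype of the per-group result
switches = (
    ddf.groupby('PersonId')['RestaurantId']
       .apply(count, meta=('RestaurantId', 'int64'))
       .compute()
)
# caveat: the switch count depends on row order within each person,
# so the data needs to stay sorted by Date within each PersonId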
CodePudding user response:
Here is one idea. It uses vectorized comparisons, which generally perform well. I don't have a large dataset to compare against and report the performance, but if you give it a try and respond with timings, that would be appreciated.
Here, each row is compared with the same person's next row, and rows where the restaurant matches are dropped. A person's last row has no next visit to compare against, so those rows are dropped as well.
# next restaurant visited by the same person (NaN for a person's last visit)
nextRestaurant = df.groupby('PersonId')['RestaurantName'].shift(-1)
df['sameRestaurant'] = df['RestaurantName'] == nextRestaurant
# drop rows where the next visit is the same restaurant, plus each person's last visit
df2 = df.drop(df[df['sameRestaurant'] | nextRestaurant.isna()].index)
Next, the remaining rows are grouped by person and counted; after renaming the count column, the result is merged back onto the original dataframe.
df2 = df2.groupby(['PersonId'])['RestaurantName'].count().reset_index()
df2.rename(columns={'RestaurantName': 'ChangeCount'}, inplace=True)
df = df.merge(df2, on=['PersonId'])
Lastly, delete the column introduced earlier:
df = df.drop(columns=['sameRestaurant'])
Full Code
nextRestaurant = df.groupby('PersonId')['RestaurantName'].shift(-1)
df['sameRestaurant'] = df['RestaurantName'] == nextRestaurant
df2 = df.drop(df[df['sameRestaurant'] | nextRestaurant.isna()].index)
df2 = df2.groupby(['PersonId'])['RestaurantName'].count().reset_index()
df2.rename(columns={'RestaurantName': 'ChangeCount'}, inplace=True)
df = df.merge(df2, on=['PersonId'])
df = df.drop(columns=['sameRestaurant'])
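On the sample extract above (sorted by person and date as shown), this should come out to roughly 2 switches for PersonId 12 (McDonalds to Wendys to Popeyes) and 3 for PersonId 14, e.g.:
df[['PersonId', 'ChangeCount']].drop_duplicates()
#    PersonId  ChangeCount
#          12            2
#          14            3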