Let's say we have the following dataframe:
import pandas as pd

df = pd.DataFrame({
    "ID": [20, 10, 20, 20, 10, 10, 10, 20, 20, 20, 10, 20, 30, 30],
    "Revenue": [90, 89, 80, 95, 99, 59, 70, 95, 78, 85, 82, 71, 78, 88],
    "Chance": [True, True, False, True, False, False, False, True, True, True, False, True, True, False],
    "City": ["Strasbourg", "Koln", "Alger", "Casa", "Mosco", "London", "Montpellier", "Barcelone", "Lyon", "Madrid", "Milan", "NYC", "Torino", "Paris"],
})
The df should be grouped by ID, and two columns should be created as follows:
-'NotAccepted' cities: cities whose 'Revenue' is below 84 OR whose 'Chance' is False.
-'Accepted' cities: the rest.
I need to concatenate the names of all 'Accepted' and 'NotAccepted' cities into these two columns. (Please see the expected output below.)
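For the sample dataframe above, the grouped result should look roughly like this (cities within each cell keep their original row order; this is my reading of the rules, not output I have verified against the missing screenshot):

ID   NotAccepted                      Accepted
10   Mosco,London,Montpellier,Milan   Koln
20   Alger,Lyon,NYC                   Strasbourg,Casa,Barcelone,Madrid
30   Torino,Paris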
My solution is as follows:
df_result = df.groupby('ID').apply(lambda g: pd.Series({
    'NotAccepted': ','.join(g['City'][(g['Chance'] == False) | (g['Revenue'] < 84)]),
    'Accepted': ','.join(g['City'][(g['Chance'] == True) & (g['Revenue'] >= 84)]),
})).reset_index()
My real df is much larger, and this groupby/apply takes a long time. Is there a faster alternative to my solution? Thank you.
CodePudding user response:
You can just build the classification column before the groupby:
import numpy as np

# Classify every row once with a vectorised condition instead of re-filtering inside each group
df['new'] = np.where((df['Chance'] == False) | (df['Revenue'] < 84), 'NotAccepted', 'Accepted')
out = df.groupby(['ID', 'new'])['City'].agg(list).unstack(fill_value=[])
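If you want the comma-joined strings your original code produced rather than lists, the same idea works with a string aggregation; a minimal variation (the empty-string fill is just one possible choice for groups with no matching cities):

out = (df.groupby(['ID', 'new'])['City']
         .agg(','.join)
         .unstack(fill_value=''))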
CodePudding user response:
Use Dask Groupby - I faced a similar issue and got roughly an order of magnitude speedup. It lets you run the work across multiple CPU cores instead of being bound to a single thread.
https://examples.dask.org/dataframes/02-groupby.html
I would imagine there are ways to push this to a GPU or across multiple nodes as well.
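A minimal sketch of what that could look like for your groupby, assuming the frame is split across CPU cores; the number of partitions and the meta hint are choices you would tune for your own data:

import dask.dataframe as dd
import pandas as pd

# Split the pandas frame into partitions that Dask can process in parallel
ddf = dd.from_pandas(df, npartitions=8)

# Same per-group logic as in the question; Dask needs a `meta` hint describing the output columns
result = (ddf.groupby('ID')
             .apply(lambda g: pd.Series({
                 'NotAccepted': ','.join(g['City'][(~g['Chance']) | (g['Revenue'] < 84)]),
                 'Accepted': ','.join(g['City'][g['Chance'] & (g['Revenue'] >= 84)]),
             }),
             meta={'NotAccepted': 'object', 'Accepted': 'object'})
             .compute()
             .reset_index())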