I have the following DataFrame:
import pandas as pd

test_df = pd.DataFrame({
    'Category': {0: 'product-availability address-confirmation input',
                 1: 'registration register-data-confirmation options',
                 2: 'onboarding return-start input',
                 3: 'registration register-data-confirmation input',
                 4: 'decision-tree first-interaction-validation options'},
    'Original_UserId': {0: '[email protected]',
                        1: '[email protected]',
                        2: '[email protected]',
                        3: '[email protected]',
                        4: '[email protected]'}})
Thanks to jezrael, I am applying the following mapping, which follows the logic given in the question "After certain string is found mark every after string as true, pandas":
test_df.groupby('Original_UserId', observed=True)['Category'].apply(lambda s: s.eq('onboarding return-start input').cummax())
which returns the following Series:
pd.Series({0: False, 1: False, 2: True, 3: True, 4: True})
The thing is, when I apply this condition to a larger dataset, it takes quite a while to run. Any clues on how to optimize it?
CodePudding user response:
You can use multiprocessing if your dataframe is large, and avoid having to reach for Dask or PySpark:
import multiprocessing as mp
import pandas as pd

def f(s):
    # flag the target row, then propagate True to every later row in the group
    return s.eq('onboarding return-start input').cummax()

if __name__ == '__main__':
    # test_df = ...
    with mp.Pool(mp.cpu_count()) as pool:
        groups = test_df.groupby('Original_UserId', observed=True)['Category']
        data = pool.map(f, [g for _, g in groups])
    # concat returns rows in group order; sort_index restores the original order
    s = pd.concat(data).sort_index()
Output:
>>> s
0 False
1 False
2 True
3 True
4 True
Name: Category, dtype: bool
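An equivalent variant (my sketch, not part of the original answer) uses the standard-library concurrent.futures instead of multiprocessing directly; ProcessPoolExecutor manages the worker pool in the same way:

from concurrent.futures import ProcessPoolExecutor

import pandas as pd

if __name__ == '__main__':
    groups = test_df.groupby('Original_UserId', observed=True)['Category']
    with ProcessPoolExecutor() as pool:
        # each group's Series is pickled and sent to a worker process
        data = list(pool.map(f, (g for _, g in groups)))
    s = pd.concat(data).sort_index()

Either way, note that spawning workers and pickling each group has a fixed overhead, so this only pays off with many sufficiently large groups.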
CodePudding user response:
First compare the column Category, and then use GroupBy.cummax per the column Original_UserId:
s = (test_df['Category'].eq('onboarding return-start input')
       .groupby(test_df['Original_UserId'], observed=True)
       .cummax())
print(s)
0 False
1 False
2 True
3 True
4 True
Name: Category, dtype: bool
Another idea is to create a helper column:
s = (test_df.assign(tmp=test_df['Category'].eq('onboarding return-start input'))
            .groupby('Original_UserId', observed=True)['tmp']
            .cummax())
print(s)
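which prints the same boolean series, now named after the helper column:
0    False
1    False
2     True
3     True
4     True
Name: tmp, dtype: bool
If you want to check the speed-up on your own machine, a minimal benchmark sketch is below; the frame size, number of users, and category strings are made up for illustration. The vectorized versions are typically much faster than the per-group apply, because the comparison and the cummax run once over the whole column instead of calling a Python lambda for every group.

import timeit

import numpy as np
import pandas as pd

# synthetic data (sizes and strings are illustrative only)
rng = np.random.default_rng(0)
n = 1_000_000
big = pd.DataFrame({
    'Category': rng.choice(['onboarding return-start input', 'other category'], size=n),
    'Original_UserId': rng.integers(0, 10_000, size=n).astype(str),
})
target = 'onboarding return-start input'

def with_apply():
    # the original per-group lambda from the question
    return big.groupby('Original_UserId')['Category'].apply(lambda s: s.eq(target).cummax())

def vectorized():
    # compare once over the whole column, then cummax per group
    return big['Category'].eq(target).groupby(big['Original_UserId']).cummax()

print('apply:     ', timeit.timeit(with_apply, number=3))
print('vectorized:', timeit.timeit(vectorized, number=3))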