I have the following DataFrame:
import pandas as pd

test_df = pd.DataFrame({
    'Category': {0: 'product-availability address-confirmation input',
                 1: 'registration register-data-confirmation options',
                 2: 'onboarding return-start input',
                 3: 'registration register-data-confirmation input',
                 4: 'decision-tree first-interaction-validation options'},
    'Original_UserId': {0: '[email protected]',
                        1: '[email protected]',
                        2: '[email protected]',
                        3: '[email protected]',
                        4: '[email protected]'}})
Thanks to jezrael, I am applying the following mapping, which follows the logic given in the question "After certain string is found mark every after string as true, pandas":
test_df.groupby('Original_UserId', observed=True)['Category'].apply(lambda s: s.eq('onboarding return-start input').cummax())
which returns the following Series:
pd.Series({0: False, 1: False, 2: True, 3: True, 4: True})
The thing is, when I apply this condition to a larger dataset, it takes quite a while to run. Any clues on how to optimize it?
CodePudding user response:
You can use multiprocessing if your dataframe is large, and avoid having to reach for Dask or PySpark:
import multiprocessing as mp
import pandas as pd

def f(s):
    # flag the target row, then propagate True to every later row in the group
    return s.eq('onboarding return-start input').cummax()

if __name__ == '__main__':
    # test_df = ...
    with mp.Pool(mp.cpu_count()) as pool:
        groups = test_df.groupby('Original_UserId', observed=True)['Category']
        data = pool.map(f, [g for _, g in groups])
    # concat returns rows in group order; sort_index restores the original order
    s = pd.concat(data).sort_index()
Output:
>>> s
0 False
1 False
2 True
3 True
4 True
Name: Category, dtype: bool
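An equivalent variant (my sketch, not part of the original answer) uses the standard-library concurrent.futures instead of multiprocessing directly; ProcessPoolExecutor manages the worker pool in the same way:

from concurrent.futures import ProcessPoolExecutor

import pandas as pd

if __name__ == '__main__':
    groups = test_df.groupby('Original_UserId', observed=True)['Category']
    with ProcessPoolExecutor() as pool:
        # each group's Series is pickled and sent to a worker process
        data = list(pool.map(f, (g for _, g in groups)))
    s = pd.concat(data).sort_index()

Either way, note that spawning workers and pickling each group has a fixed overhead, so this only pays off with many sufficiently large groups.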
CodePudding user response:
First compare the column Category, and then use GroupBy.cummax per the column Original_UserId:
s = (test_df['Category'].eq('onboarding return-start input')
       .groupby(test_df['Original_UserId'], observed=True)
       .cummax())
print(s)
0 False
1 False
2 True
3 True
4 True
Name: Category, dtype: bool
Another idea is to create a helper column:
s = (test_df.assign(tmp=test_df['Category'].eq('onboarding return-start input'))
            .groupby('Original_UserId', observed=True)['tmp']
            .cummax())
print(s)
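which prints the same boolean series, now named after the helper column:
0    False
1    False
2     True
3     True
4     True
Name: tmp, dtype: bool
If you want to check the speed-up on your own machine, a minimal benchmark sketch is below; the frame size, number of users, and category strings are made up for illustration. The vectorized versions are typically much faster than the per-group apply, because the comparison and the cummax run once over the whole column instead of calling a Python lambda for every group.

import timeit

import numpy as np
import pandas as pd

# synthetic data (sizes and strings are illustrative only)
rng = np.random.default_rng(0)
n = 1_000_000
big = pd.DataFrame({
    'Category': rng.choice(['onboarding return-start input', 'other category'], size=n),
    'Original_UserId': rng.integers(0, 10_000, size=n).astype(str),
})
target = 'onboarding return-start input'

def with_apply():
    # the original per-group lambda from the question
    return big.groupby('Original_UserId')['Category'].apply(lambda s: s.eq(target).cummax())

def vectorized():
    # compare once over the whole column, then cummax per group
    return big['Category'].eq(target).groupby(big['Original_UserId']).cummax()

print('apply:     ', timeit.timeit(with_apply, number=3))
print('vectorized:', timeit.timeit(vectorized, number=3))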