Pandas create counter column for group but reset count based on multiple conditions

Time:02-27

I have the following DataFrame:

Worker  dt_diff          same_employer  same_role
1754    0 days 00:00:00  False          False
2951    0 days 00:00:00  False          False
2951    1 days 00:00:00  True           True
2951    1 days 01:00:00  True           True
3368    0 days 00:00:00  False          False
3368    7 days 00:00:00  True           True
3368    7 days 00:00:00  True           True
3368    7 days 00:00:00  True           True
3368    7 days 00:00:00  True           True
3368    7 days 00:00:00  True           True
3539    0 days 00:00:00  False          False
3539    1 days 00:00:00  True           True
3539    1 days 00:00:00  True           True
3539    3 days 00:30:00  False          False
3539    1 days 00:00:00  True           True
3539    2 days 06:00:00  False          True

I would like to create a new column containing a continuity counter, grouped by worker. The counter should reset based on the following conditions:

if (dt_diff > 6 days) or (same_employer == False) or (same_role == False) then reset the counter

So for the above DataFrame I would expect the result below:

Worker  Counter
1754    1
2951    3
3368    1
3539    3

CodePudding user response:

Your description is not fully explicit, but IIUC, you want the length of the last continuous run for each worker.

For this you can use boolean masks and groupby. Use cummin on the reversed boolean Series to keep only the rows after the last False, then add 1 so that the resetting row itself is counted.

s = df['dt_diff'].lt('6d') & (df['same_employer'] | df['same_role'])

out = s.groupby(df['Worker']).apply(lambda x: x[::-1].cummin().sum() + 1)

Output:

Worker
1754    1
2951    3
3368    1
3539    3
dtype: int64
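
To make the above easier to try, here is a minimal self-contained sketch that rebuilds the sample data from the question and applies the same reversed-cummin idea. The names `keeps` and `trailing_run` are hypothetical helpers introduced only for this illustration, and the comparison uses `pd.Timedelta(days=6)` instead of the string `'6d'`, which is assumed to be equivalent here.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Worker": [1754, 2951, 2951, 2951,
               3368, 3368, 3368, 3368, 3368, 3368,
               3539, 3539, 3539, 3539, 3539, 3539],
    "dt_diff": pd.to_timedelta([
        "0 days", "0 days", "1 days", "1 days 01:00:00",
        "0 days", "7 days", "7 days", "7 days", "7 days", "7 days",
        "0 days", "1 days", "1 days", "3 days 00:30:00", "1 days",
        "2 days 06:00:00",
    ]),
    "same_employer": [False, False, True, True,
                      False, True, True, True, True, True,
                      False, True, True, False, True, False],
    "same_role": [False, False, True, True,
                  False, True, True, True, True, True,
                  False, True, True, False, True, True],
})

# True where the row continues the current run. As in the answer, a row
# continues the run if EITHER flag is True (this is what yields 3 for 3539).
keeps = (df["dt_diff"] < pd.Timedelta(days=6)) & (
    df["same_employer"] | df["same_role"]
)

def trailing_run(flags):
    """Count the trailing True rows of one group, plus the row that reset."""
    # On the reversed flags, cummin stays True until the first False appears.
    return int(flags[::-1].cummin().sum()) + 1

out = keeps.groupby(df["Worker"]).apply(trailing_run)
print(out)
```

Note that the `|` between the two flags is looser than the rule written in the question, which resets when either flag is False; it is kept here only because it reproduces the expected output the question lists.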

CodePudding user response:

I would expect the counter for worker 3539 to be 1, because the last row should have reset it.

Your condition:

s =  ~((df['dt_diff'].dt.days > 6) | (df['same_employer'] == False) | (df['same_role'] == False))

The key is to count upward from each worker's last row until reaching the first row that does not satisfy your condition. We can create a mask for that with:

y = s[::-1].groupby(df['Worker']).cumprod()

Then we sum over the mask and add 1 at the end:

print(y.groupby(df['Worker']).sum() + 1)

Worker
1754    1
2951    3
3368    1
3539    1
dtype: int64
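
For anyone who wants to run this end to end, the steps above can be sketched as follows. This is only an illustration under a few assumptions: the sample data is retyped from the question, the names `reset`, `ok`, `tail` and `counter` are made up for this sketch, and the time test compares against `pd.Timedelta(days=6)` rather than `.dt.days > 6`. The two tests agree on this data but would differ for a value such as 6 days 12 hours.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Worker": [1754, 2951, 2951, 2951,
               3368, 3368, 3368, 3368, 3368, 3368,
               3539, 3539, 3539, 3539, 3539, 3539],
    "dt_diff": pd.to_timedelta([
        "0 days", "0 days", "1 days", "1 days 01:00:00",
        "0 days", "7 days", "7 days", "7 days", "7 days", "7 days",
        "0 days", "1 days", "1 days", "3 days 00:30:00", "1 days",
        "2 days 06:00:00",
    ]),
    "same_employer": [False, False, True, True,
                      False, True, True, True, True, True,
                      False, True, True, False, True, False],
    "same_role": [False, False, True, True,
                  False, True, True, True, True, True,
                  False, True, True, False, True, True],
})

# Reset rule exactly as stated in the question: any one condition resets.
reset = (
    (df["dt_diff"] > pd.Timedelta(days=6))
    | ~df["same_employer"]
    | ~df["same_role"]
)

# 1 where the row continues the run, 0 where it resets.
ok = (~reset).astype(int)

# Reversed cumulative product per worker: stays 1 from the bottom row
# upward until the first resetting row, then 0 for everything above it.
tail = ok[::-1].groupby(df["Worker"]).cumprod()

# Trailing run length, plus 1 for the resetting row that starts the run.
counter = tail.groupby(df["Worker"]).sum() + 1
print(counter)
```

Passing `df["Worker"]` as the grouping key works even though `ok[::-1]` is in reversed order, because pandas aligns the key to the grouped Series by index label.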