Home > Mobile >  populating binary column based on historic values within group Python
populating binary column based on historic values within group Python

Time:08-30

I am working on populating a binary column based on status values a certain product ID traveled through during its lifetime. Hence, I first sorted my frame by 'id' and 'date'. Now, I would need to code something like:

"If ID currently in status 4 or 6 AND ever been before in status 2 or 3, then frame_ordered['binary'] = 1, else frame_ordered['binary'] = 0"

Accordingly, ID 4556 would be assigned the value 1, and the rest gets 0. I tried working this out with .isin(), but this is rather simplified version, and doesn't yield the correct result. It lacks the time "traveling part": checking the latest status, and all the values beforehand within a given ID. Does anyone have a better idea how to solve this? Would be much appreciated.

today =  datetime.today().strftime('%Y-%m-%d')
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248, 4556, 4556],'status': [1,2,4,5,6, 3, 4], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03', '2022-05-03', '2022-07-01']})
frame_ordered = frame.sort_values(['id','date'], ascending=True)
# mask where the next row has same id as current one
mask = frame_ordered['id'] == frame_ordered['id'].shift(-1)
frame_ordered['binary'] = np.where(mask & frame_ordered['id'].isin([4,6]) & frame_ordered['id'].isin([2, 3]), 1, 0)
frame_ordered

CodePudding user response:

The "ever been before in status 2 or 3" condition can be coded using cumsum counting all previous (and the current) occurrences of 2 or 3 in the group.

import pandas as pd

frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248, 4556, 4556, 2345],'status': [1,2,4,5,6, 3, 4, 2], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03', '2022-05-03', '2022-07-01', '2022-05-10']})

frame_ordered = frame.sort_values(['id','date'])
frame_ordered['binary'] = frame_ordered.groupby('id').status.transform(lambda x: (x.isin([4, 6]) & x.isin([2,3]).cumsum().gt(0)).astype(int))

Result:

     id  status        date  binary
0  1245       1  2022-07-01       0
4  1248       6  2022-01-03       0
2  2345       4  2022-04-20       0
7  2345       2  2022-05-10       0
3  4556       5  2022-02-02       0
1  4556       2  2022-03-12       0
5  4556       3  2022-05-03       0
6  4556       4  2022-07-01       1

(please note that I added another row for id 2345 where the 4 is followed (instead for preceded) by a 2 and hence should not be marked with a 1)

CodePudding user response:

First, check whether the latest entry for status for each id is in [4, 6].

current_state_bool = frame_ordered.drop_duplicates('id', keep='last').set_index('id').status.isin([4, 6])

Second, check whether the list of the status for each id contains an element from [2, 3].

group = frame_ordered.groupby('id').status.unique()
previous_state_bool = pd.Series([(2 or 3) in entry for entry in group.values], index=group.index)

Then, check whether both conditions are fulfilled.

combined_bool = pd.Series(current_state_bool & previous_state_bool, name='binary').astype('int')

Finally, merge frame_ordered and combined_bool.

frame_ordered = pd.merge(frame_ordered.drop('binary', axis=1), combined_bool, on='id', how='left')
  • Related