I am working on populating a binary column based on status values a certain product ID traveled through during its lifetime. Hence, I first sorted my frame by 'id' and 'date'. Now, I would need to code something like:
"If ID currently in status 4 or 6 AND ever been before in status 2 or 3, then frame_ordered['binary'] = 1, else frame_ordered['binary'] = 0"
Accordingly, ID 4556 would be assigned the value 1, and the rest gets 0. I tried working this out with .isin(), but this is rather simplified version, and doesn't yield the correct result. It lacks the time "traveling part": checking the latest status, and all the values beforehand within a given ID. Does anyone have a better idea how to solve this? Would be much appreciated.
today = datetime.today().strftime('%Y-%m-%d')
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248, 4556, 4556],'status': [1,2,4,5,6, 3, 4], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03', '2022-05-03', '2022-07-01']})
frame_ordered = frame.sort_values(['id','date'], ascending=True)
# mask where the next row has same id as current one
mask = frame_ordered['id'] == frame_ordered['id'].shift(-1)
frame_ordered['binary'] = np.where(mask & frame_ordered['id'].isin([4,6]) & frame_ordered['id'].isin([2, 3]), 1, 0)
frame_ordered
CodePudding user response:
The "ever been before in status 2 or 3" condition can be coded using cumsum
counting all previous (and the current) occurrences of 2
or 3
in the group.
import pandas as pd
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248, 4556, 4556, 2345],'status': [1,2,4,5,6, 3, 4, 2], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03', '2022-05-03', '2022-07-01', '2022-05-10']})
frame_ordered = frame.sort_values(['id','date'])
frame_ordered['binary'] = frame_ordered.groupby('id').status.transform(lambda x: (x.isin([4, 6]) & x.isin([2,3]).cumsum().gt(0)).astype(int))
Result:
id status date binary
0 1245 1 2022-07-01 0
4 1248 6 2022-01-03 0
2 2345 4 2022-04-20 0
7 2345 2 2022-05-10 0
3 4556 5 2022-02-02 0
1 4556 2 2022-03-12 0
5 4556 3 2022-05-03 0
6 4556 4 2022-07-01 1
(please note that I added another row for id 2345 where the 4
is followed (instead for preceded) by a 2
and hence should not be marked with a 1
)
CodePudding user response:
First, check whether the latest entry for status
for each id
is in [4, 6]
.
current_state_bool = frame_ordered.drop_duplicates('id', keep='last').set_index('id').status.isin([4, 6])
Second, check whether the list of the status
for each id
contains an element from [2, 3]
.
group = frame_ordered.groupby('id').status.unique()
previous_state_bool = pd.Series([(2 or 3) in entry for entry in group.values], index=group.index)
Then, check whether both conditions are fulfilled.
combined_bool = pd.Series(current_state_bool & previous_state_bool, name='binary').astype('int')
Finally, merge frame_ordered
and combined_bool
.
frame_ordered = pd.merge(frame_ordered.drop('binary', axis=1), combined_bool, on='id', how='left')