Consider the following data set:
How can I generate the expected column, ExpectedGroup, so that all rows of a run of True values in case_id share the same value, and that value increments by 1 whenever a False in case_id breaks the run?
import numpy as np
import pandas as pd

df = pd.DataFrame([
['A', 'P', 'O', 2, np.nan],
['A', 'O', 'O', 5, 1],
['A', 'O', 'O', 10, 1],
['A', 'O', 'P', 4, np.nan],
['A', 'P', 'P', 300, np.nan],
['A', 'P', 'O', 2, np.nan],
['A', 'O', 'O', 5, 2],
['A', 'O', 'O', 10, 2],
['A', 'O', 'P', 4, np.nan],
['A', 'P', 'P', 300, np.nan],
['B', 'P', 'O', 2, np.nan],
['B', 'O', 'O', 5, 3],
['B', 'O', 'O', 10, 3],
['B', 'O', 'P', 4, np.nan],
['B', 'P', 'P', 300, np.nan],
],
columns = ['ID', 'FromState', 'ToState', 'Hours', 'ExpectedGroup'])
# create boolean mask
df['case_id'] = ( (df.FromState == 'O') & (df.ToState == 'O') )
0 False
1 True
2 True
3 False
4 False
5 False
6 True
7 True
8 False
9 False
10 False
11 True
12 True
13 False
14 False
Name: case_id, dtype: bool
# but how to get incrementing groups?
np.where(df['case_id'] != False, df['case_id'].cumsum(), np.nan)
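For reference, that attempt numbers every True row instead of every run of True rows; on this data it gives

array([nan, 1., 2., nan, nan, nan, 3., 4., nan, nan, nan, 5., 6., nan, nan])

so the two rows of the first run end up with 1 and 2 rather than sharing a single group id.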
CodePudding user response:
You can use diff to select only the first item of each stretch of True:
df['ExpectedGroup'] = (df['case_id'].diff() & df['case_id']).cumsum().where(df['case_id'])
If you don't want the intermediate column:
s = (df.FromState == 'O') & (df.ToState == 'O')
# or
# s = df[['FromState', 'ToState']].eq('O').all(axis=1)
df['ExpectedGroup'] = (s.diff()&s).cumsum().where(s)
# or
# df.loc[s, 'ExpectedGroup'] = (s.diff()&s).cumsum()
Output:
    ID  FromState  ToState  Hours  ExpectedGroup  case_id
0    A          P        O      2            NaN    False
1    A          O        O      5            1.0     True
2    A          O        O     10            1.0     True
3    A          O        P      4            NaN    False
4    A          P        P    300            NaN    False
5    A          P        O      2            NaN    False
6    A          O        O      5            2.0     True
7    A          O        O     10            2.0     True
8    A          O        P      4            NaN    False
9    A          P        P    300            NaN    False
10   B          P        O      2            NaN    False
11   B          O        O      5            3.0     True
12   B          O        O     10            3.0     True
13   B          O        P      4            NaN    False
14   B          P        P    300            NaN    False
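A sketch of the intermediate pieces (run_start is just an illustrative name, not part of the answer above):

# the mask, as above
s = (df.FromState == 'O') & (df.ToState == 'O')

# on a boolean Series, diff() uses XOR, so it is True wherever the mask flips;
# ANDing with s keeps only the flips into True, i.e. the first row of each
# True run (rows 1, 6 and 11 here)
run_start = s.diff() & s

# the running count of run starts gives 1, 2, 3, ... and where(s) blanks
# every row outside the runs
df['ExpectedGroup'] = run_start.cumsum().where(s)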
CodePudding user response:
Let's use cumsum to create a counter, then re-encode the counter with factorize:
m = df['case_id']
df.loc[m, 'ExpectedGroup'] = (~m).cumsum()[m].factorize()[0] + 1
    ID  FromState  ToState  Hours  ExpectedGroup  case_id
0    A          P        O      2            NaN    False
1    A          O        O      5            1.0     True
2    A          O        O     10            1.0     True
3    A          O        P      4            NaN    False
4    A          P        P    300            NaN    False
5    A          P        O      2            NaN    False
6    A          O        O      5            2.0     True
7    A          O        O     10            2.0     True
8    A          O        P      4            NaN    False
9    A          P        P    300            NaN    False
10   B          P        O      2            NaN    False
11   B          O        O      5            3.0     True
12   B          O        O     10            3.0     True
13   B          O        P      4            NaN    False
14   B          P        P    300            NaN    False
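Broken into steps (counter is just an illustrative name):

m = df['case_id']

# (~m).cumsum() increases on every False row, so all rows of a single True
# run share one counter value (1, 4 and 7 for the three runs in this data)
counter = (~m).cumsum()[m]

# factorize() re-encodes those distinct values as 0, 1, 2, ...; adding 1
# gives the 1-based group numbers
df.loc[m, 'ExpectedGroup'] = counter.factorize()[0] + 1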
CodePudding user response:
Similar to mozway's brilliant approach:
df['ExpectedGroup'] = (df['case_id'].shift(-1) & df['case_id']).cumsum().mask(~df['case_id'])
df
    ID  FromState  ToState  Hours  ExpectedGroup  case_id
0    A          P        O      2            NaN    False
1    A          O        O      5            1.0     True
2    A          O        O     10            1.0     True
3    A          O        P      4            NaN    False
4    A          P        P    300            NaN    False
5    A          P        O      2            NaN    False
6    A          O        O      5            2.0     True
7    A          O        O     10            2.0     True
8    A          O        P      4            NaN    False
9    A          P        P    300            NaN    False
10   B          P        O      2            NaN    False
11   B          O        O      5            3.0     True
12   B          O        O     10            3.0     True
13   B          O        P      4            NaN    False
14   B          P        P    300            NaN    False
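Step by step (flags is just an illustrative name): m.shift(-1) & m is True on a row whose next row is also True, i.e. every row of a True run except its last. Each run in this data is two rows long, so that is exactly one flag per run and the running total gives 1, 2, 3.

m = df['case_id']

# True where the next row is still inside the same True run (rows 1, 6, 11)
flags = m.shift(-1) & m

# one flag per run here, so cumsum yields 1, 2, 3; mask(~m) blanks the rest
df['ExpectedGroup'] = flags.cumsum().mask(~m)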