I am trying to create a column 'count' on a pandas DF that cumulatively counts when field 'boolean' is True but resets and stays at 0 when 'boolean' is False. Also needs to be grouped by the ID column, so the count resets when looking at a new ID. No loops please as working with a big data set
Used the code from the following question which works but need to add a group by to include the ID column grouping
Pandas Dataframe - Row Iteration with Resetting Count-Value by Condition without loop
Expected output below: (ID, Boolean columns already exist, just need to create Count)
ID Boolean Count
1 True 1
1 True 2
1 True 3
1 True 4
1 True 5
1 False 0
1 False 0
1 False 0
1 False 0
1 True 1
1 True 2
1 True 3
2 True 1
2 True 2
2 True 3
2 True 4
2 False 0
2 False 0
2 False 0
2 True 1
2 True 2
2 True 3
CodePudding user response:
Identify blocks by using cumsum
on inverted boolean mask, then group the dataframe by ID
and blocks and use cumsum
on Boolean
to create a counter
b = (~df['Boolean']).cumsum()
df['Count'] = df.groupby(['ID', b])['Boolean'].cumsum()
ID Boolean Count
0 1 True 1
1 1 True 2
2 1 True 3
3 1 True 4
4 1 True 5
5 1 False 0
6 1 False 0
7 1 False 0
8 1 False 0
9 1 True 1
10 1 True 2
11 1 True 3
12 2 True 1
13 2 True 2
14 2 True 3
15 2 True 4
16 2 False 0
17 2 False 0
18 2 False 0
19 2 True 1
20 2 True 2
21 2 True 3
CodePudding user response:
df['Count'] = df.groupby('ID')['Boolean'].diff()
df = df.fillna(False)
df['Count'] = df.groupby('ID')['Count'].cumsum()
df['Count'] = df.groupby(['ID', 'Count'])['Boolean'].cumsum()
df
ID Boolean Count
0 1 True 1
1 1 True 2
2 1 True 3
3 1 True 4
4 1 True 5
5 1 False 0
6 1 False 0
7 1 False 0
8 1 False 0
9 1 True 1
10 1 True 2
11 1 True 3
12 2 True 1
13 2 True 2
14 2 True 3
15 2 True 4
16 2 False 0
17 2 False 0
18 2 False 0
19 2 True 1
20 2 True 2
21 2 True 3
CodePudding user response:
You can use a column shift for ID
and Boolean
columns to identify the groups to do the groupby
on. Then do a cumsum
for each of those groups.
groups = ((df['ID']!=df['ID'].shift()) | (df['Boolean']!=df['Boolean'].shift())).cumsum()
df.assign(Count2=df.groupby(groups)['Boolean'].cumsum())
Result
ID Boolean Count Count2
0 1 True 1 1
1 1 True 2 2
2 1 True 3 3
3 1 True 4 4
4 1 True 5 5
5 1 False 0 0
6 1 False 0 0
7 1 False 0 0
8 1 False 0 0
9 1 True 1 1
10 1 True 2 2
11 1 True 3 3
12 2 True 1 1
13 2 True 2 2
14 2 True 3 3
15 2 True 4 4
16 2 False 0 0
17 2 False 0 0
18 2 False 0 0
19 2 True 1 1
20 2 True 2 2
21 2 True 3 3