Dataframe:
d = {'id': [1,1,1,1,1,1,2,2,2,2,3], 'log_date' : ['2021-01-05 07:23:00', '2021-01-05 07:24:00', '2021-01-05 07:25:00', '2021-01-05 07:26:00', '2021-01-05 07:27:00', '2021-01-05 07:28:00', '2021-01-06 07:23:00', '2021-01-06 07:24:00', '2021-01-06 07:25:00', '2021-01-06 07:23:00', '2021-01-05 07:20:00'], 'is_bool': [False,True,True,True,True,True,False,True,True,True,True ]}
df = pd.DataFrame(data=d)
df
Goal / Output: Filtered dataframe with unique ID and few other columns which have 4 or more consecutive value of True
| id | log_date | max_count |
|----|------------|-----------|
| 1 | 2021-01-05 | 5 |
Attempt: Create a new column counter and add counter comparing lag bool value. Then filter rows with value ge 4.
Pseudocode function:
for i in range(len(df)):
if i == 0:
if df.loc[i,'is_bool']:
df.loc[i,'counter'] = 1
else:
df.loc[i,'counter'] = 0
elif df.loc[i,'id'] == df.loc[i-1,'id'] and df.loc[i,'is_bool']:
df.loc[i,'counter'] = df.loc[i-1,'counter'] 1
else:
df.loc[i,'counter'] = 0
How can I use apply method to call the function. Or any better way of achieving this.
CodePudding user response:
We can do rolling
with groupby
n = 4
df['counter'] = (df.groupby('id')['is_bool'].rolling(n, min_periods=1).sum()
.reset_index(level=0,drop=True))
Out[163]:
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
5 0.0
6 1.0
7 2.0
8 3.0
9 1.0
Name: is_bool, dtype: float64
Update
cond = df.groupby([df['id'],(~df['is_bool']).cumsum()])['is_bool'].cumsum()
out = df[cond == cond.max()].copy()
out['max_count'] = cond.max()
out
Out[220]:
id log_date is_bool max_count
5 1 2021-01-05 07:28:00 True 5
CodePudding user response:
Only keeping groups with 4 or more consecutive True
df.groupby('id').filter(lambda x: x['is_bool'].rolling(4).sum().max() >= 4)
Output:
id is_bool
0 1 False
1 1 True
2 1 True
3 1 True
4 1 True
Opposite:
df.groupby('id').filter(lambda x: x['is_bool'].rolling(4).sum().max() < 4)
Output:
id is_bool
5 2 False
6 2 True
7 2 True
8 2 True
CodePudding user response:
Converting the bool entries to integers and using the fact that at least 4 subsequent True
entries (1
when converted to integers) have a rolling sum >=4
yields
list(df[df['is_bool'].astype(int).rolling(4).sum() >= 4.0]['id'])
> [1, 3]