I would like to identify what I call "periods" of data stocked in a pandas dataframe.
Let's say i have these values:
values
1 0
2 8
3 1
4 0
5 5
6 6
7 4
8 7
9 0
10 2
11 9
12 1
13 0
I would like to identify sequences of strictly positive numbers with length superior or equal to 3 numbers. Each non strictly positive numbers would end an ongoing sequence.
This would give :
values period
1 0 None
2 8 None
3 1 None
4 0 None
5 5 1
6 6 1
7 4 1
8 7 1
9 0 None
10 2 2
11 9 2
12 1 2
13 0 None
CodePudding user response:
Using boolean arithmetics:
N = 3
m1 = df['values'].le(0)
m2 = df.groupby(m1.cumsum())['values'].transform('count').gt(N)
df['period'] = (m1&m2).cumsum().where((~m1)&m2)
output:
values period
1 0 NaN
2 8 NaN
3 1 NaN
4 0 NaN
5 5 1.0
6 6 1.0
7 4 1.0
8 7 1.0
9 0 NaN
10 2 2.0
11 9 2.0
12 1 2.0
13 0 NaN
intermediates:
values m1 m2 CS(m1) m1&m2 CS(m1&m2) (~m1)&m2 period
1 0 True False 1 False 0 False NaN
2 8 False False 1 False 0 False NaN
3 1 False False 1 False 0 False NaN
4 0 True True 2 True 1 False NaN
5 5 False True 2 False 1 True 1.0
6 6 False True 2 False 1 True 1.0
7 4 False True 2 False 1 True 1.0
8 7 False True 2 False 1 True 1.0
9 0 True True 3 True 2 False NaN
10 2 False True 3 False 2 True 2.0
11 9 False True 3 False 2 True 2.0
12 1 False True 3 False 2 True 2.0
13 0 True False 4 False 2 False NaN
CodePudding user response:
You can try
sign = np.sign(df['values'])
m = sign.ne(sign.shift()).cumsum() # continuous same value group
df['period'] = (df[sign.eq(1)] # Exclude non-positive numbers
.groupby(m)
['values'].filter(lambda col: len(col) >= 3)
.groupby(m)
.ngroup() 1
)
print(df)
values period
1 0 NaN
2 8 NaN
3 1 NaN
4 0 NaN
5 5 1.0
6 6 1.0
7 4 1.0
8 7 1.0
9 0 NaN
10 2 2.0
11 9 2.0
12 1 2.0
13 0 NaN
CodePudding user response:
One simple approach using find_peaks
to find the plateaus (positive consecutive integers) of at least size 3:
import numpy as np
import pandas as pd
from scipy.signal import find_peaks
df = pd.DataFrame.from_dict(
{'values': {0: 0, 1: 8, 2: 1, 3: 0, 4: 5, 5: 6, 6: 4, 7: 7, 8: 0, 9: 2, 10: 9, 11: 1, 12: 0}})
mask = (df["values"] > 0)
_, plateaus = find_peaks(mask.to_numpy(), plateau_size=3)
res = np.zeros(len(mask), dtype=np.int32)
for i, (l, r) in enumerate(zip(plateaus["left_edges"], plateaus["right_edges"]), 1):
res[l:r 1] = i
df["periods"] = res
print(df)
Output
values periods
0 0 0
1 8 0
2 1 0
3 0 0
4 5 1
5 6 1
6 4 1
7 7 1
8 0 0
9 2 2
10 9 2
11 1 2
12 0 0
CodePudding user response:
A simple solution:
count = 0
n_groups = 0
seq_idx = [None]*len(df)
for i in range(len(df)):
if df.iloc[i]['values'] > 0:
count = 1
else:
if count >= 3:
n_groups = 1
seq_idx[i-count: i] = [n_groups]*count
count = 0
df['period'] = seq_idx
Output:
values period
0 0 NaN
1 8 NaN
2 1 NaN
3 0 NaN
4 5 1.0
5 6 1.0
6 4 1.0
7 7 1.0
8 0 NaN
9 2 2.0
10 9 2.0
11 1 2.0
12 0 NaN