Home > Enterprise >  Isolate sequence of positive numbers in a pandas dataframe
Isolate sequence of positive numbers in a pandas dataframe

Time:07-21

I would like to identify what I call "periods" of data stocked in a pandas dataframe.

Let's say i have these values:

   values
1    0
2    8
3    1
4    0
5    5
6    6
7    4
8    7
9    0
10   2
11   9
12   1
13   0

I would like to identify sequences of strictly positive numbers with length superior or equal to 3 numbers. Each non strictly positive numbers would end an ongoing sequence.

This would give :

   values  period
1    0      None
2    8      None
3    1      None
4    0      None
5    5       1
6    6       1
7    4       1
8    7       1
9    0      None
10   2       2
11   9       2
12   1       2
13   0      None

CodePudding user response:

Using boolean arithmetics:

N = 3
m1 = df['values'].le(0)
m2 = df.groupby(m1.cumsum())['values'].transform('count').gt(N)
df['period'] = (m1&m2).cumsum().where((~m1)&m2)

output:

    values  period
1        0     NaN
2        8     NaN
3        1     NaN
4        0     NaN
5        5     1.0
6        6     1.0
7        4     1.0
8        7     1.0
9        0     NaN
10       2     2.0
11       9     2.0
12       1     2.0
13       0     NaN

intermediates:

    values     m1     m2  CS(m1)  m1&m2  CS(m1&m2)  (~m1)&m2  period
1        0   True  False       1  False          0     False     NaN
2        8  False  False       1  False          0     False     NaN
3        1  False  False       1  False          0     False     NaN
4        0   True   True       2   True          1     False     NaN
5        5  False   True       2  False          1      True     1.0
6        6  False   True       2  False          1      True     1.0
7        4  False   True       2  False          1      True     1.0
8        7  False   True       2  False          1      True     1.0
9        0   True   True       3   True          2     False     NaN
10       2  False   True       3  False          2      True     2.0
11       9  False   True       3  False          2      True     2.0
12       1  False   True       3  False          2      True     2.0
13       0   True  False       4  False          2     False     NaN

CodePudding user response:

You can try

sign = np.sign(df['values'])
m = sign.ne(sign.shift()).cumsum()  # continuous same value group

df['period'] = (df[sign.eq(1)]      # Exclude non-positive numbers
                .groupby(m)
                ['values'].filter(lambda col: len(col) >= 3)
                .groupby(m)
                .ngroup()   1
                )
print(df)

    values  period
1        0     NaN
2        8     NaN
3        1     NaN
4        0     NaN
5        5     1.0
6        6     1.0
7        4     1.0
8        7     1.0
9        0     NaN
10       2     2.0
11       9     2.0
12       1     2.0
13       0     NaN

CodePudding user response:

One simple approach using find_peaks to find the plateaus (positive consecutive integers) of at least size 3:

import numpy as np
import pandas as pd

from scipy.signal import find_peaks

df = pd.DataFrame.from_dict(
    {'values': {0: 0, 1: 8, 2: 1, 3: 0, 4: 5, 5: 6, 6: 4, 7: 7, 8: 0, 9: 2, 10: 9, 11: 1, 12: 0}})

mask = (df["values"] > 0)
_, plateaus = find_peaks(mask.to_numpy(), plateau_size=3)

res = np.zeros(len(mask), dtype=np.int32)
for i, (l, r) in enumerate(zip(plateaus["left_edges"], plateaus["right_edges"]), 1):
    res[l:r   1] = i

df["periods"] = res
print(df)

Output

    values  periods
0        0        0
1        8        0
2        1        0
3        0        0
4        5        1
5        6        1
6        4        1
7        7        1
8        0        0
9        2        2
10       9        2
11       1        2
12       0        0

CodePudding user response:

A simple solution:

count = 0
n_groups = 0
seq_idx = [None]*len(df)

for i in range(len(df)):
    
    if df.iloc[i]['values'] > 0:
        count  = 1
    else:
        if count >= 3:
            n_groups  = 1
            seq_idx[i-count: i] = [n_groups]*count
            
        count = 0
df['period'] = seq_idx

Output:

values  period
0   0   NaN
1   8   NaN
2   1   NaN
3   0   NaN
4   5   1.0
5   6   1.0
6   4   1.0
7   7   1.0
8   0   NaN
9   2   2.0
10  9   2.0
11  1   2.0
12  0   NaN
  • Related