Home > database >  How to count cells that are within 2 values, in a range of cells in a pandas dataframe?
How to count cells that are within 2 values, in a range of cells in a pandas dataframe?

Time:12-04

I have a dataframe that looks like that:

    col1
0     10
1      5
2      8
3     12
4     13
5      6
6      9
7     11
8     10
9      3
10    21
11    18
12    14
13    16
14    30
15    45
16    31
17    40
18    38

For each cell in 'col1' I calculate a range of values:

df['df_min'] = df.col1 - df.col1 * 0.2
df['df_max'] = df.col1   df.col1 * 0.2

For each cell there's a range, I would like to count how many cells in 'col1' in the past xx cells (3 in this example) are within that range, but without a loop as it takes a very long time with my actual model.

I'm trying to achieve this result:

    col1  df_min  df_max  counter
0     10     8.0    12.0       -1
1      5     4.0     6.0       -1
2      8     6.4     9.6       -1
3     12     9.6    14.4        1
4     13    10.4    15.6        1
5      6     4.8     7.2        0
6      9     7.2    10.8        0
7     11     8.8    13.2        2
8     10     8.0    12.0        2
9      3     2.4     3.6        0
10    21    16.8    25.2        0
11    18    14.4    21.6        1
12    14    11.2    16.8        0
13    16    12.8    19.2        2
14    30    24.0    36.0        0
15    45    36.0    54.0        0
16    31    24.8    37.2        1
17    40    32.0    48.0        1
18    38    30.4    45.6        3

Here's the (messy) code that I could come up with, but I'd really like a faster solution, if possible. Any help would be gladly appreciated.

df = pd.DataFrame({"col1":[10, 5, 8, 12, 13, 6, 9, 11, 10, 3, 21, 18, 14, 16, 30, 45, 31, 40, 38]})

back = 3 # numbers of cells to check back

df['df_min'] = df.col1 - df.col1 * 0.2
df['df_max'] = df.col1   df.col1 * 0.2

l = []
for window in df.col1.rolling(window=back 1, center=False, closed='right'):
    if window.empty:
        pass
    else:
        a = window.iloc[-1]
        range_min = a - a * 0.2
        range_max = a   a * 0.2
        c = 0
        if len(window) == back 1:
            for b in window:
                if (b >= range_min and b <= range_max):
                    c  = 1
        c = c-1 # substract 1 because window includes the tested value which is always true
        l.append(c)
df1 = pd.DataFrame(l, columns=['counter'])

df = df.join(df1)

print(df)

CodePudding user response:

loop with vectorization operation

Code

df['df_min'] = df.col1 - df.col1 * 0.2
df['df_max'] = df.col1   df.col1 * 0.2
n = 3
s = pd.Series(dtype='float')
for i in range(0, n):
    s1 = df.col1.shift(i 1).ge(df['df_min']) & df.col1.shift(i 1).le(df['df_max'])
    s = s.add(s1, fill_value=0)
s[:n] = -1
df['counter'] = s

output(df):

    col1    df_min  df_max  counter
0   10      8.0     12.0    -1.0
1   5       4.0     6.0     -1.0
2   8       6.4     9.6     -1.0
3   12      9.6     14.4    1.0
4   13      10.4    15.6    1.0
5   6       4.8     7.2     0.0
6   9       7.2     10.8    0.0
7   11      8.8     13.2    2.0
8   10      8.0     12.0    2.0
9   3       2.4     3.6     0.0
10  21      16.8    25.2    0.0
11  18      14.4    21.6    1.0
12  14      11.2    16.8    0.0
13  16      12.8    19.2    2.0
14  30      24.0    36.0    0.0
15  45      36.0    54.0    0.0
16  31      24.8    37.2    1.0
17  40      32.0    48.0    1.0
18  38      30.4    45.6    3.0



i don know your dataset. However, when im testing with 1,000,000 rows and n = 10, this code takes only 0.4sec.


test example

import numpy as np
df = pd.DataFrame(np.random.randint(20,100, 1000000), columns=['col1'])

CodePudding user response:

Another possible solution, based on pandas.DataFrame.rolling

n = 3

def f(x):
    x = x.reset_index(drop=True)
    
    if (len(x) < (n 1)):
        result = -1
    else:
        result = (x.loc[:(n-1), 'col1']
                  .between(x.loc[n, 'df_min'], x.loc[n, 'df_max']).sum())
        
    return result
        
    
rol = df.rolling(n 1)

df['counter'] = [f(x) for x in rol]

Output:

    col1  df_min  df_max  counter
0     10     8.0    12.0       -1
1      5     4.0     6.0       -1
2      8     6.4     9.6       -1
3     12     9.6    14.4        1
4     13    10.4    15.6        1
5      6     4.8     7.2        0
6      9     7.2    10.8        0
7     11     8.8    13.2        2
8     10     8.0    12.0        2
9      3     2.4     3.6        0
10    21    16.8    25.2        0
11    18    14.4    21.6        1
12    14    11.2    16.8        0
13    16    12.8    19.2        2
14    30    24.0    36.0        0
15    45    36.0    54.0        0
16    31    24.8    37.2        1
17    40    32.0    48.0        1
18    38    30.4    45.6        3
  • Related