I have a dataframe that looks like that:
col1
0 10
1 5
2 8
3 12
4 13
5 6
6 9
7 11
8 10
9 3
10 21
11 18
12 14
13 16
14 30
15 45
16 31
17 40
18 38
For each cell in 'col1' I calculate a range of values:
df['df_min'] = df.col1 - df.col1 * 0.2
df['df_max'] = df.col1 df.col1 * 0.2
For each cell there's a range, I would like to count how many cells in 'col1' in the past xx cells (3 in this example) are within that range, but without a loop as it takes a very long time with my actual model.
I'm trying to achieve this result:
col1 df_min df_max counter
0 10 8.0 12.0 -1
1 5 4.0 6.0 -1
2 8 6.4 9.6 -1
3 12 9.6 14.4 1
4 13 10.4 15.6 1
5 6 4.8 7.2 0
6 9 7.2 10.8 0
7 11 8.8 13.2 2
8 10 8.0 12.0 2
9 3 2.4 3.6 0
10 21 16.8 25.2 0
11 18 14.4 21.6 1
12 14 11.2 16.8 0
13 16 12.8 19.2 2
14 30 24.0 36.0 0
15 45 36.0 54.0 0
16 31 24.8 37.2 1
17 40 32.0 48.0 1
18 38 30.4 45.6 3
Here's the (messy) code that I could come up with, but I'd really like a faster solution, if possible. Any help would be gladly appreciated.
df = pd.DataFrame({"col1":[10, 5, 8, 12, 13, 6, 9, 11, 10, 3, 21, 18, 14, 16, 30, 45, 31, 40, 38]})
back = 3 # numbers of cells to check back
df['df_min'] = df.col1 - df.col1 * 0.2
df['df_max'] = df.col1 df.col1 * 0.2
l = []
for window in df.col1.rolling(window=back 1, center=False, closed='right'):
if window.empty:
pass
else:
a = window.iloc[-1]
range_min = a - a * 0.2
range_max = a a * 0.2
c = 0
if len(window) == back 1:
for b in window:
if (b >= range_min and b <= range_max):
c = 1
c = c-1 # substract 1 because window includes the tested value which is always true
l.append(c)
df1 = pd.DataFrame(l, columns=['counter'])
df = df.join(df1)
print(df)
CodePudding user response:
loop with vectorization operation
Code
df['df_min'] = df.col1 - df.col1 * 0.2
df['df_max'] = df.col1 df.col1 * 0.2
n = 3
s = pd.Series(dtype='float')
for i in range(0, n):
s1 = df.col1.shift(i 1).ge(df['df_min']) & df.col1.shift(i 1).le(df['df_max'])
s = s.add(s1, fill_value=0)
s[:n] = -1
df['counter'] = s
output(df
):
col1 df_min df_max counter
0 10 8.0 12.0 -1.0
1 5 4.0 6.0 -1.0
2 8 6.4 9.6 -1.0
3 12 9.6 14.4 1.0
4 13 10.4 15.6 1.0
5 6 4.8 7.2 0.0
6 9 7.2 10.8 0.0
7 11 8.8 13.2 2.0
8 10 8.0 12.0 2.0
9 3 2.4 3.6 0.0
10 21 16.8 25.2 0.0
11 18 14.4 21.6 1.0
12 14 11.2 16.8 0.0
13 16 12.8 19.2 2.0
14 30 24.0 36.0 0.0
15 45 36.0 54.0 0.0
16 31 24.8 37.2 1.0
17 40 32.0 48.0 1.0
18 38 30.4 45.6 3.0
i don know your dataset. However, when im testing with 1,000,000 rows and n = 10
, this code takes only 0.4sec.
test example
import numpy as np
df = pd.DataFrame(np.random.randint(20,100, 1000000), columns=['col1'])
CodePudding user response:
Another possible solution, based on pandas.DataFrame.rolling
n = 3
def f(x):
x = x.reset_index(drop=True)
if (len(x) < (n 1)):
result = -1
else:
result = (x.loc[:(n-1), 'col1']
.between(x.loc[n, 'df_min'], x.loc[n, 'df_max']).sum())
return result
rol = df.rolling(n 1)
df['counter'] = [f(x) for x in rol]
Output:
col1 df_min df_max counter
0 10 8.0 12.0 -1
1 5 4.0 6.0 -1
2 8 6.4 9.6 -1
3 12 9.6 14.4 1
4 13 10.4 15.6 1
5 6 4.8 7.2 0
6 9 7.2 10.8 0
7 11 8.8 13.2 2
8 10 8.0 12.0 2
9 3 2.4 3.6 0
10 21 16.8 25.2 0
11 18 14.4 21.6 1
12 14 11.2 16.8 0
13 16 12.8 19.2 2
14 30 24.0 36.0 0
15 45 36.0 54.0 0
16 31 24.8 37.2 1
17 40 32.0 48.0 1
18 38 30.4 45.6 3