So I've been struggling with this for two days now and finally managed to make it work, but I wonder if there is a way to speed it up, since I have a lot of data to process.
The goal is, for each line of every column of my DataFrame, to compute an incremental sum (elt(n-1) + elt(n)), take the absolute value, and compare it to the previous one so that, at the last element of the column, I obtain the max value. I thought simply using a rolling sum or a simple column sum would work, but somehow I can't make it work. Those maxima are calculated over 2000 lines, rolling (so for element n I take the lines from n until n + 2000, etc.). In the end, I will have a dataframe whose length is that of the original dataframe minus 2000 elements.
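In other words (just to restate what I'm after, this is not my actual code), for one column values and a window starting at line i, I want:
import itertools

def window_max(values, i, window=2000):
    # running sum over values[i:i + window], then the largest absolute value seen
    return max(abs(s) for s in itertools.accumulate(values[i:i + window]))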
About the speed: this takes around 1 minute to complete for all 4 columns, and this is for a relatively small file of only around 5000 elements; most of my files would be 4 times bigger.
Ideally, I'd like to massively speed up what is inside the "for pulse in range(2000):" loop, but if I can speed up the entire code, that's also fine. I'm not sure exactly how I could use a list comprehension for this. I checked NumPy's accumulate() function and rolling(), but they don't give me what I want.
edit1: indents.
edit2: here is an example of the first 10 lines of input and output, for the first column only (to make it less busy here). The thing is that you need a minimum of 2000 lines of input to obtain the first item of the results, so I'm not sure it's really useful here.
Input :
-2.1477511E-12
2.0970403E-12
2.0731764E-12
1.7241669E-12
1.2260080E-12
7.3381503E-13
8.2330457E-13
-9.2472616E-13
-1.1275693E-12
-1.3184806E-12
Output:
2.25436311E-10
2.28640040E-10
2.27405083E-10
2.25331907E-10
2.23607740E-10
2.22381732E-10
2.21647917E-10
2.20824612E-10
2.21749338E-10
2.22876908E-10
Here's my code:
ys_integral_check_reduced = ys_integral_check[['A', 'B', 'C', 'D']]
ys_integral_check_reduced_final = pd.DataFrame(columns=['A', 'B', 'C', 'D'])

for col in ys_integral_check_reduced.columns:
    i = 0
    # slide the 2000-line window down the column until it would run past the end
    while (ys_integral_check_reduced.loc[i + 1999, col] != 0 and i < len(ys_integral_check_reduced) - 2000):
        cur = 0
        max = 0
        # running sum over the window, keeping the largest absolute value seen
        for pulse in range(2000):
            cur = cur + ys_integral_check_reduced.loc[i + pulse, col]
            if abs(cur) > max:
                max = abs(cur)
        ys_integral_check_reduced_final.loc[i, col] = max
        i = i + 1
print(ys_integral_check_reduced_final)
CodePudding user response:
If I understood correctly, I created a toy example (WINDOW size of 3).
import pandas as pd
WINDOW = 3
ys_integral_check = pd.DataFrame({'A': [1, 2, -5, -6, 1, -10, -1, -10, 7, 4, 5, 6],
                                  'B': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})
ys_integral_check['C'] = -ys_integral_check['B']
Which looks like this:
A B C
0 1 1 -1
1 2 2 -2
2 -5 3 -3
3 -6 4 -4
4 1 5 -5
5 -10 6 -6
6 -1 7 -7
7 -10 8 -8
8 7 9 -9
9 4 10 -10
10 5 11 -11
11 6 12 -12
Your solution gives:
ys_integral_check_reduced_final = pd.DataFrame(columns=['A', 'B', 'C'])
ys_integral_check_reduced = ys_integral_check[['A', 'B', 'C']]

for col in ys_integral_check_reduced.columns:
    i = 0
    while (ys_integral_check_reduced.loc[i + WINDOW - 1, col] != 0 and i < len(ys_integral_check_reduced) - WINDOW):
        cur = 0
        max = 0
        for pulse in range(WINDOW):
            cur = cur + ys_integral_check_reduced.loc[i + pulse, col]
            if abs(cur) > max:
                max = abs(cur)
        ys_integral_check_reduced_final.loc[i, col] = max
        i = i + 1
print(ys_integral_check_reduced_final)
A B C
0 3 6 6
1 9 9 9
2 11 12 12
3 15 15 15
4 10 18 18
5 21 21 21
6 11 24 24
7 10 27 27
8 16 30 30
Here is a variant using Pandas and rolling().apply():
ys_integral_check_reduced_final = (
    ys_integral_check[['A', 'B', 'C']]
    .rolling(WINDOW)
    .apply(lambda w: w.cumsum().abs().max())
    .dropna()
    .reset_index(drop=True)
)
Which gives:
A B C
0 3.0 6.0 6.0
1 9.0 9.0 9.0
2 11.0 12.0 12.0
3 15.0 15.0 15.0
4 10.0 18.0 18.0
5 21.0 21.0 21.0
6 11.0 24.0 24.0
7 10.0 27.0 27.0
8 16.0 30.0 30.0
9 15.0 33.0 33.0
There is an extra row, because I believe your solution skips a possible window at the end.
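If you need to reproduce your original output exactly, you can simply drop that last row, for instance:
ys_integral_check_reduced_final = ys_integral_check_reduced_final.iloc[:-1]  # trim the extra final window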
I tested it on a random DataFrame with 100'000 rows and 3 columns and a window size of 2000 and it took 18 seconds to process:
import time
import numpy as np
WINDOW = 2000
DF_SIZE = 100000
test_df = pd.DataFrame(np.random.random((DF_SIZE, 3)), columns=list('ABC'))
t0 = time.time()
test_df.rolling(WINDOW).apply(lambda w: w.cumsum().abs().max()).dropna().reset_index(drop=True)
t1 = time.time()
print(t1 - t0)  # 18.102170944213867
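If that is still too slow, one possible further speed-up (a sketch, assuming NumPy >= 1.20, and reusing test_df and WINDOW from above) is to vectorize each column with sliding_window_view. Note that the cumulative sum materializes an (n - WINDOW + 1, WINDOW) array per column, so very long series may need to be processed in chunks:
from numpy.lib.stride_tricks import sliding_window_view

def rolling_abs_cumsum_max(col, window):
    # Each row of `windows` is one length-`window` slice of the column (a view, no copy).
    windows = sliding_window_view(col.to_numpy(), window)
    # Cumulative sum inside each window, then the largest absolute value per window.
    return np.abs(np.cumsum(windows, axis=1)).max(axis=1)

result = pd.DataFrame({c: rolling_abs_cumsum_max(test_df[c], WINDOW)
                       for c in test_df.columns})
This computes the same per-window quantity as the lambda above, just on raw NumPy arrays instead of per-window Series, which avoids most of the Python-level overhead.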