How can I speed this up with Numpy?

Time:05-04

So I've been struggling with this for 2 days now and finally managed to make it work, but I wonder if there is a way to speed it up, since I have a lot of data to process.

The goal here is, for each line of every column of my DataFrame, to compute an incremental sum (elt(n-1) + elt(n)), then take the absolute value and compare the local absolute value to the previous one in order to, at the last element of my column, obtain the max value. I thought simply using a rolling sum or a simple column sum would work, but somehow I can't make it work. Those maxima are calculated over 2000 lines, rolling (so for element n I take the lines from line n until line n + 2000, etc.). In the end, I will have a DataFrame with the length of the original DataFrame minus 2000 elements.

About the speed: this takes around 1 minute to complete for all 4 columns (and this is for a relatively small file of only around 5000 elements; most of them would be 4 times bigger).

Ideally, I'd like to massively speed up what is inside the "for pulse in range(2000):" loop, but if I can speed up the entire code that's also fine. I'm not sure exactly how I could use a list comprehension for this. I checked NumPy's accumulate() and pandas' rolling(), but they don't give me what I want.

edit1: indents.

edit2: here is an example of the first 10 lines of input and output, for the first column only (to keep it less busy here). The thing is that you need a minimum of 2000 lines of input to obtain the first item in the results, so I'm not sure it's really useful here.

   Input : 

   -2.1477511E-12
   2.0970403E-12
   2.0731764E-12
   1.7241669E-12
   1.2260080E-12
   7.3381503E-13
   8.2330457E-13
   -9.2472616E-13
   -1.1275693E-12
   -1.3184806E-12


   Output: 
   2.25436311E-10
   2.28640040E-10
   2.27405083E-10
   2.25331907E-10
   2.23607740E-10
   2.22381732E-10
   2.21647917E-10
   2.20824612E-10
   2.21749338E-10
   2.22876908E-10
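For clarity, the rule described above can be restated as a short pure-Python reference implementation (function and variable names are mine, not from the original code):

```python
def window_abs_cumsum_max(values, W):
    """For each window of W consecutive values, return the maximum
    absolute value reached by the running sum inside that window."""
    out = []
    for i in range(len(values) - W + 1):
        cur = 0.0
        best = 0.0
        for v in values[i:i + W]:
            cur += v                    # incremental sum elt(n-1) + elt(n) + ...
            best = max(best, abs(cur))  # keep the largest |running sum| seen so far
        out.append(best)
    return out
```

This is only a readable baseline for the specification, not a speedup.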

Here's my code:

    ys_integral_check_reduced = ys_integral_check[['A', 'B', 'C', 'D']]
    for col in ys_integral_check_reduced.columns:
        pulse = 0
        i = 0
        while (ys_integral_check_reduced.loc[i + 1999, col] != 0 and i < len(ys_integral_check_reduced) - 2000):
            cur = 0
            max = 0
            for pulse in range(2000):
                cur = cur + ys_integral_check_reduced.loc[i + pulse, col]
                if abs(cur) > max:
                    max = abs(cur)
                pulse = pulse + 1
            ys_integral_check_reduced_final.loc[i, col] = max
            i = i + 1
    print(ys_integral_check_reduced_final)

CodePudding user response:

If I understood correctly, I created a toy example (WINDOW size of 3).

import pandas as pd

WINDOW = 3

ys_integral_check = pd.DataFrame({'A':[1, 2, -5, -6, 1, -10, -1, -10, 7, 4, 5, 6],
                                  'B':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})
ys_integral_check['C'] = -ys_integral_check['B']

Which looks like this:

     A   B   C
0    1   1  -1
1    2   2  -2
2   -5   3  -3
3   -6   4  -4
4    1   5  -5
5  -10   6  -6
6   -1   7  -7
7  -10   8  -8
8    7   9  -9
9    4  10 -10
10   5  11 -11
11   6  12 -12

Your solution gives:

ys_integral_check_reduced_final = pd.DataFrame(columns=['A', 'B', 'C'])
ys_integral_check_reduced = ys_integral_check[['A', 'B', 'C']]

for col in ys_integral_check_reduced.columns:
    pulse = 0
    i = 0
    while (ys_integral_check_reduced.loc[i + WINDOW - 1, col] != 0 and i < len(ys_integral_check_reduced) - WINDOW):
        cur = 0
        max = 0
        for pulse in range(WINDOW):
            cur = cur + ys_integral_check_reduced.loc[i + pulse, col]
            if abs(cur) > max:
                max = abs(cur)
            pulse = pulse + 1
        ys_integral_check_reduced_final.loc[i, col] = max
        i = i + 1
print(ys_integral_check_reduced_final)
    A   B   C
0   3   6   6
1   9   9   9
2  11  12  12
3  15  15  15
4  10  18  18
5  21  21  21
6  11  24  24
7  10  27  27
8  16  30  30

Here is a variant using Pandas and Rolling.apply():

ys_integral_check_reduced_final = (ys_integral_check[['A', 'B', 'C']]
                                   .rolling(WINDOW)
                                   .apply(lambda w: w.cumsum().abs().max())
                                   .dropna()
                                   .reset_index(drop=True))

Which gives:

      A     B     C
0   3.0   6.0   6.0
1   9.0   9.0   9.0
2  11.0  12.0  12.0
3  15.0  15.0  15.0
4  10.0  18.0  18.0
5  21.0  21.0  21.0
6  11.0  24.0  24.0
7  10.0  27.0  27.0
8  16.0  30.0  30.0
9  15.0  33.0  33.0

There is an extra row, because I believe your solution skips a possible window at the end.
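A small tweak that usually helps with `rolling().apply()` (treat the speedup as an assumption, it was not benchmarked on the original data): pass `raw=True` so the lambda receives each window as a plain NumPy array instead of a Series, and use NumPy functions inside the lambda:

```python
import numpy as np
import pandas as pd

WINDOW = 3

ys_integral_check = pd.DataFrame({'A': [1, 2, -5, -6, 1, -10, -1, -10, 7, 4, 5, 6],
                                  'B': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})
ys_integral_check['C'] = -ys_integral_check['B']

# raw=True hands each window to the lambda as a NumPy array, which
# avoids constructing a pandas Series for every single window.
result = (ys_integral_check[['A', 'B', 'C']]
          .rolling(WINDOW)
          .apply(lambda w: np.abs(np.cumsum(w)).max(), raw=True)
          .dropna()
          .reset_index(drop=True))
```

The result is identical to the Series-based version above; only the per-window overhead changes.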

I tested it on a random DataFrame with 100'000 rows and 3 columns and a window size of 2000 and it took 18 seconds to process:

import time
import numpy as np

WINDOW = 2000
DF_SIZE = 100000

test_df = pd.DataFrame(np.random.random((DF_SIZE, 3)), columns=list('ABC'))

t0 = time.time()

test_df.rolling(WINDOW).apply(lambda w: w.cumsum().abs().max()).dropna().reset_index(drop=True)

t1 = time.time()

print(t1-t0) # 18.102170944213867
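Since the title asks about NumPy specifically: the whole computation can also be vectorized with prefix sums and `numpy.lib.stride_tricks.sliding_window_view` (NumPy >= 1.20). The cumulative sum inside a window starting at row i is just S[i+k] - S[i], where S is the global prefix sum, so no per-window Python loop is needed. A sketch under those assumptions (the function name is mine; note that the intermediate array grows with rows x window, so very large inputs may need chunking):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_abs_cumsum_max(a, window):
    """Per column: for each window of `window` consecutive rows,
    return max(|cumulative sum within the window|)."""
    a = np.asarray(a, dtype=float)
    # Prefix sums with a leading zero row: S[k] = a[0] + ... + a[k-1].
    S = np.concatenate([np.zeros((1,) + a.shape[1:]), np.cumsum(a, axis=0)])
    # Windows of length window+1 over S; each window holds S[i] .. S[i+window].
    win = sliding_window_view(S, window + 1, axis=0)
    # Running sum inside the window starting at i is S[i+k] - S[i] (k = 1..window).
    diffs = win - win[..., :1]
    return np.abs(diffs[..., 1:]).max(axis=-1)
```

On the toy data above this reproduces the `rolling().apply()` output, but with all the work done in vectorized NumPy operations instead of a Python lambda per window.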