Home > database >  Is there a more optimal way of filtering Pandas dataframe by a column value, when that column is ord
Is there a more optimal way of filtering Pandas dataframe by a column value, when that column is ord

Time:03-14

In Pandas, I have a large DF with millions of rows. There are typically thousands of rows per a particular date, all of them relevant to a particular event.

I want to iterate through, processing in groups the rows with a shared date. For example, this is my current approach:

# Gets me unique date strings
dates = df['date'].unique()

for date in dates:
  dateDF = df[df['date'] == date]
  # Do stuff with these df rows

Iterating this across thousands of dates is relatively slow. The slow part is the filtering. I imagine it's searching every element within the larger data frame, for ones which share the same date. However - all of the DF elements are chronological. So for the first date, which may be 1/1/22 10:00:00 0000, rows 0-1000 may share this date, and then none of the other millions of rows share it. The next one, may be rows 1001-2000.

So I wonder if there's a more optimal way to do this: I know that all of these rows which share a date are grouped together, so there's no need to check beyond the one after the last one, which will have a different date.

Is there a more optimal way I could be doing this?

CodePudding user response:

I haven't so big data to test it but it has .groupby() and later you can works with every group separatelly or run some functions on all groups at once - i.e. .sum(), .min(), .max() or even .apply().

df.groupby('date').sum()

df.groupby('date').min()

df.groupby('date').max()

df.groupby('date').apply(lambda x: -x['value'] * 2 )

You can use groupby with for-loop to work with every group separatelly

for date, grp in df.groupby('date'):
    print('\n---', date, '---\n')
    print(grp[:3])

Minimal working code with 1_000_000 rows and 6 columns but only with 25 groups/dates. So every date has

import pandas as pd
import random
import time

# randome data - but with `seed(0)` it generates always the same data
random.seed(0)
df = pd.DataFrame({
    'date': [ f'2022.03.{i:02}' for _ in range(40_000)
                for i in range(25)],
    'value1': random.choices(range(10), k=1_000_000),
    'value2': random.choices(range(10), k=1_000_000),
    'value3': random.choices(range(10), k=1_000_000),
    'value4': random.choices(range(10), k=1_000_000),
    'value5': random.choices(range(10), k=1_000_000),
})

start = time.time()
print('\n--- sum ---\n')
print(df.groupby('date').sum())
end = time.time()
print(f'\ntime: {end-start:.2f} seconds')

start = time.time()
print('\n--- min ---\n')
print(df.groupby('date').min())
end = time.time()
print(f'\ntime: {end-start:.2f} seconds')

start = time.time()
print('\n--- max ---\n')
print(df.groupby('date').max())
end = time.time()
print(f'\ntime: {end-start:.2f} seconds')

start = time.time()
print('\n--- apply ---\n')
print(df.groupby('date').apply(lambda x: -x['value1'] * 2 ))
end = time.time()
print(f'\ntime: {end-start:.2f} seconds')

start = time.time()
for date, grp in df.groupby('date'):
    print('\n---', date, '---\n')
    print(grp[:3])
end = time.time()
print(f'\ntime: {end-start:.2f} seconds')

Result (with times):

--- sum ---

            value1  value2  value3  value4  value5
date                                              
2022.03.00  179425  180291  179620  180332  180839
2022.03.01  180287  179782  179759  180502  180240
2022.03.02  179555  180402  180356  180607  180100
2022.03.03  180286  180369  180711  180005  180127
2022.03.04  180748  180458  180407  179553  181962
2022.03.05  179400  180108  179886  179032  179100
2022.03.06  179840  180928  180419  180632  179872
2022.03.07  179978  179485  179491  179624  179480
2022.03.08  180303  180035  180178  180046  180248
2022.03.09  179888  179902  179356  180080  180490
2022.03.10  179857  180206  179456  181891  179755
2022.03.11  180794  179728  180788  180188  179809
2022.03.12  179412  179732  181141  180800  179825
2022.03.13  179282  180001  180674  179768  179096
2022.03.14  180105  180184  179754  179983  180434
2022.03.15  179224  180054  180243  179966  179743
2022.03.16  179437  178779  179854  179732  179718
2022.03.17  179559  180115  179454  179088  180622
2022.03.18  180319  179567  180182  180466  180690
2022.03.19  180789  180021  180762  179747  179533
2022.03.20  180664  179912  180141  180142  179033
2022.03.21  179183  180499  181146  180034  179364
2022.03.22  179403  179754  181065  180015  180279
2022.03.23  179300  179739  179041  180516  180266
2022.03.24  179836  179909  179042  180705  180381

time: 0.24 seconds

--- min ---

            value1  value2  value3  value4  value5
date                                              
2022.03.00       0       0       0       0       0
2022.03.01       0       0       0       0       0
2022.03.02       0       0       0       0       0
2022.03.03       0       0       0       0       0
2022.03.04       0       0       0       0       0
2022.03.05       0       0       0       0       0
2022.03.06       0       0       0       0       0
2022.03.07       0       0       0       0       0
2022.03.08       0       0       0       0       0
2022.03.09       0       0       0       0       0
2022.03.10       0       0       0       0       0
2022.03.11       0       0       0       0       0
2022.03.12       0       0       0       0       0
2022.03.13       0       0       0       0       0
2022.03.14       0       0       0       0       0
2022.03.15       0       0       0       0       0
2022.03.16       0       0       0       0       0
2022.03.17       0       0       0       0       0
2022.03.18       0       0       0       0       0
2022.03.19       0       0       0       0       0
2022.03.20       0       0       0       0       0
2022.03.21       0       0       0       0       0
2022.03.22       0       0       0       0       0
2022.03.23       0       0       0       0       0
2022.03.24       0       0       0       0       0

time: 0.17 seconds

--- max ---

            value1  value2  value3  value4  value5
date                                              
2022.03.00       9       9       9       9       9
2022.03.01       9       9       9       9       9
2022.03.02       9       9       9       9       9
2022.03.03       9       9       9       9       9
2022.03.04       9       9       9       9       9
2022.03.05       9       9       9       9       9
2022.03.06       9       9       9       9       9
2022.03.07       9       9       9       9       9
2022.03.08       9       9       9       9       9
2022.03.09       9       9       9       9       9
2022.03.10       9       9       9       9       9
2022.03.11       9       9       9       9       9
2022.03.12       9       9       9       9       9
2022.03.13       9       9       9       9       9
2022.03.14       9       9       9       9       9
2022.03.15       9       9       9       9       9
2022.03.16       9       9       9       9       9
2022.03.17       9       9       9       9       9
2022.03.18       9       9       9       9       9
2022.03.19       9       9       9       9       9
2022.03.20       9       9       9       9       9
2022.03.21       9       9       9       9       9
2022.03.22       9       9       9       9       9
2022.03.23       9       9       9       9       9
2022.03.24       9       9       9       9       9

time: 0.16 seconds

--- apply ---

date              
2022.03.00  0        -16
            25        -2
            50       -16
            75         0
            100      -12
                      ..
2022.03.24  999899     0
            999924   -14
            999949    -4
            999974   -16
            999999    -6
Name: value1, Length: 1000000, dtype: int64

time: 0.96 seconds

--- 2022.03.00 ---

          date  value1  value2  value3  value4  value5
0   2022.03.00       8       3       2       3       8
25  2022.03.00       1       9       7       8       4
50  2022.03.00       8       5       3       6       5

--- 2022.03.01 ---

          date  value1  value2  value3  value4  value5
1   2022.03.01       7       1       3       2       3
26  2022.03.01       4       1       9       8       5
51  2022.03.01       4       5       5       9       5

--- 2022.03.02 ---

          date  value1  value2  value3  value4  value5
2   2022.03.02       4       5       9       8       5
27  2022.03.02       6       7       1       6       6
52  2022.03.02       0       1       8       8       5

--- 2022.03.03 ---

          date  value1  value2  value3  value4  value5
3   2022.03.03       2       4       2       9       1
28  2022.03.03       9       8       1       8       2
53  2022.03.03       3       1       4       7       7

--- 2022.03.04 ---

          date  value1  value2  value3  value4  value5
4   2022.03.04       5       9       5       9       6
29  2022.03.04       9       7       0       6       2
54  2022.03.04       5       4       2       3       9

--- 2022.03.05 ---

          date  value1  value2  value3  value4  value5
5   2022.03.05       4       9       8       0       4
30  2022.03.05       4       0       8       1       6
55  2022.03.05       9       8       4       5       7

--- 2022.03.06 ---

          date  value1  value2  value3  value4  value5
6   2022.03.06       7       8       7       4       2
31  2022.03.06       8       4       1       9       1
56  2022.03.06       1       5       3       1       8

--- 2022.03.07 ---

          date  value1  value2  value3  value4  value5
7   2022.03.07       3       8       3       3       4
32  2022.03.07       2       6       3       9       2
57  2022.03.07       5       5       6       9       3

--- 2022.03.08 ---

          date  value1  value2  value3  value4  value5
8   2022.03.08       4       3       6       0       6
33  2022.03.08       8       2       5       4       0
58  2022.03.08       7       3       1       7       5

--- 2022.03.09 ---

          date  value1  value2  value3  value4  value5
9   2022.03.09       5       9       4       1       6
34  2022.03.09       5       4       2       8       2
59  2022.03.09       5       4       4       4       4

--- 2022.03.10 ---

          date  value1  value2  value3  value4  value5
10  2022.03.10       9       7       5       1       3
35  2022.03.10       0       4       9       7       6
60  2022.03.10       8       1       1       0       3

--- 2022.03.11 ---

          date  value1  value2  value3  value4  value5
11  2022.03.11       5       9       7       3       0
36  2022.03.11       7       7       5       6       2
61  2022.03.11       5       7       5       7       2

--- 2022.03.12 ---

          date  value1  value2  value3  value4  value5
12  2022.03.12       2       8       8       9       0
37  2022.03.12       3       1       0       3       8
62  2022.03.12       9       5       1       0       6

--- 2022.03.13 ---

          date  value1  value2  value3  value4  value5
13  2022.03.13       7       5       4       9       2
38  2022.03.13       8       9       9       8       6
63  2022.03.13       6       1       7       1       5

--- 2022.03.14 ---

          date  value1  value2  value3  value4  value5
14  2022.03.14       6       7       0       2       0
39  2022.03.14       6       4       9       8       4
64  2022.03.14       5       4       6       3       1

--- 2022.03.15 ---

          date  value1  value2  value3  value4  value5
15  2022.03.15       2       1       0       3       7
40  2022.03.15       0       8       7       0       8
65  2022.03.15       4       1       6       8       3

--- 2022.03.16 ---

          date  value1  value2  value3  value4  value5
16  2022.03.16       9       3       9       5       8
41  2022.03.16       4       4       4       1       4
66  2022.03.16       5       5       4       3       7

--- 2022.03.17 ---

          date  value1  value2  value3  value4  value5
17  2022.03.17       9       8       5       1       6
42  2022.03.17       8       1       1       1       9
67  2022.03.17       3       0       9       7       2

--- 2022.03.18 ---

          date  value1  value2  value3  value4  value5
18  2022.03.18       8       2       8       1       9
43  2022.03.18       2       1       6       9       6
68  2022.03.18       5       9       6       8       3

--- 2022.03.19 ---

          date  value1  value2  value3  value4  value5
19  2022.03.19       9       6       2       1       2
44  2022.03.19       3       1       1       2       7
69  2022.03.19       2       2       6       1       1

--- 2022.03.20 ---

          date  value1  value2  value3  value4  value5
20  2022.03.20       3       1       0       8       3
45  2022.03.20       8       9       8       9       1
70  2022.03.20       1       3       6       6       1

--- 2022.03.21 ---

          date  value1  value2  value3  value4  value5
21  2022.03.21       7       9       6       2       6
46  2022.03.21       1       6       5       0       2
71  2022.03.21       1       6       7       5       7

--- 2022.03.22 ---

          date  value1  value2  value3  value4  value5
22  2022.03.22       8       3       9       9       9
47  2022.03.22       5       4       1       2       4
72  2022.03.22       6       1       4       9       1

--- 2022.03.23 ---

          date  value1  value2  value3  value4  value5
23  2022.03.23       6       4       7       0       7
48  2022.03.23       2       8       9       3       0
73  2022.03.23       6       3       5       2       6

--- 2022.03.24 ---

          date  value1  value2  value3  value4  value5
24  2022.03.24       4       1       1       0       0
49  2022.03.24       9       5       0       3       8
74  2022.03.24       4       8       0       1       2

time: 0.72 seconds

EDIT:

I tested with 1_000 groups

df = pd.DataFrame({
    'date': [ f'2022.03.{i:04}' for _ in range(1_000)
                for i in range(1_000)],
    'value1': random.choices(range(10), k=1_000_000),
    'value2': random.choices(range(10), k=1_000_000),
    'value3': random.choices(range(10), k=1_000_000),
    'value4': random.choices(range(10), k=1_000_000),
    'value5': random.choices(range(10), k=1_000_000),
})

and standard functions .sum(), .max() etc. need similar time - less than 1 second.

for-loop needed 14 seconds because it has to print() more values.
Without print() it needs less than 1 second.

  • Related