In Pandas, I have a large DataFrame with millions of rows. There are typically thousands of rows for a particular date, all of them relevant to a particular event.
I want to iterate through the frame, processing the rows that share a date as a group. For example, this is my current approach:
# Gets me unique date strings
dates = df['date'].unique()
for date in dates:
    dateDF = df[df['date'] == date]
    # Do stuff with these df rows
Iterating this across thousands of dates is relatively slow, and the slow part is the filtering. I imagine it searches every element of the larger DataFrame for ones that share the same date. However, all of the rows are chronological. So for the first date, which may be 1/1/22 10:00:00 0000, rows 0-1000 may share this date and none of the other millions of rows do; the next date may cover rows 1001-2000.
So I wonder if there's a more optimal way to do this: since all rows that share a date are contiguous, there's no need to check anything past the first row with a different date.
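To sketch what I mean (a hypothetical illustration only, assuming the frame really is sorted by 'date'), a binary search over the column with numpy.searchsorted could find each block's boundaries instead of scanning the whole frame:

import numpy as np

# Sketch only: relies on df being sorted by 'date', so each date's rows
# form one contiguous block whose start can be found by binary search.
date_values = df['date'].to_numpy()
dates = df['date'].unique()                      # in order of appearance
starts = np.searchsorted(date_values, dates, side='left')
bounds = np.append(starts, len(df))              # add the final end bound
for start, end in zip(bounds[:-1], bounds[1:]):
    dateDF = df.iloc[start:end]                  # one date's block of rows
    # Do stuff with these df rows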
Is there a more optimal way I could be doing this?
CodePudding user response:
I don't have data that big to test with, but pandas has .groupby(), and then you can work with every group separately or run functions on all groups at once, e.g. .sum(), .min(), .max(), or even .apply().
df.groupby('date').sum()
df.groupby('date').min()
df.groupby('date').max()
df.groupby('date').apply(lambda x: -x['value1'] * 2)
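As a side note (my addition, not part of the original snippet), several aggregations can also be combined in a single call with .agg():

# Runs sum, min and max for every group in one groupby pass
df.groupby('date').agg(['sum', 'min', 'max'])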
You can also use groupby with a for-loop to work with every group separately:
for date, grp in df.groupby('date'):
    print('\n---', date, '---\n')
    print(grp[:3])
Minimal working code with 1_000_000 rows and 6 columns but only 25 groups/dates, so every date has 40_000 rows:
import pandas as pd
import random
import time

# random data - but with `seed(0)` it always generates the same data
random.seed(0)

df = pd.DataFrame({
    'date': [f'2022.03.{i:02}' for _ in range(40_000) for i in range(25)],
    'value1': random.choices(range(10), k=1_000_000),
    'value2': random.choices(range(10), k=1_000_000),
    'value3': random.choices(range(10), k=1_000_000),
    'value4': random.choices(range(10), k=1_000_000),
    'value5': random.choices(range(10), k=1_000_000),
})
start = time.time()
print('\n--- sum ---\n')
print(df.groupby('date').sum())
end = time.time()
print(f'\ntime: {end-start:.2f} seconds')
start = time.time()
print('\n--- min ---\n')
print(df.groupby('date').min())
end = time.time()
print(f'\ntime: {end-start:.2f} seconds')
start = time.time()
print('\n--- max ---\n')
print(df.groupby('date').max())
end = time.time()
print(f'\ntime: {end-start:.2f} seconds')
start = time.time()
print('\n--- apply ---\n')
print(df.groupby('date').apply(lambda x: -x['value1'] * 2 ))
end = time.time()
print(f'\ntime: {end-start:.2f} seconds')
start = time.time()
for date, grp in df.groupby('date'):
    print('\n---', date, '---\n')
    print(grp[:3])
end = time.time()
print(f'\ntime: {end-start:.2f} seconds')
Result (with times):
--- sum ---
value1 value2 value3 value4 value5
date
2022.03.00 179425 180291 179620 180332 180839
2022.03.01 180287 179782 179759 180502 180240
2022.03.02 179555 180402 180356 180607 180100
2022.03.03 180286 180369 180711 180005 180127
2022.03.04 180748 180458 180407 179553 181962
2022.03.05 179400 180108 179886 179032 179100
2022.03.06 179840 180928 180419 180632 179872
2022.03.07 179978 179485 179491 179624 179480
2022.03.08 180303 180035 180178 180046 180248
2022.03.09 179888 179902 179356 180080 180490
2022.03.10 179857 180206 179456 181891 179755
2022.03.11 180794 179728 180788 180188 179809
2022.03.12 179412 179732 181141 180800 179825
2022.03.13 179282 180001 180674 179768 179096
2022.03.14 180105 180184 179754 179983 180434
2022.03.15 179224 180054 180243 179966 179743
2022.03.16 179437 178779 179854 179732 179718
2022.03.17 179559 180115 179454 179088 180622
2022.03.18 180319 179567 180182 180466 180690
2022.03.19 180789 180021 180762 179747 179533
2022.03.20 180664 179912 180141 180142 179033
2022.03.21 179183 180499 181146 180034 179364
2022.03.22 179403 179754 181065 180015 180279
2022.03.23 179300 179739 179041 180516 180266
2022.03.24 179836 179909 179042 180705 180381
time: 0.24 seconds
--- min ---
value1 value2 value3 value4 value5
date
2022.03.00 0 0 0 0 0
2022.03.01 0 0 0 0 0
2022.03.02 0 0 0 0 0
2022.03.03 0 0 0 0 0
2022.03.04 0 0 0 0 0
2022.03.05 0 0 0 0 0
2022.03.06 0 0 0 0 0
2022.03.07 0 0 0 0 0
2022.03.08 0 0 0 0 0
2022.03.09 0 0 0 0 0
2022.03.10 0 0 0 0 0
2022.03.11 0 0 0 0 0
2022.03.12 0 0 0 0 0
2022.03.13 0 0 0 0 0
2022.03.14 0 0 0 0 0
2022.03.15 0 0 0 0 0
2022.03.16 0 0 0 0 0
2022.03.17 0 0 0 0 0
2022.03.18 0 0 0 0 0
2022.03.19 0 0 0 0 0
2022.03.20 0 0 0 0 0
2022.03.21 0 0 0 0 0
2022.03.22 0 0 0 0 0
2022.03.23 0 0 0 0 0
2022.03.24 0 0 0 0 0
time: 0.17 seconds
--- max ---
value1 value2 value3 value4 value5
date
2022.03.00 9 9 9 9 9
2022.03.01 9 9 9 9 9
2022.03.02 9 9 9 9 9
2022.03.03 9 9 9 9 9
2022.03.04 9 9 9 9 9
2022.03.05 9 9 9 9 9
2022.03.06 9 9 9 9 9
2022.03.07 9 9 9 9 9
2022.03.08 9 9 9 9 9
2022.03.09 9 9 9 9 9
2022.03.10 9 9 9 9 9
2022.03.11 9 9 9 9 9
2022.03.12 9 9 9 9 9
2022.03.13 9 9 9 9 9
2022.03.14 9 9 9 9 9
2022.03.15 9 9 9 9 9
2022.03.16 9 9 9 9 9
2022.03.17 9 9 9 9 9
2022.03.18 9 9 9 9 9
2022.03.19 9 9 9 9 9
2022.03.20 9 9 9 9 9
2022.03.21 9 9 9 9 9
2022.03.22 9 9 9 9 9
2022.03.23 9 9 9 9 9
2022.03.24 9 9 9 9 9
time: 0.16 seconds
--- apply ---
date
2022.03.00 0 -16
25 -2
50 -16
75 0
100 -12
..
2022.03.24 999899 0
999924 -14
999949 -4
999974 -16
999999 -6
Name: value1, Length: 1000000, dtype: int64
time: 0.96 seconds
--- 2022.03.00 ---
date value1 value2 value3 value4 value5
0 2022.03.00 8 3 2 3 8
25 2022.03.00 1 9 7 8 4
50 2022.03.00 8 5 3 6 5
--- 2022.03.01 ---
date value1 value2 value3 value4 value5
1 2022.03.01 7 1 3 2 3
26 2022.03.01 4 1 9 8 5
51 2022.03.01 4 5 5 9 5
--- 2022.03.02 ---
date value1 value2 value3 value4 value5
2 2022.03.02 4 5 9 8 5
27 2022.03.02 6 7 1 6 6
52 2022.03.02 0 1 8 8 5
--- 2022.03.03 ---
date value1 value2 value3 value4 value5
3 2022.03.03 2 4 2 9 1
28 2022.03.03 9 8 1 8 2
53 2022.03.03 3 1 4 7 7
--- 2022.03.04 ---
date value1 value2 value3 value4 value5
4 2022.03.04 5 9 5 9 6
29 2022.03.04 9 7 0 6 2
54 2022.03.04 5 4 2 3 9
--- 2022.03.05 ---
date value1 value2 value3 value4 value5
5 2022.03.05 4 9 8 0 4
30 2022.03.05 4 0 8 1 6
55 2022.03.05 9 8 4 5 7
--- 2022.03.06 ---
date value1 value2 value3 value4 value5
6 2022.03.06 7 8 7 4 2
31 2022.03.06 8 4 1 9 1
56 2022.03.06 1 5 3 1 8
--- 2022.03.07 ---
date value1 value2 value3 value4 value5
7 2022.03.07 3 8 3 3 4
32 2022.03.07 2 6 3 9 2
57 2022.03.07 5 5 6 9 3
--- 2022.03.08 ---
date value1 value2 value3 value4 value5
8 2022.03.08 4 3 6 0 6
33 2022.03.08 8 2 5 4 0
58 2022.03.08 7 3 1 7 5
--- 2022.03.09 ---
date value1 value2 value3 value4 value5
9 2022.03.09 5 9 4 1 6
34 2022.03.09 5 4 2 8 2
59 2022.03.09 5 4 4 4 4
--- 2022.03.10 ---
date value1 value2 value3 value4 value5
10 2022.03.10 9 7 5 1 3
35 2022.03.10 0 4 9 7 6
60 2022.03.10 8 1 1 0 3
--- 2022.03.11 ---
date value1 value2 value3 value4 value5
11 2022.03.11 5 9 7 3 0
36 2022.03.11 7 7 5 6 2
61 2022.03.11 5 7 5 7 2
--- 2022.03.12 ---
date value1 value2 value3 value4 value5
12 2022.03.12 2 8 8 9 0
37 2022.03.12 3 1 0 3 8
62 2022.03.12 9 5 1 0 6
--- 2022.03.13 ---
date value1 value2 value3 value4 value5
13 2022.03.13 7 5 4 9 2
38 2022.03.13 8 9 9 8 6
63 2022.03.13 6 1 7 1 5
--- 2022.03.14 ---
date value1 value2 value3 value4 value5
14 2022.03.14 6 7 0 2 0
39 2022.03.14 6 4 9 8 4
64 2022.03.14 5 4 6 3 1
--- 2022.03.15 ---
date value1 value2 value3 value4 value5
15 2022.03.15 2 1 0 3 7
40 2022.03.15 0 8 7 0 8
65 2022.03.15 4 1 6 8 3
--- 2022.03.16 ---
date value1 value2 value3 value4 value5
16 2022.03.16 9 3 9 5 8
41 2022.03.16 4 4 4 1 4
66 2022.03.16 5 5 4 3 7
--- 2022.03.17 ---
date value1 value2 value3 value4 value5
17 2022.03.17 9 8 5 1 6
42 2022.03.17 8 1 1 1 9
67 2022.03.17 3 0 9 7 2
--- 2022.03.18 ---
date value1 value2 value3 value4 value5
18 2022.03.18 8 2 8 1 9
43 2022.03.18 2 1 6 9 6
68 2022.03.18 5 9 6 8 3
--- 2022.03.19 ---
date value1 value2 value3 value4 value5
19 2022.03.19 9 6 2 1 2
44 2022.03.19 3 1 1 2 7
69 2022.03.19 2 2 6 1 1
--- 2022.03.20 ---
date value1 value2 value3 value4 value5
20 2022.03.20 3 1 0 8 3
45 2022.03.20 8 9 8 9 1
70 2022.03.20 1 3 6 6 1
--- 2022.03.21 ---
date value1 value2 value3 value4 value5
21 2022.03.21 7 9 6 2 6
46 2022.03.21 1 6 5 0 2
71 2022.03.21 1 6 7 5 7
--- 2022.03.22 ---
date value1 value2 value3 value4 value5
22 2022.03.22 8 3 9 9 9
47 2022.03.22 5 4 1 2 4
72 2022.03.22 6 1 4 9 1
--- 2022.03.23 ---
date value1 value2 value3 value4 value5
23 2022.03.23 6 4 7 0 7
48 2022.03.23 2 8 9 3 0
73 2022.03.23 6 3 5 2 6
--- 2022.03.24 ---
date value1 value2 value3 value4 value5
24 2022.03.24 4 1 1 0 0
49 2022.03.24 9 5 0 3 8
74 2022.03.24 4 8 0 1 2
time: 0.72 seconds
EDIT:
I tested with 1_000 groups:
df = pd.DataFrame({
    'date': [f'2022.03.{i:04}' for _ in range(1_000) for i in range(1_000)],
    'value1': random.choices(range(10), k=1_000_000),
    'value2': random.choices(range(10), k=1_000_000),
    'value3': random.choices(range(10), k=1_000_000),
    'value4': random.choices(range(10), k=1_000_000),
    'value5': random.choices(range(10), k=1_000_000),
})
and the standard functions .sum(), .max(), etc. need a similar time: less than 1 second. The for-loop needed 14 seconds because it has to print() many more values. Without print() it also needs less than 1 second.
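For reference, a minimal sketch of timing the loop without the print() calls (the loop body here is just a placeholder for the real per-group work):

start = time.time()
for date, grp in df.groupby('date'):
    pass  # placeholder - put the real per-group work here, without print()
end = time.time()
print(f'time: {end-start:.2f} seconds')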