print(df.sample(50)):
match_datetime country league home_team away_team home_odds draw_odds away_odds run_time home_score away_score
72170 2021-10-17 12:30:00 Ukraine Persha Liga Alliance Uzhhorod 1.22 5.62 9.71 2021-10-17 09:22:20.212731 NaN NaN
100398 2021-11-02 14:35:00 Saudi Arabia Division 1 Al Qadisiya Bisha 1.61 3.61 4.94 2021-11-02 09:13:18.768604 2.0 1.0
33929 2021-09-11 23:00:00 Panama LPF Veraguas Plaza Amador 2.75 2.75 2.71 2021-09-10 23:47:54.682982 1.0 1.0
12328 2021-08-15 15:30:00 Poland Ekstraklasa Slask Wroclaw Leczna 1.74 3.74 4.59 2021-08-14 22:44:26.136608 0.0 0.0
81500 2021-10-24 13:00:00 Italy Serie D - Group A Caronnese Saluzzo 1.69 3.60 4.28 2021-10-23 13:37:16.920175 2.0 2.0
143370 2021-12-05 14:00:00 Poland Division 1 Chrobry Glogow Widzew Lodz 3.36 3.17 2.15 2021-11-30 17:40:24.833519 0.0 0.0
175061 2022-01-08 18:00:00 Spain Primera RFEF - Group 1 R. Union Extremadura UD 1.26 4.40 18.00 2022-01-08 17:00:46.662761 0.0 1.0
21293 2021-08-29 16:00:00 Italy Serie B Cittadella Crotone 2.32 3.11 3.31 2021-08-26 18:04:46.221393 4.0 2.0
97427 2021-11-01 17:00:00 Israel Leumit League M. Nazareth Beitar Tel Aviv 1.92 3.26 3.75 2021-10-30 09:40:08.966330 4.0 2.0
177665 2022-01-13 12:30:00 Egypt Division 2 - Group C Said El Mahalla Al Magd 4.12 3.08 1.94 2022-01-12 17:53:33.570126 0.0 0.0
69451 2021-10-17 05:00:00 South Korea K League 1 Gangwon Gwangju FC 2.06 3.38 3.65 2021-10-15 09:55:54.578112 NaN NaN
4742 2021-08-10 20:30:00 Peru Liga 2 Deportivo Coopsol Grau 3.14 3.49 2.06 2021-08-10 18:14:01.996860 0.0 2.0
22266 2021-08-29 13:00:00 France Ligue 1 Angers Rennes 2.93 3.27 2.56 2021-08-27 12:26:34.904374 2.0 0.0
46412 2021-09-26 04:00:00 Japan J2 League Okayama Blaublitz 2.24 2.90 3.63 2021-09-23 09:08:26.979783 1.0 1.0
133207 2021-11-27 21:15:00 Bolivia Division Profesional Palmaflor Blooming 1.51 4.05 5.10 2021-11-25 18:22:28.275844 3.0 0.0
140825 2021-11-28 11:00:00 Spain Tercera RFEF - Group 6 Valencia B Torrellano 1.58 3.56 5.26 2021-11-28 19:54:40.066637 2.0 0.0
226985 2022-03-04 00:30:00 Argentina Copa de la Liga Profesional Central Cordoba Rosario Central 2.36 3.26 2.86 2022-03-02 17:23:10.014424 0.0 1.0
137226 2021-11-28 12:45:00 Greece Super League 2 Apollon Pontou PAOK B 3.37 3.25 2.01 2021-11-27 15:13:05.937815 0.0 3.0
182756 2022-01-22 10:30:00 Turkey 1. Lig Umraniyespor Menemenspor 1.40 4.39 7.07 2022-01-19 17:25:27.128331 2.0 1.0
89895 2021-10-28 16:45:00 Netherlands KNVB Beker Ajax Cambuur 9.10 5.55 1.26 2021-10-27 07:46:56.253996 0.0 5.0
227595 2022-03-06 17:00:00 Israel Ligat ha'Al Ashdod Maccabi Petah Tikva 2.30 3.21 3.05 2022-03-02 17:23:10.014424 NaN NaN
57568 2021-10-02 13:00:00 Estonia Meistriliiga Kalju Legion 1.58 4.10 4.84 2021-10-02 10:55:35.287359 2.0 2.0
227035 2022-03-04 19:00:00 Denmark Superliga FC Copenhagen Randers FC 1.70 3.84 5.06 2022-03-02 17:23:10.014424 NaN NaN
108668 2021-11-07 13:30:00 Germany Oberliga Mittelrhein Duren Freialdenhoven 1.35 5.20 6.35 2021-11-06 17:37:37.629603 2.0 0.0
86270 2021-10-25 18:00:00 Belgium Pro League U21 Lommel SK U21 Lierse K. U21 3.23 3.84 1.92 2021-10-26 01:22:31.111441 0.0 0.0
89437 2021-11-01 02:10:00 Colombia Primera A America De Cali Petrolera 1.86 2.92 4.60 2021-10-27 07:41:24.427246 NaN NaN
13986 2021-08-21 13:00:00 France Ligue 2 Dijon Toulouse 3.92 3.51 1.94 2021-08-16 13:22:02.749887 2.0 4.0
105179 2021-11-06 15:00:00 England NPL Premier Division Atherton South Shields 3.90 3.42 1.82 2021-11-05 10:01:28.567328 1.0 1.0
142821 2021-12-01 12:30:00 Bulgaria Vtora liga Marek Septemvri Simitli 1.79 3.38 4.35 2021-11-30 17:40:24.833519 2.0 2.0
45866 2021-09-24 00:30:00 Venezuela Primera Division Dep. Tachira Portuguesa 1.96 3.60 3.22 2021-09-23 09:08:26.979783 4.0 1.0
76100 2021-10-22 16:30:00 Denmark 1st Division Hvidovre IF Koge 1.91 3.56 3.81 2021-10-21 08:43:12.445245 NaN NaN
115896 2021-11-14 16:00:00 Spain Tercera RFEF - Group 6 Olimpic Xativa Torrellano 2.78 2.89 2.39 2021-11-13 12:21:45.955738 1.0 0.0
156159 2021-12-12 16:00:00 Spain Segunda RFEF - Group 1 Marino de Luanco Coruxo FC 2.19 3.27 3.07 2021-12-10 09:26:45.001977 0.0 0.0
18240 2021-08-21 12:00:00 Germany Regionalliga West Rodinghausen Fortuna Koln 3.25 3.60 2.00 2021-08-21 03:30:43.193978 NaN NaN
184913 2022-01-22 10:00:00 World Club Friendly Zilina B Trinec 3.56 4.14 1.78 2022-01-22 16:44:32.650325 0.0 3.0
16782 2021-08-22 23:05:00 Colombia Primera A Petrolera Dep. Cali 3.01 3.00 2.44 2021-08-19 18:24:24.966505 2.0 3.0
63847 2021-10-10 09:30:00 Spain Tercera RFEF - Group 7 Carabanchel RSD Alcala 4.39 3.42 1.75 2021-10-09 12:03:50.720013 NaN NaN
7254 2021-08-12 16:45:00 Europe Europa Conference League Hammarby Cukaricki 1.72 3.87 4.13 2021-08-11 23:48:31.958394 NaN NaN
82727 2021-10-24 14:00:00 Lithuania I Lyga Zalgiris 2 Neptunas 1.76 3.78 3.35 2021-10-24 12:02:06.306279 1.0 3.0
43074 2021-09-22 18:00:00 Ukraine Super Cup Shakhtar Donetsk Dyn. Kyiv 2.57 3.49 2.59 2021-09-19 09:39:56.624504 NaN NaN
65187 2021-10-11 18:45:00 World World Cup Norway Montenegro 1.56 4.17 6.28 2021-10-11 10:56:09.973470 NaN NaN
120993 2021-11-18 00:00:00 USA NISA Maryland Bobcats California Utd. 2.76 3.23 2.39 2021-11-17 20:36:26.562731 1.0 1.0
201469 2022-02-12 15:00:00 England League One AFC Wimbledon Sunderland 3.30 3.48 2.17 2022-02-10 17:47:36.501159 1.0 1.0
142180 2021-12-01 19:45:00 Scotland Premiership St. Mirren Ross County 2.06 3.25 3.85 2021-11-29 18:28:22.249662 0.0 0.0
4681 2021-08-10 18:30:00 Europe Champions League Young Boys CFR Cluj 1.48 4.29 6.92 2021-08-10 18:14:01.996860 3.0 1.0
67321 2021-10-17 13:00:00 Spain LaLiga Rayo Vallecano Elche 1.78 3.64 4.99 2021-10-13 11:22:34.979378 NaN NaN
27499 2021-09-04 14:00:00 Iceland Inkasso-deildin Kordrengir Fjolnir 2.18 3.66 2.82 2021-09-02 23:28:49.414126 1.0 4.0
48962 2021-09-25 21:00:00 Mexico Liga Premier Serie B Uruapan Lobos Huerta 1.83 3.69 3.70 2021-09-25 13:02:58.238466 NaN NaN
65636 2021-10-16 17:00:00 Switzerland Super League Young Boys Luzern 1.26 6.04 9.43 2021-10-11 10:56:09.973470 NaN NaN
17333 2021-08-21 14:00:00 Finland Kakkonen Group A Atlantis Kiffen 1.57 4.29 4.42 2021-08-20 12:41:03.159846 1.0 1.0
I am trying to get the latest 2 match_datetime
values for every run_time
and then filter(join) df to get all the relevant values as below:
df['match_datetime'] = pd.to_datetime(df['match_datetime'])
s = (df['match_datetime'].dt.normalize()
.groupby([df['run_time']])
.value_counts()
.groupby(level=0)
.head(2))
print(s)
run_time match_datetime
2021-08-07 00:04:36.326391 2021-08-07 255
2021-08-06 188
2021-08-07 10:50:34.574040 2021-08-07 649
2021-08-08 277
2021-08-07 16:56:22.322338 2021-08-07 712
This returns a series while I want a DataFrame so I can merge.
To do this:
df_n = df.reset_index().merge(s, how="left",
left_on=["match_datetime", "run_time"],
right_on=["match_datetime", "run_time"])
While I am sure there is a better manner I can write function s
but I am unsure how to do it the correct way.
CodePudding user response:
If I understand correctly, you would like to filter the dataframe to retain, for each run_time
, the last two rows (or up to two rows) by match_datetime
.
Simplified answer
This can be done relatively easily without any join, using GroupBy.tail()
. (Note, my original answer was using GroupBy.rank()
, but this is simpler, although slower):
out = df.sort_values(
['run_time', 'match_datetime']
).groupby('run_time').tail(2)
Minimal example
import numpy as np
np.random.seed(0)
n = 10
rt = np.random.choice(pd.date_range('2022-01-01', periods=n//2, freq='D'), n)
df = pd.DataFrame({
'run_time': rt,
'match_datetime': rt - pd.to_timedelta(np.random.uniform(size=n), unit='days'),
})
df['match_datetime'] = df['match_datetime'].dt.round('h')
Then:
out = df.sort_values(
['run_time', 'match_datetime']
).groupby('run_time').tail(2)
>>> out
run_time match_datetime
9 2022-01-01 2021-12-31 04:00:00
1 2022-01-01 2021-12-31 15:00:00
5 2022-01-02 2022-01-01 02:00:00
7 2022-01-03 2022-01-02 22:00:00
3 2022-01-04 2022-01-03 11:00:00
6 2022-01-04 2022-01-03 22:00:00
0 2022-01-05 2022-01-04 01:00:00
8 2022-01-05 2022-01-05 00:00:00
On the OP's (extended) data
The output is quite verbose. Here is a sample:
>>> out['run_time match_datetime country draw_odds'.split()].head()
run_time match_datetime country draw_odds
4681 2021-08-10 18:14:01.996860 2021-08-10 18:30:00 Europe 4.29
4742 2021-08-10 18:14:01.996860 2021-08-10 20:30:00 Peru 3.49
7254 2021-08-11 23:48:31.958394 2021-08-12 16:45:00 Europe 3.87
12328 2021-08-14 22:44:26.136608 2021-08-15 15:30:00 Poland 3.74
13986 2021-08-16 13:22:02.749887 2021-08-21 13:00:00 France 3.51
Performance
For several millions of rows, the timing difference starts counting, and using rank
is faster. Even faster, you can avoid sorting on run_time
(the result is the same, but the rows are in a different order):
np.random.seed(0)
n = 1_000_000
rt = np.random.choice(pd.date_range('2022-01-01', periods=n//2, freq='min'), n)
df = pd.DataFrame({
'run_time': rt,
'match_datetime': rt - pd.to_timedelta(np.random.uniform(size=n), unit='s'),
})
%timeit df.sort_values(['run_time', 'match_datetime']).groupby('run_time').tail(2)
# 981 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.loc[df.groupby('run_time')['match_datetime'].rank(method='first', ascending=False) <= 2]
# 355 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.sort_values(['match_datetime']).groupby('run_time').tail(2)
# 258 ms ± 846 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Check solutions:
a = df.sort_values(['run_time', 'match_datetime']).groupby('run_time').tail(2)
b = df.loc[df.groupby('run_time')['match_datetime'].rank(method='first', ascending=False) <= 2]
c = df.sort_values(['match_datetime']).groupby('run_time').tail(2)
by = ['run_time', 'match_datetime']
>>> a.sort_values(by).equals(b.sort_values(by))
True
>>> b.sort_values(by).equals(c.sort_values(by))
True