Home > OS >  How to groupby in pandas and return dataframe instead of series?
How to groupby in pandas and return dataframe instead of series?

Time:03-05

print(df.sample(50)):

match_datetime       country                       league          home_team            away_team  home_odds  draw_odds  away_odds                    run_time  home_score  away_score
72170  2021-10-17 12:30:00       Ukraine                  Persha Liga           Alliance             Uzhhorod       1.22       5.62       9.71  2021-10-17 09:22:20.212731         NaN         NaN
100398 2021-11-02 14:35:00  Saudi Arabia                   Division 1        Al Qadisiya                Bisha       1.61       3.61       4.94  2021-11-02 09:13:18.768604         2.0         1.0
33929  2021-09-11 23:00:00        Panama                          LPF           Veraguas         Plaza Amador       2.75       2.75       2.71  2021-09-10 23:47:54.682982         1.0         1.0
12328  2021-08-15 15:30:00        Poland                  Ekstraklasa      Slask Wroclaw               Leczna       1.74       3.74       4.59  2021-08-14 22:44:26.136608         0.0         0.0
81500  2021-10-24 13:00:00         Italy            Serie D - Group A          Caronnese              Saluzzo       1.69       3.60       4.28  2021-10-23 13:37:16.920175         2.0         2.0
143370 2021-12-05 14:00:00        Poland                   Division 1     Chrobry Glogow          Widzew Lodz       3.36       3.17       2.15  2021-11-30 17:40:24.833519         0.0         0.0
175061 2022-01-08 18:00:00         Spain       Primera RFEF - Group 1           R. Union       Extremadura UD       1.26       4.40      18.00  2022-01-08 17:00:46.662761         0.0         1.0
21293  2021-08-29 16:00:00         Italy                      Serie B         Cittadella              Crotone       2.32       3.11       3.31  2021-08-26 18:04:46.221393         4.0         2.0
97427  2021-11-01 17:00:00        Israel                Leumit League        M. Nazareth      Beitar Tel Aviv       1.92       3.26       3.75  2021-10-30 09:40:08.966330         4.0         2.0
177665 2022-01-13 12:30:00         Egypt         Division 2 - Group C    Said El Mahalla              Al Magd       4.12       3.08       1.94  2022-01-12 17:53:33.570126         0.0         0.0
69451  2021-10-17 05:00:00   South Korea                   K League 1            Gangwon           Gwangju FC       2.06       3.38       3.65  2021-10-15 09:55:54.578112         NaN         NaN
4742   2021-08-10 20:30:00          Peru                       Liga 2  Deportivo Coopsol                 Grau       3.14       3.49       2.06  2021-08-10 18:14:01.996860         0.0         2.0
22266  2021-08-29 13:00:00        France                      Ligue 1             Angers               Rennes       2.93       3.27       2.56  2021-08-27 12:26:34.904374         2.0         0.0
46412  2021-09-26 04:00:00         Japan                    J2 League            Okayama            Blaublitz       2.24       2.90       3.63  2021-09-23 09:08:26.979783         1.0         1.0
133207 2021-11-27 21:15:00       Bolivia         Division Profesional          Palmaflor             Blooming       1.51       4.05       5.10  2021-11-25 18:22:28.275844         3.0         0.0
140825 2021-11-28 11:00:00         Spain       Tercera RFEF - Group 6         Valencia B           Torrellano       1.58       3.56       5.26  2021-11-28 19:54:40.066637         2.0         0.0
226985 2022-03-04 00:30:00     Argentina  Copa de la Liga Profesional    Central Cordoba      Rosario Central       2.36       3.26       2.86  2022-03-02 17:23:10.014424         0.0         1.0
137226 2021-11-28 12:45:00        Greece               Super League 2     Apollon Pontou               PAOK B       3.37       3.25       2.01  2021-11-27 15:13:05.937815         0.0         3.0
182756 2022-01-22 10:30:00        Turkey                       1. Lig       Umraniyespor          Menemenspor       1.40       4.39       7.07  2022-01-19 17:25:27.128331         2.0         1.0
89895  2021-10-28 16:45:00   Netherlands                   KNVB Beker               Ajax              Cambuur       9.10       5.55       1.26  2021-10-27 07:46:56.253996         0.0         5.0
227595 2022-03-06 17:00:00        Israel                  Ligat ha'Al             Ashdod  Maccabi Petah Tikva       2.30       3.21       3.05  2022-03-02 17:23:10.014424         NaN         NaN
57568  2021-10-02 13:00:00       Estonia                 Meistriliiga              Kalju               Legion       1.58       4.10       4.84  2021-10-02 10:55:35.287359         2.0         2.0
227035 2022-03-04 19:00:00       Denmark                    Superliga      FC Copenhagen           Randers FC       1.70       3.84       5.06  2022-03-02 17:23:10.014424         NaN         NaN
108668 2021-11-07 13:30:00       Germany         Oberliga Mittelrhein              Duren       Freialdenhoven       1.35       5.20       6.35  2021-11-06 17:37:37.629603         2.0         0.0
86270  2021-10-25 18:00:00       Belgium               Pro League U21      Lommel SK U21        Lierse K. U21       3.23       3.84       1.92  2021-10-26 01:22:31.111441         0.0         0.0
89437  2021-11-01 02:10:00      Colombia                    Primera A    America De Cali            Petrolera       1.86       2.92       4.60  2021-10-27 07:41:24.427246         NaN         NaN
13986  2021-08-21 13:00:00        France                      Ligue 2              Dijon             Toulouse       3.92       3.51       1.94  2021-08-16 13:22:02.749887         2.0         4.0
105179 2021-11-06 15:00:00       England         NPL Premier Division           Atherton        South Shields       3.90       3.42       1.82  2021-11-05 10:01:28.567328         1.0         1.0
142821 2021-12-01 12:30:00      Bulgaria                   Vtora liga              Marek    Septemvri Simitli       1.79       3.38       4.35  2021-11-30 17:40:24.833519         2.0         2.0
45866  2021-09-24 00:30:00     Venezuela             Primera Division       Dep. Tachira           Portuguesa       1.96       3.60       3.22  2021-09-23 09:08:26.979783         4.0         1.0
76100  2021-10-22 16:30:00       Denmark                 1st Division        Hvidovre IF                 Koge       1.91       3.56       3.81  2021-10-21 08:43:12.445245         NaN         NaN
115896 2021-11-14 16:00:00         Spain       Tercera RFEF - Group 6     Olimpic Xativa           Torrellano       2.78       2.89       2.39  2021-11-13 12:21:45.955738         1.0         0.0
156159 2021-12-12 16:00:00         Spain       Segunda RFEF - Group 1   Marino de Luanco            Coruxo FC       2.19       3.27       3.07  2021-12-10 09:26:45.001977         0.0         0.0
18240  2021-08-21 12:00:00       Germany            Regionalliga West       Rodinghausen         Fortuna Koln       3.25       3.60       2.00  2021-08-21 03:30:43.193978         NaN         NaN
184913 2022-01-22 10:00:00         World                Club Friendly           Zilina B               Trinec       3.56       4.14       1.78  2022-01-22 16:44:32.650325         0.0         3.0
16782  2021-08-22 23:05:00      Colombia                    Primera A          Petrolera            Dep. Cali       3.01       3.00       2.44  2021-08-19 18:24:24.966505         2.0         3.0
63847  2021-10-10 09:30:00         Spain       Tercera RFEF - Group 7        Carabanchel           RSD Alcala       4.39       3.42       1.75  2021-10-09 12:03:50.720013         NaN         NaN
7254   2021-08-12 16:45:00        Europe     Europa Conference League           Hammarby            Cukaricki       1.72       3.87       4.13  2021-08-11 23:48:31.958394         NaN         NaN
82727  2021-10-24 14:00:00     Lithuania                       I Lyga         Zalgiris 2             Neptunas       1.76       3.78       3.35  2021-10-24 12:02:06.306279         1.0         3.0
43074  2021-09-22 18:00:00       Ukraine                    Super Cup   Shakhtar Donetsk            Dyn. Kyiv       2.57       3.49       2.59  2021-09-19 09:39:56.624504         NaN         NaN
65187  2021-10-11 18:45:00         World                    World Cup             Norway           Montenegro       1.56       4.17       6.28  2021-10-11 10:56:09.973470         NaN         NaN
120993 2021-11-18 00:00:00           USA                         NISA   Maryland Bobcats      California Utd.       2.76       3.23       2.39  2021-11-17 20:36:26.562731         1.0         1.0
201469 2022-02-12 15:00:00       England                   League One      AFC Wimbledon           Sunderland       3.30       3.48       2.17  2022-02-10 17:47:36.501159         1.0         1.0
142180 2021-12-01 19:45:00      Scotland                  Premiership         St. Mirren          Ross County       2.06       3.25       3.85  2021-11-29 18:28:22.249662         0.0         0.0
4681   2021-08-10 18:30:00        Europe             Champions League         Young Boys             CFR Cluj       1.48       4.29       6.92  2021-08-10 18:14:01.996860         3.0         1.0
67321  2021-10-17 13:00:00         Spain                       LaLiga     Rayo Vallecano                Elche       1.78       3.64       4.99  2021-10-13 11:22:34.979378         NaN         NaN
27499  2021-09-04 14:00:00       Iceland              Inkasso-deildin         Kordrengir              Fjolnir       2.18       3.66       2.82  2021-09-02 23:28:49.414126         1.0         4.0
48962  2021-09-25 21:00:00        Mexico         Liga Premier Serie B            Uruapan         Lobos Huerta       1.83       3.69       3.70  2021-09-25 13:02:58.238466         NaN         NaN
65636  2021-10-16 17:00:00   Switzerland                 Super League         Young Boys               Luzern       1.26       6.04       9.43  2021-10-11 10:56:09.973470         NaN         NaN
17333  2021-08-21 14:00:00       Finland             Kakkonen Group A           Atlantis               Kiffen       1.57       4.29       4.42  2021-08-20 12:41:03.159846         1.0         1.0
        

I am trying to get the latest 2 match_datetime values for every run_time and then filter(join) df to get all the relevant values as below:

df['match_datetime'] = pd.to_datetime(df['match_datetime'])

s = (df['match_datetime'].dt.normalize()
     .groupby([df['run_time']])
     .value_counts()
     .groupby(level=0)
     .head(2))
print(s)
run_time                    match_datetime
2021-08-07 00:04:36.326391  2021-08-07        255
                            2021-08-06        188
2021-08-07 10:50:34.574040  2021-08-07        649
                            2021-08-08        277
2021-08-07 16:56:22.322338  2021-08-07        712

This returns a series while I want a DataFrame so I can merge.

To do this:

df_n = df.reset_index().merge(s, how="left",
                              left_on=["match_datetime", "run_time"],
                              right_on=["match_datetime", "run_time"])

While I am sure there is a better manner I can write function s but I am unsure how to do it the correct way.

CodePudding user response:

If I understand correctly, you would like to filter the dataframe to retain, for each run_time, the last two rows (or up to two rows) by match_datetime.

Simplified answer

This can be done relatively easily without any join, using GroupBy.tail(). (Note, my original answer was using GroupBy.rank(), but this is simpler, although slower):

out = df.sort_values(
    ['run_time', 'match_datetime']
).groupby('run_time').tail(2)

Minimal example

import numpy as np

np.random.seed(0)
n = 10
rt = np.random.choice(pd.date_range('2022-01-01', periods=n//2, freq='D'), n)
df = pd.DataFrame({
    'run_time': rt,
    'match_datetime': rt - pd.to_timedelta(np.random.uniform(size=n), unit='days'),
})
df['match_datetime'] = df['match_datetime'].dt.round('h')

Then:

out = df.sort_values(
    ['run_time', 'match_datetime']
).groupby('run_time').tail(2)

>>> out
    run_time      match_datetime
9 2022-01-01 2021-12-31 04:00:00
1 2022-01-01 2021-12-31 15:00:00
5 2022-01-02 2022-01-01 02:00:00
7 2022-01-03 2022-01-02 22:00:00
3 2022-01-04 2022-01-03 11:00:00
6 2022-01-04 2022-01-03 22:00:00
0 2022-01-05 2022-01-04 01:00:00
8 2022-01-05 2022-01-05 00:00:00

On the OP's (extended) data

The output is quite verbose. Here is a sample:

>>> out['run_time match_datetime country draw_odds'.split()].head()
                        run_time      match_datetime country  draw_odds
4681  2021-08-10 18:14:01.996860 2021-08-10 18:30:00  Europe       4.29
4742  2021-08-10 18:14:01.996860 2021-08-10 20:30:00    Peru       3.49
7254  2021-08-11 23:48:31.958394 2021-08-12 16:45:00  Europe       3.87
12328 2021-08-14 22:44:26.136608 2021-08-15 15:30:00  Poland       3.74
13986 2021-08-16 13:22:02.749887 2021-08-21 13:00:00  France       3.51

Performance

For several millions of rows, the timing difference starts counting, and using rank is faster. Even faster, you can avoid sorting on run_time (the result is the same, but the rows are in a different order):

np.random.seed(0)
n = 1_000_000
rt = np.random.choice(pd.date_range('2022-01-01', periods=n//2, freq='min'), n)
df = pd.DataFrame({
    'run_time': rt,
    'match_datetime': rt - pd.to_timedelta(np.random.uniform(size=n), unit='s'),
})


%timeit df.sort_values(['run_time', 'match_datetime']).groupby('run_time').tail(2)
# 981 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


%timeit df.loc[df.groupby('run_time')['match_datetime'].rank(method='first', ascending=False) <= 2]
# 355 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


%timeit df.sort_values(['match_datetime']).groupby('run_time').tail(2)
# 258 ms ± 846 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Check solutions:

a = df.sort_values(['run_time', 'match_datetime']).groupby('run_time').tail(2)
b = df.loc[df.groupby('run_time')['match_datetime'].rank(method='first', ascending=False) <= 2]
c = df.sort_values(['match_datetime']).groupby('run_time').tail(2)

by = ['run_time', 'match_datetime']

>>> a.sort_values(by).equals(b.sort_values(by))
True

>>> b.sort_values(by).equals(c.sort_values(by))
True
  • Related