How do I implement rank function for nearest values for a column in dataframe?-CodePudding

df.head():

                      run_time match_datetime         country                league             home_team                  away_team
0   2021-08-07 00:04:36.326391     2021-08-06          Russia       FNL 2 - Group 2             Yenisey 2          Lokomotiv-Kazanka
1   2021-08-07 00:04:36.326391     2021-08-07          Russia          Youth League              Ural U19  Krylya Sovetov Samara U19
2   2021-08-07 00:04:36.326391     2021-08-08           World         Club Friendly                Alaves                    Al Nasr
3   2021-08-07 00:04:36.326391     2021-08-09           China            Jia League     Chengdu Rongcheng          Shenyang Urban FC
4   2021-08-06 00:04:36.326391     2021-08-06           China          Super League              Wuhan FC       Tianjin Jinmen Tiger
5   2021-08-06 00:04:36.326391     2021-08-07  Czech Republic            U19 League     Sigma Olomouc U19                Karvina U19
6   2021-08-06 00:04:36.326391     2021-08-08          Russia          Youth League  Konoplev Academy U19            Rubin Kazan U19
7   2021-08-06 00:04:36.326391     2021-08-09           World         Club Friendly         Real Sociedad                      Eibar

desired df

                      run_time match_datetime         country                league             home_team                  away_team
0   2021-08-07 00:04:36.326391     2021-08-06          Russia       FNL 2 - Group 2             Yenisey 2          Lokomotiv-Kazanka
1   2021-08-07 00:04:36.326391     2021-08-07          Russia          Youth League              Ural U19  Krylya Sovetov Samara U19
4   2021-08-06 00:04:36.326391     2021-08-06           China          Super League              Wuhan FC       Tianjin Jinmen Tiger
5   2021-08-06 00:04:36.326391     2021-08-07  Czech Republic            U19 League     Sigma Olomouc U19                Karvina U19

How do i use rank function to filter only the 2 nearest match_datetime dates for every run_time value. i.e. desired dataframe will be a filtered dataframe that will have all the nearest 2 match_datetime values for every run_time

CodePudding user response：

Update

Using rank instead of head:

diff = pd.to_datetime(df['run_time']).sub(pd.to_datetime(df['match_datetime'])).abs()
out = df.loc[diff.groupby(df['run_time']).rank(method='dense') <= 2]

Output:

>>> out
                     run_time match_datetime         country         league          home_team                  away_team
1  2021-08-07 00:04:36.326391     2021-08-07          Russia   Youth League           Ural U19  Krylya Sovetov Samara U19
2  2021-08-07 00:04:36.326391     2021-08-08           World  Club Friendly             Alaves                    Al Nasr
4  2021-08-06 00:04:36.326391     2021-08-06           China   Super League           Wuhan FC       Tianjin Jinmen Tiger
5  2021-08-06 00:04:36.326391     2021-08-07  Czech Republic     U19 League  Sigma Olomouc U19                Karvina U19

Alternative

You can use:

diff = pd.to_datetime(df['run_time']).sub(pd.to_datetime(df['match_datetime'])) \
                              .abs().sort_values()
out = df.loc[diff.groupby(df['run_time']).head(2).index].sort_index()

CodePudding user response：

I am somehow afraid that the pandas.DataFrame.rank method can't do this. But pandas.DataFrame.groupby can do this, if you use pandas.DataFrame.head with it.

Assuming you have the following pandas.DataFrame:

import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame(np.array([np.random.randint(0, 3, 10), np.random.rand(10)]).transpose(), columns=['a', 'b'])

And that you want to keep max_num_per_example = 2 representatives of each unique values in the column df['a']:

max_num_per_example = 2
df.groupby(['a']).head(max_num_per_example)

yields

	a	b
0	2.0	0.058084
1	0.0	0.866176
2	2.0	0.601115
4	0.0	0.020584
7	1.0	0.212339

This is the same as you would get if you to the naive approach:

max_idx_per_example = 2
idx_to_keep = []
for el_uq in df['a'].unique():
    lg = el_uq == df['a']
    for i, idx in enumerate(lg[lg].index):
        if i < max_idx_per_example:
            idx_to_keep.append(idx)
        else:
            break
df_new = df.iloc[idx_to_keep]

Which underlines the power of pandas =)