df.head():
run_time match_datetime country league home_team away_team
0 2021-08-07 00:04:36.326391 2021-08-06 Russia FNL 2 - Group 2 Yenisey 2 Lokomotiv-Kazanka
1 2021-08-07 00:04:36.326391 2021-08-07 Russia Youth League Ural U19 Krylya Sovetov Samara U19
2 2021-08-07 00:04:36.326391 2021-08-08 World Club Friendly Alaves Al Nasr
3 2021-08-07 00:04:36.326391 2021-08-09 China Jia League Chengdu Rongcheng Shenyang Urban FC
4 2021-08-06 00:04:36.326391 2021-08-06 China Super League Wuhan FC Tianjin Jinmen Tiger
5 2021-08-06 00:04:36.326391 2021-08-07 Czech Republic U19 League Sigma Olomouc U19 Karvina U19
6 2021-08-06 00:04:36.326391 2021-08-08 Russia Youth League Konoplev Academy U19 Rubin Kazan U19
7 2021-08-06 00:04:36.326391 2021-08-09 World Club Friendly Real Sociedad Eibar
desired df
run_time match_datetime country league home_team away_team
0 2021-08-07 00:04:36.326391 2021-08-06 Russia FNL 2 - Group 2 Yenisey 2 Lokomotiv-Kazanka
1 2021-08-07 00:04:36.326391 2021-08-07 Russia Youth League Ural U19 Krylya Sovetov Samara U19
4 2021-08-06 00:04:36.326391 2021-08-06 China Super League Wuhan FC Tianjin Jinmen Tiger
5 2021-08-06 00:04:36.326391 2021-08-07 Czech Republic U19 League Sigma Olomouc U19 Karvina U19
How do i use rank
function to filter only the 2 nearest match_datetime
dates for every run_time
value.
i.e. desired dataframe will be a filtered dataframe that will have all the nearest 2 match_datetime
values for every run_time
CodePudding user response:
Update
Using rank
instead of head
:
diff = pd.to_datetime(df['run_time']).sub(pd.to_datetime(df['match_datetime'])).abs()
out = df.loc[diff.groupby(df['run_time']).rank(method='dense') <= 2]
Output:
>>> out
run_time match_datetime country league home_team away_team
1 2021-08-07 00:04:36.326391 2021-08-07 Russia Youth League Ural U19 Krylya Sovetov Samara U19
2 2021-08-07 00:04:36.326391 2021-08-08 World Club Friendly Alaves Al Nasr
4 2021-08-06 00:04:36.326391 2021-08-06 China Super League Wuhan FC Tianjin Jinmen Tiger
5 2021-08-06 00:04:36.326391 2021-08-07 Czech Republic U19 League Sigma Olomouc U19 Karvina U19
Alternative
You can use:
diff = pd.to_datetime(df['run_time']).sub(pd.to_datetime(df['match_datetime'])) \
.abs().sort_values()
out = df.loc[diff.groupby(df['run_time']).head(2).index].sort_index()
CodePudding user response:
I am somehow afraid that the pandas.DataFrame.rank
method can't do this. But pandas.DataFrame.groupby
can do this, if you use pandas.DataFrame.head
with it.
Assuming you have the following pandas.DataFrame
:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.array([np.random.randint(0, 3, 10), np.random.rand(10)]).transpose(), columns=['a', 'b'])
And that you want to keep max_num_per_example = 2
representatives of each unique values in the column df['a']
:
max_num_per_example = 2
df.groupby(['a']).head(max_num_per_example)
yields
a | b | |
---|---|---|
0 | 2.0 | 0.058084 |
1 | 0.0 | 0.866176 |
2 | 2.0 | 0.601115 |
4 | 0.0 | 0.020584 |
7 | 1.0 | 0.212339 |
This is the same as you would get if you to the naive approach:
max_idx_per_example = 2
idx_to_keep = []
for el_uq in df['a'].unique():
lg = el_uq == df['a']
for i, idx in enumerate(lg[lg].index):
if i < max_idx_per_example:
idx_to_keep.append(idx)
else:
break
df_new = df.iloc[idx_to_keep]
Which underlines the power of pandas
=)