df.head()
index match_datetime country league home_team away_team home_odds draw_odds away_odds predicted_home_score predicted_away_score predicted_btts predicted_class_btts predicted_bttso2.5 predicted_class_bttso2.5 predicted_bttsu2.5 predicted_class_bttsu2.5 predicted_o2.5 predicted_class_o2.5 predicted_u2.5 predicted_class_u2.5 predicted_o1.5 predicted_class_o1.5 predicted_u1.5 predicted_class_u1.5 run_time predicted_o4 predicted_class_o4 predicted_u4 predicted_class_u4 predicted_draw predicted_class_draw predicted_total_score predicted_score_difference Win DNB O 1.5 U 4.5 datetime home_score away_score selection predicted_result result total_score score_difference result_match selection_match
0 0 2021-08-06 06:00:00 Russia FNL 2 - Group 2 Yenisey 2 Lokomotiv-Kazanka 2.61 3.51 2.32 1.45 1.73 0.66 1 0.56 0 0.11 0 0.62 1 0.33 0 0.76 1 0.24 0 2021-08-07 00:04:36.326391 NaN NaN NaN NaN NaN NaN 3.18 0.28 3.98 2.06 4.06 1.80 2021-08-06 06:00:00 2.0 1.0 N Away Home 3.0 1.0 No Match No Sel.
1 1 2021-08-06 07:00:00 Russia Youth League Ural U19 Krylya Sovetov Samara U19 2.10 3.68 2.98 1.25 1.33 0.44 0 0.40 0 0.04 0 0.56 1 0.47 0 0.60 1 0.35 0 2021-08-07 00:04:36.326391 NaN NaN NaN NaN NaN NaN 2.58 0.08 2.46 2.46 4.59 2.40 2021-08-06 07:00:00 1.0 1.0 N Away Draw 2.0 0.0 No Match No Sel.
2 2 2021-08-06 08:00:00 World Club Friendly Alaves Al Nasr 1.94 3.63 3.42 1.84 1.39 0.68 1 0.42 1 0.19 0 0.64 1 0.42 0 0.66 1 0.34 0 2021-08-07 00:04:36.326391 NaN NaN NaN NaN NaN NaN 3.23 0.45 3.69 3.69 6.50 2.06 NaN NaN NaN N Home NaN NaN NaN No Result No Sel.
3 3 2021-08-06 08:30:00 China Jia League Chengdu Rongcheng Shenyang Urban FC 1.16 6.87 15.29 3.07 0.52 0.50 0 0.37 0 0.04 0 0.81 1 0.26 0 0.84 1 0.10 0 2021-08-07 00:04:36.326391 NaN NaN NaN NaN NaN NaN 3.59 2.55 2.42 1.85 4.80 2.20 2021-08-06 08:30:00 2.0 0.0 W Home Home 2.0 2.0 Match Match
4 4 2021-08-06 08:30:00 China Super League Wuhan FC Tianjin Jinmen Tiger 1.56 3.69 6.71 1.90 0.68 0.50 0 0.40 0 0.21 0 0.51 0 0.58 1 0.63 1 0.37 0 2021-08-07 00:04:36.326391 NaN NaN NaN NaN NaN NaN 2.58 1.22 2.08 1.50 3.20 2.30 2021-08-06 08:30:00 1.0 1.0 N Home Draw 2.0 0.0 No Match No Sel.
import pandas as pd
df['run_time'] = pd.to_datetime(df['run_time'])
df['match_date'] = pd.to_datetime(df['match_datetime']).dt.date
# Keeping the top 2 dates for every runtime to have a smaller dataframe
df_tt = df[['run_time', 'match_date']]
df_tt = df_tt.drop_duplicates()
df_tt = df_tt.sort_values(['run_time', 'match_date'], ascending=[True, True])
df_tt = df_tt.loc[df_tt.groupby(df['run_time']).head(2).index].sort_index()
df_tt.head()
run_time match_date
0 2021-08-07 00:04:36.326391 2021-08-06
188 2021-08-07 00:04:36.326391 2021-08-05
758 2021-08-07 10:50:34.574040 2021-08-07
1407 2021-08-07 10:50:34.574040 2021-08-06
1884 2021-08-07 16:56:22.322338 2021-08-07
I want the resulting dataframe top be a filtered dataframe consisting of the values of df_tt
df_tab = df.reset_index().merge(df_tt, how="left", left_on=['run_time', 'match_date'],
right_on=['run_time', 'match_date']).set_index('index')
I am unsure if I am writing the merge function correctly however I am not getting the dataframe I want rather I am getting the original dataframe df
as it is.
df_tab.head()
level_0 match_datetime country league home_team away_team home_odds draw_odds away_odds predicted_home_score predicted_away_score predicted_btts predicted_class_btts predicted_bttso2.5 predicted_class_bttso2.5 predicted_bttsu2.5 predicted_class_bttsu2.5 predicted_o2.5 predicted_class_o2.5 predicted_u2.5 predicted_class_u2.5 predicted_o1.5 predicted_class_o1.5 predicted_u1.5 predicted_class_u1.5 run_time predicted_o4 predicted_class_o4 predicted_u4 predicted_class_u4 predicted_draw predicted_class_draw predicted_total_score predicted_score_difference Win DNB O 1.5 U 4.5 datetime home_score away_score selection predicted_result result total_score score_difference result_match selection_match match_date
index
0 0 2021-08-06 06:00:00 Russia FNL 2 - Group 2 Yenisey 2 Lokomotiv-Kazanka 2.61 3.51 2.32 1.45 1.73 0.66 1 0.56 0 0.11 0 0.62 1 0.33 0 0.76 1 0.24 0 2021-08-07 00:04:36.326391 NaN NaN NaN NaN NaN NaN 3.18 0.28 3.98 2.06 4.06 1.80 2021-08-06 06:00:00 2.0 1.0 N Away Home 3.0 1.0 No Match No Sel. 2021-08-06
1 1 2021-08-06 07:00:00 Russia Youth League Ural U19 Krylya Sovetov Samara U19 2.10 3.68 2.98 1.25 1.33 0.44 0 0.40 0 0.04 0 0.56 1 0.47 0 0.60 1 0.35 0 2021-08-07 00:04:36.326391 NaN NaN NaN NaN NaN NaN 2.58 0.08 2.46 2.46 4.59 2.40 2021-08-06 07:00:00 1.0 1.0 N Away Draw 2.0 0.0 No Match No Sel. 2021-08-06
2 2 2021-08-06 08:00:00 World Club Friendly Alaves Al Nasr 1.94 3.63 3.42 1.84 1.39 0.68 1 0.42 1 0.19 0 0.64 1 0.42 0 0.66 1 0.34 0 2021-08-07 00:04:36.326391 NaN NaN NaN NaN NaN NaN 3.23 0.45 3.69 3.69 6.50 2.06 NaN NaN NaN N Home NaN NaN NaN No Result No Sel. 2021-08-06
3 3 2021-08-06 08:30:00 China Jia League Chengdu Rongcheng Shenyang Urban FC 1.16 6.87 15.29 3.07 0.52 0.50 0 0.37 0 0.04 0 0.81 1 0.26 0 0.84 1 0.10 0 2021-08-07 00:04:36.326391 NaN NaN NaN NaN NaN NaN 3.59 2.55 2.42 1.85 4.80 2.20 2021-08-06 08:30:00 2.0 0.0 W Home Home 2.0 2.0 Match Match 2021-08-06
4 4 2021-08-06 08:30:00 China Super League Wuhan FC Tianjin Jinmen Tiger 1.56 3.69 6.71 1.90 0.68 0.50 0 0.40 0 0.21 0 0.51 0 0.58 1 0.63 1 0.37 0 2021-08-07 00:04:36.326391 NaN NaN NaN NaN NaN NaN 2.58 1.22 2.08 1.50 3.20 2.30 2021-08-06 08:30:00 1.0 1.0 N Home Draw 2.0 0.0 No Match No Sel. 2021-08-06
How can I write the merge function correctly?
CodePudding user response:
If all columns in df_tt
are used as key columns, no value/additional columns are appended to df
.
If you try
df.reset_index().merge(df_tt, how="left", on='match_date').set_index('index')
you should be able to see run_time_x
and run_time_y
in result.
CodePudding user response:
This is what I tried and works.
import pandas as pd
df['run_time'] = pd.to_datetime(df['run_time'])
df['match_date'] = pd.to_datetime(df['match_datetime']).dt.date
# Keeping the top 2 dates for every runtime to have a smaller dataframe
df_tt = df[['run_time', 'match_date']]
df_tt = df_tt.drop_duplicates()
df_tt = df_tt.sort_values(['run_time', 'match_date'], ascending=[True, True])
df_tt = df_tt.loc[df_tt.groupby(df['run_time']).head(2).index].sort_index()
df_tt.head()
run_time match_date
0 2021-08-07 00:04:36.326391 2021-08-06
188 2021-08-07 00:04:36.326391 2021-08-05
758 2021-08-07 10:50:34.574040 2021-08-07
1407 2021-08-07 10:50:34.574040 2021-08-06
1884 2021-08-07 16:56:22.322338 2021-08-07
I then concat the two columns as I was unsure how merge works with multiple columns
df['concat'] = df['run_time'].astype(str) df['match_date'].astype(str)
df_tt['concat'] = df_tt['run_time'].astype(str) df_tt['match_date'].astype(str)
Then I did a pd.merge
df_tab = df.merge(df_tt, left_on='concat', right_on='concat')
That resulted in a smaller and a filtered dataframe
index match_datetime country league home_team away_team home_odds draw_odds away_odds predicted_home_score predicted_away_score predicted_btts predicted_class_btts predicted_bttso2.5 predicted_class_bttso2.5 predicted_bttsu2.5 predicted_class_bttsu2.5 predicted_o2.5 predicted_class_o2.5 predicted_u2.5 predicted_class_u2.5 predicted_o1.5 predicted_class_o1.5 predicted_u1.5 predicted_class_u1.5 run_time_x predicted_o4 predicted_class_o4 predicted_u4 predicted_class_u4 predicted_draw predicted_class_draw predicted_total_score predicted_score_difference Win DNB O 1.5 U 4.5 datetime home_score away_score selection predicted_result result total_score score_difference result_match selection_match match_date_x concat run_time_y match_date_y
0 0 2021-08-06 06:00:00 Russia FNL 2 - Group 2 Yenisey 2 Lokomotiv-Kazanka 2.61 3.51 2.32 1.45 1.73 0.66 1 0.56 0 0.11 0 0.62 1 0.33 0 0.76 1 0.24 0 2021-08-07 00:04:36.326391 NaN NaN NaN NaN NaN NaN 3.18 0.28 3.98 2.06 4.06 1.80 2021-08-06 06:00:00 2.0 1.0 N Away Home 3.0 1.0 No Match No Sel. 2021-08-06 2021-08-07 00:04:36.3263912021-08-06 2021-08-07 00:04:36.326391 2021-08-06
1 1 2021-08-06 07:00:00 Russia Youth League Ural U19 Krylya Sovetov Samara U19 2.10 3.68 2.98 1.25 1.33 0.44 0 0.40 0 0.04 0 0.56 1 0.47 0 0.60 1 0.35 0 2021-08-07 00:04:36.326391 NaN NaN NaN NaN NaN NaN 2.58 0.08 2.46 2.46 4.59 2.40 2021-08-06 07:00:00 1.0 1.0 N Away Draw 2.0 0.0 No Match No Sel. 2021-08-06 2021-08-07 00:04:36.3263912021-08-06 2021-08-07 00:04:36.326391 2021-08-06
2 2 2021-08-06 08:00:00 World Club Friendly Alaves Al Nasr 1.94 3.63 3.42 1.84 1.39 0.68 1 0.42 1 0.19 0 0.64 1 0.42 0 0.66 1 0.34 0 2021-08-07 00:04:36.326391 NaN NaN NaN NaN NaN NaN 3.23 0.45 3.69 3.69 6.50 2.06 NaN NaN NaN N Home NaN NaN NaN No Result No Sel. 2021-08-06 2021-08-07 00:04:36.3263912021-08-06 2021-08-07 00:04:36.326391 2021-08-06
3 3 2021-08-06 08:30:00 China Jia League Chengdu Rongcheng Shenyang Urban FC 1.16 6.87 15.29 3.07 0.52 0.50 0 0.37 0 0.04 0 0.81 1 0.26 0 0.84 1 0.10 0 2021-08-07 00:04:36.326391 NaN NaN NaN NaN NaN NaN 3.59 2.55 2.42 1.85 4.80 2.20 2021-08-06 08:30:00 2.0 0.0 W Home Home 2.0 2.0 Match Match 2021-08-06 2021-08-07 00:04:36.3263912021-08-06 2021-08-07 00:04:36.326391 2021-08-06
4 4 2021-08-06 08:30:00 China Super League Wuhan FC Tianjin Jinmen Tiger 1.56 3.69 6.71 1.90 0.68 0.50 0 0.40 0 0.21 0 0.51 0 0.58 1 0.63 1 0.37 0 2021-08-07 00:04:36.326391 NaN NaN NaN NaN NaN NaN 2.58 1.22 2.08 1.50 3.20 2.30 2021-08-06 08:30:00 1.0 1.0 N Home Draw 2.0 0.0 No Match No Sel. 2021-08-06 2021-08-07 00:04:36.3263912021-08-06 2021-08-07 00:04:36.326391 2021-08-06
probably not an elegant solution but there can be a better way to approach this