Home > OS >  Comparing various (but not all) columns of two different sized dataframes and select only those rows
Comparing various (but not all) columns of two different sized dataframes and select only those rows

Time:07-20

I have two datframes that have different numbers of rows and well as different numbers of columns.

row_List1:

        date   team_home  team_away   goals_home   goals_away   shootout_win   competition

1 2018-06-04 India Kenya 3 0 NaN Friendly 2018
2 2018-06-06 Armenia Moldova 0 0 NaN Friendly 2018
3 2018-06-09 Italy Netherlands 1 1 NaN Friendly 2018

row_List2:

date team_home team_away goals_home goals_away shootout_win competition venue

1 2018-06-04 India Kenya 3 0 NaN Friendly 2018 Home
2 2018-06-05 USA Pakistan 8 5 NaN Friendly 2018 Nuetral
3 2018-06-06 Moldova Armenia 0 0 NaN Friendly 2018 Away
4 2018-06-07 India Srilanka 2 0 NaN Friendly 2018 Home
3 2018-06-09 Italy Netherlands 1 1 NaN Friendly 2018 Away
6 2018-06-04 India Kenya 3 0 NaN Friendly 2018 Home

So row_List2 has more columns and more rows than row_List1.

row_List2 has venues of all matches. I need the to add a column venues in row_List1 and check for a match in row_List1, if it exists in row_List2, I need to extract the venue and add to the new column in row_List1.

I tried the below code:

# row_list1['venue'] = np.where(((row_list1['date'] == row_list2['date']) and (row_list1['team_home'] == row_list2['team_home'] or row_list1['team_home'] == row_list2['team_away']) and (row_list1['team_away'] == row_list2['team_away'] or row_list1['team_away'] == row_list2['team_home']) and (row_list1['goals_home'] == row_list2['goals_home'] or row_list1['goals_home'] == row_list2['goals_away']) and (row_list1['goals_away'] == row_list2['goals_away'] or row_list1['goals_away'] == row_list2['goals_home'])), row_list2['venue'], np.NaN)

These are the conditions I need but the above code gives me an error:

ValueError: Can only compare identically-labeled Series objects

Now one more problem is that the team_home and team_away may be switched in row_List2. So I need to check:

if row_list1['team_home'] == row_list2['team_home'] or row_list1['team_home'] == row_list2['team_away']) and (row_list1['team_away'] == row_list2['team_away'] or row_list1['team_away'] == row_list2['team_home']) and (row_list1['goals_home'] == row_list2['goals_home'] or row_list1['goals_home'] == row_list2['goals_away']) and (row_list1['goals_away'] == row_list2['goals_away'] or row_list1['goals_away'] == row_list2['goals_home'])

What I want as an output is:

row_List1:

        date   team_home  team_away   goals_home   goals_away   shootout_win   competition     venue

1 2018-06-04 India Kenya 3 0 NaN Friendly 2018 Home
2 2018-06-06 Armenia Moldova 0 0 NaN Friendly 2018 Away
3 2018-06-09 Italy Netherlands 1 1 NaN Friendly 2018 Away

Can Anyone please help?

CodePudding user response:

This is kind of hacky but it works. Note that the Armenia-Moldova games don't actually match in your dataframes (they're flipped home/away). I had to .fillna() before performing the comparison because np.nan doesn't == np.nan.

>>> for df in [df1, df2]:
...    df.fillna(0, inplace=True)

>>> df1[[df2.drop('venue', axis=1).eq(r).all(axis=1).any() for r in df1.itertuples(index=False)]]

    date    team_home   team_away   goals_home  goals_away  shootout_win    competition year
0   2018-06-04  India   Kenya   3   0   0.0 Friendly    2018
2   2018-06-09  Italy   Netherlands 1   1   0.0 Friendly    2018

CodePudding user response:

Is this what you are looking for?

df = pd.merge(row_List1,row_List2.drop_duplicates(),how = 'left')

Output:

       date team_home    team_away  goals_home  goals_away  shootout_win  \
0  6/4/2018     India        Kenya           3           0           NaN   
1  6/6/2018   Armenia      Moldova           0           0           NaN   
2  6/9/2018     Italy  Netherlands           1           1           NaN   

  competition  year venue  
0    Friendly  2018  Home  
1    Friendly  2018   NaN  
2    Friendly  2018  Away
  • Related