Home > Net >  Merge and match DataFrames with similar values
Merge and match DataFrames with similar values

Time:12-07

Im trying to merge and match two dataframes with names, one DF only has names and one other column whilst the other has names and alot of different columns.

df1
| Player          | Rating |
| O. Fuglsang     | 53     |
| J. Jensen-Abbew | 53     |
| M. Enggård      | 53     |
| J. Liburd       | 53     |

df2
| Player       | Team | Age | Value |
| R. Mudražija | a1   |a2   |a3     |
| J. Gomez     | b1   |b2   |b3     |
| T. Mølgaard  | c1   |c2   |c2     |
| J. Liburd    | d1   |d2   |d3     |

I have roughly 200 players in each dataframe and almost all of them are the same, but maybe 10 players are different in the dataframes, but the format of the names should be the same.

So im trying to merge and match the DF to include the rating from df1 into df2 with all the other stats df2 has for each player.

I tried with the following code:


dfmerged = df1.merge(df2,
          left_on=df1['Player'].str.replace(r'^(.)\w ', r'\1.', regex=True),
          right_on='Player'
         )

However when i run this my dfmerged dataframe is empty.

Im unfamiliar with the regular expression library but when searching here on stackoverflow most of the "solutions" i find use regex.

CodePudding user response:

merged_df = DF2.merge(DF1, how = 'inner', on = ['date', 'hours'])

This will perform and "inner-join" thereby omitting rows in each dataframe that do not match. Hence, no NaN in either the right or left part of merged dataframe.

CodePudding user response:

This works in your example, It will take inner as default join and you don't have to define the key because its same in both table. Pass 'left', 'right' in 'how' argument to define the join type in merge function. I correct your regex and apply your example in following code,

import pandas as pd
import re
df_1 = pd.DataFrame(data=[
    {'Player': 'O. Fuglsang'},
    {'Player': 'J. Jensen-Abbew'},
    {'Player': 'M. Enggård'},
    {'Player': 'J. Liburd'}
])
df_2 = pd.DataFrame(data=[
    {'Player': 'R. Mudražija', 'Team': 'a1', 'Age': 'a2', 'Value': 'a3'},
    {'Player': 'J. Gomez', 'Team': 'b1', 'Age': 'b2', 'Value': 'b3'},
    {'Player': 'T. Mølgaard', 'Team': 'c1', 'Age': 'c2', 'Value': 'c3'},
    {'Player': 'J. Liburd', 'Team': 'd1', 'Age': 'd2', 'Value': 'd3'}
])
result_inner_without_regex = df_1.merge(df_2) # by default it takes inner join
def replace_name(name):
    return re.sub('\w \.', '', name)
df_1['Player'] = df_1['Player'].apply(replace_name)
df_2['Player'] = df_2['Player'].apply(replace_name)
result_with_regex_left = df_1.merge(df_2, how='left') # player as a key because only in these table
result_with_regex_left
  • Related