I'm trying to make a Champions League Fantasy statistic table, and I'm getting data from 2 different sites that name the player slightly different between each other.
I have df1
from site 1:
name age team skill cost gls ast
0 Lionel Messi 34-175 Paris 4 11.3 5 0
1 Ryan Gravenberch 19-214 Ajax 3 6.2 0 0
2 Junior Messias 30-217 Milan 3 6.5 1 0
3 Kepa Arrizabalaga 27-074 Chelsea 1 5.0 0 0
4 Kenneth Taylor 19-214 Ajax 3 5.0 0 0
5 Alisson 30-320 Liverpool 1 6.1 0 0
And df2
from site 2:
name age team gls ast
0 Kepa 27-074 Chelsea 0 0
1 Lionel 34-175 Paris 5 0
2 Junior 30-217 Milan 1 0
3 Kenneth 19-214 Ajax 0 0
4 Neymar 29-314 Paris 0 0
5 Ryan 19-214 Ajax 0 0
My goal is to match the names based on multiple conditions:
- Age (string in
df2
equal to string indf1
) - Team (string in
df2
equal to string indf1
) - Name (string in
df2
is contained in string indf1
)
The reason I want to pass the name as last condition is because there are cases were two players were born the exact same day and play for the same team like Kenneth Taylor and Ryan Gravenberch
I'm thinking of something like this:
df2.loc[(df2['team'] == df1['team']) & (df2['age'] == df1['age']) & (df2['name'].str.contains(df1['name'].str)), 'name'] = df1['name']
But I'm getting this error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
The desire output for df2
is:
name age team gls ast
0 Kepa Arrizabalaga 27-074 Chelsea 0 0
1 Lionel Messi 34-175 Paris 5 0
2 Junior Messias 30-217 Milan 1 0
3 Kenneth Taylor 19-214 Ajax 0 0
4 Neymar 29-314 Paris 0 0
5 Ryan Gravenberch 19-214 Ajax 0 0
Where all the names from df2
that matched the conditions were replaced with the names from df1
CodePudding user response:
def name_match(row):
if len(row['potential_match']) > len(row['name']):
larger = row['potential_match']
smaller = row['name']
else:
larger = row['name']
smaller = row['potential_match']
if smaller in larger:
return row['potential_match']
else:
return row['name']
df2['potential_match'] = df2[['age', 'team']].merge(
df1[['name', 'age', 'team']],
on=['age', 'team'],
how='left')['name']
df2['name'] = df2.apply(name_match , axis=1)
df2.drop('potential_match', axis=1, inplace=True)
CodePudding user response:
Use this to get the answer you desire. There is no need of separate condition again based on name.
df2.loc[(df2['team'] == df1['team']) & (df2['gls'] == df1['gls']), 'name'] = df1['name']
CodePudding user response:
Try with merge
:
matches = df2.merge(df1[["name", "age", "team"]],
on=["age", "team"],
how="left")
matches["name_y"] = matches["name_y"].fillna(matches["name_x"])
matches = matches.where(matches.apply(lambda x: x["name_x"] in x["name_y"], axis=1)).dropna()
output = matches.drop("name_x", axis=1).rename(columns={"name_y": "name"}).reindex(df2.columns, axis=1)
>>> output
name age team gls ast
0 Kepa Arrizabalaga 27-074 Chelsea 0.0 0.0
1 Lionel Messi 34-175 Paris 5.0 0.0
2 Junior Messias 30-217 Milan 1.0 0.0
4 Kenneth Taylor 19-214 Ajax 0.0 0.0
5 Neymar 29-314 Paris 0.0 0.0
6 Ryan Gravenberch 19-214 Ajax 0.0 0.0
CodePudding user response:
(i) Merge df2
from the right to df1
on age
and team
.
(ii) Assign names in df2
but not in df1
to the name
column from df1
(which is name_x
).
(iii) Filter out the names not matching across name_x
and name_y
columns and drop name_y
.
df3 = df1[['name','age','team']].merge(df2, on=['age','team'], how='right')
mask = pd.isna(df3['name_x'])
df3.loc[mask,'name_x'] = df3.loc[mask,'name_y'].to_numpy()
df3 = df3[df3.apply(lambda x: x['name_y'] in x['name_x'], axis=1)].drop('name_y', axis=1)
Output:
name_x age team gls ast
0 Kepa Arrizabalaga 27-074 Chelsea 0 0
1 Lionel Messi 34-175 Paris 5 0
2 Junior Messias 30-217 Milan 1 0
4 Kenneth Taylor 19-214 Ajax 0 0
5 Neymar 29-314 Paris 0 0
6 Ryan Gravenberch 19-214 Ajax 0 0