How to merge multiple datasets with differences in merge-index strings?

Hello, I am struggling to find a solution to what is probably a very common problem.

I want to merge two CSV files with soccer data. They basically store different data about the same games. Normally I would merge them with .merge, but the problem is that the naming differs for some teams between the two datasets. For example, Manchester City is called Man. City in the second data frame.

Here's roughly what df1 and df2 look like:

df1:

team1            team2     date                      some_value_i_want_to_compare
Manchester City  Arsenal   2022-05-20 22:00:00 0.2812  5

df2:

team1       team2     date                      some_value_i_want_to_compare
Man. City   Arsenal   2022-05-20 22:00:00 0.2812  3

Note that in the case above only team1 differs, but team2 could be slightly different as well. For example, Arsenal could be called FC Arsenal in the second dataset.

So my main question is: How can I automatically analyse the differences in naming between the two datasets?

My second question is: How do I scale this for more than 2 data sets so that the number of data sets ultimately doesn't matter?

CodePudding user response：

You could start by doing an anti-join to isolate the ones that don't match:

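# Left-join df1 against the unique team1 names in df2;
# indicator=True adds a _merge column saying where each row was found
teams_join = df1.merge(df2[['team1']].drop_duplicates(), on='team1',
                       how='left', indicator=True)

# Select the team1 values where _merge is left_only (no match in df2)
team_list = teams_join.loc[teams_join['_merge'] == 'left_only', 'team1']

# Print the rows of df1 whose team1 has no match in df2
print(df1[df1["team1"].isin(team_list)])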

This gives you all the teams in df1 without a match in df2. You can do the same for df2 by swapping df1 and df2 in the code above. You then have two lists of names that don't match, which you can rename manually if there are few enough of them.
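Both directions can also be checked in one pass with an outer merge on the unique names. This is only a minimal sketch: it assumes the frames are called df1 and df2, that a name can differ in either team column, and the helper name unmatched_names is made up for illustration.

```python
import pandas as pd

def unmatched_names(df1, df2, cols=("team1", "team2")):
    """Return (names only in df1, names only in df2) across the given columns."""
    # Collect every distinct team name per frame, regardless of its column
    names1 = pd.DataFrame({"team": pd.unique(df1[list(cols)].to_numpy().ravel())})
    names2 = pd.DataFrame({"team": pd.unique(df2[list(cols)].to_numpy().ravel())})
    # An outer merge with indicator=True labels each name
    # as left_only, right_only or both
    joined = names1.merge(names2, on="team", how="outer", indicator=True)
    only1 = sorted(joined.loc[joined["_merge"] == "left_only", "team"])
    only2 = sorted(joined.loc[joined["_merge"] == "right_only", "team"])
    return only1, only2

# Toy data modelled on the question (illustrative values only)
df1 = pd.DataFrame({"team1": ["Manchester City"], "team2": ["Arsenal"]})
df2 = pd.DataFrame({"team1": ["Man. City"], "team2": ["FC Arsenal"]})
only1, only2 = unmatched_names(df1, df2)
print(only1)
print(only2)
```

With more than two datasets, the same helper could be run for each dataset against one reference dataset.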

CodePudding user response：

As the commenters and the existing answer have suggested, if the number of unique names is not too large, you can manually extract the mismatches and correct them. That is probably the best solution unless the number of mismatches is very large.

Another case that can occur is when you have a ground-truth list of allowed names (for example, the list of all soccer teams in a given league), but the data contains many different attempts at spelling or abbreviating each team. If this is similar to your situation, you can use difflib to search for the most likely match for a given name. For example:

import difflib

true_names = ['Manchester United', 'Chelsea']
mismatch_names = ['Man. Unites', 'Chlsea', 'Chelsee']
# get_close_matches returns a list, which is empty if nothing is similar enough
best_matches = [difflib.get_close_matches(x, true_names, n=1) for x in mismatch_names]
for old, new in zip(mismatch_names, best_matches):
    print(f"Best match for {old} is {new[0] if new else 'not found'}")

output:

Best match for Man. Unites is Manchester United
Best match for Chlsea is Chelsea
Best match for Chelsee is Chelsea

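Note that if the spelling is very bad, you can ask difflib for the n closest matches using the n= keyword argument and pick the right one yourself. Matches below the similarity threshold (the cutoff= argument, 0.6 by default) are dropped, so get_close_matches can also return an empty list. This can reduce the manual data cleaning work, although some manual cleaning is often unavoidable.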
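To apply the matches before merging, one option is to build a lookup table from each variant to its canonical name and set aside anything difflib cannot match. The following is a rough sketch under those assumptions; build_name_map is a hypothetical helper, and the cutoff of 0.6 is simply difflib's default, which you may need to tune for your data.

```python
import difflib

def build_name_map(names, true_names, cutoff=0.6):
    """Map each name to its closest entry in true_names; collect names with no match."""
    mapping, unresolved = {}, []
    for name in names:
        if name in true_names:
            # Already canonical, keep as-is
            mapping[name] = name
            continue
        candidates = difflib.get_close_matches(name, true_names, n=1, cutoff=cutoff)
        if candidates:
            mapping[name] = candidates[0]
        else:
            # Nothing similar enough: leave for manual review
            unresolved.append(name)
    return mapping, unresolved

true_names = ["Manchester United", "Chelsea"]
mapping, unresolved = build_name_map(
    ["Man. Unites", "Chlsea", "Chelsea", "Arsenal"], true_names
)
print(mapping)
print(unresolved)
```

The resulting dictionary could then be applied to each data frame, for example with df[["team1", "team2"]].replace(mapping), before merging on team1, team2 and date. Because every dataset is normalised against the same list of true names, the approach does not depend on how many datasets there are.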

Hope it helps.