I have a dataframe (df1) that looks like this:
first_name | last_name | affiliation |
---|---|---|
jean | dulac | University of Texas |
peter | huta | University of Maryland |
I want to match this dataframe to another one that contains several potential matches. Each potential match has a first and last name and also a list of all the affiliations this person was associated with, and I want to use the information in this affiliation column to differentiate between my potential matches and keep only the most likely one.
The second dataframe has the following form:
first_name | last_name | affiliations_all |
---|---|---|
jean | dulac | [{'city_name': 'Kyoto', 'country_name': 'Japan', 'name': 'Kyoto University'}] |
jean | dulac | [{'city_name': 'Texas', 'country_name': 'USA', 'name': 'University of Texas'}] |
The column affiliations_all is apparently saved as a pandas.core.series.Series (and I can't change that since it comes from an API query).
I am thinking that one way to match the 2 dataframes would be to remove the words like "university" and "of" from the affiliation column of the first dataframe (that's easy), do the same for the affiliations_all column of the second dataframe (don't know how to do that) and then run some version of
test.apply(lambda x: str(x.affiliation) in str(x.affiliations_all), axis=1)
adapted to the fact that affiliations_all is a series.Series.
Any idea how to do that?
Thanks!
CodePudding user response:
One possible solution would be to transform df2
(expand the columns) and then merge df1
with df2
:
# transform df2
df2 = df2.explode("affiliations_all")
df2 = pd.concat([df2, df2.pop("affiliations_all").apply(pd.Series)], axis=1)
df2 = df2.rename(columns={"name": "affiliation"})
print(df2)
This prints:
first_name last_name city_name country_name affiliation
0 jean dulac Kyoto Japan Kyoto University
1 jean dulac Texas USA University of Texas
And the seconds step will be merge df1
with transformed df2
:
df_out = pd.merge(df1, df2, on=["first_name", "last_name", "affiliation"])
print(df_out)
Prints:
first_name last_name affiliation city_name country_name
0 jean dulac University of Texas Texas USA