Check if a pandas.core.series.Series contains a specific string in Python-CodePudding

I have a dataframe (df1) that looks like this:

first_name	last_name	affiliation
jean	dulac	University of Texas
peter	huta	University of Maryland

I want to match this dataframe to another one that contains several potential matches. Each potential match has a first and last name and also a list of all the affiliations this person was associated with, and I want to use the information in this affiliation column to differentiate between my potential matches and keep only the most likely one.

The second dataframe has the following form:

first_name	last_name	affiliations_all
jean	dulac	[{'city_name': 'Kyoto', 'country_name': 'Japan', 'name': 'Kyoto University'}]
jean	dulac	[{'city_name': 'Texas', 'country_name': 'USA', 'name': 'University of Texas'}]

The column affiliations_all is apparently saved as a pandas.core.series.Series (and I can't change that since it comes from an API query).

I am thinking that one way to match the 2 dataframes would be to remove the words like "university" and "of" from the affiliation column of the first dataframe (that's easy), do the same for the affiliations_all column of the second dataframe (don't know how to do that) and then run some version of

test.apply(lambda x: str(x.affiliation) in str(x.affiliations_all), axis=1)

adapted to the fact that affiliations_all is a series.Series.

Any idea how to do that?

Thanks!

CodePudding user response：

One possible solution would be to transform df2 (expand the columns) and then merge df1 with df2:

# transform df2
df2 = df2.explode("affiliations_all")
df2 = pd.concat([df2, df2.pop("affiliations_all").apply(pd.Series)], axis=1)
df2 = df2.rename(columns={"name": "affiliation"})

print(df2)

This prints:

  first_name last_name city_name country_name          affiliation
0       jean     dulac     Kyoto        Japan     Kyoto University
1       jean     dulac     Texas          USA  University of Texas

And the seconds step will be merge df1 with transformed df2:

df_out = pd.merge(df1, df2, on=["first_name", "last_name", "affiliation"])

print(df_out)

Prints:

  first_name last_name          affiliation city_name country_name
0       jean     dulac  University of Texas     Texas          USA