I'm trying to check for string matches between two dataframes and return the name of the column the match was found in.
I need to include all matches found in an exact match to be returned as a list in df1.
My Dataframes
df1 (Data to lookup is in the title column)
title counts
0 nike 3
1 adidas 2
2 rebok 2
df2 (Dataframe to search for matches in)
0 1 2
0 nike adidas rebok
1 nike rebok hitech
2 converse puma converse
3 hitech adidas nike
Desired Output
title counts matching_cols
0 nike 3 [0,0,2]
1 adidas 2 [1,1]
2 rebok 2 [1,2]
Minimum Reproducible Example
# make df1
data = [['nike', 3], ['adidas', 2], ['rebok', 2]]
df_counts = pd.DataFrame(data, columns=['title', 'counts'])
# make df2
lu_data = [{0: 'nike', 1: 'adidas', 2: 'rebok'},
{0: 'nike', 1: 'rebok', 2: 'hitech'},
{0: 'converse', 1: 'puma', 2: 'converse'},
{0: 'hitech', 1: 'adidas', 2: 'nike'}
]
df_words = pd.DataFrame(lu_data)
What I've tried: I'm a little bit stuck reading a lot of similar threads on stackoverflow / google. The issue I'm having is most solutions are looking to match the name of the column, whereas I'm not looking to return the name of a column when a string matches.
CodePudding user response:
Melt
the dataframe then groupby
and agg with list
to create a mapping dict that maps title to list of column names, then use this dictionary to substitute values in title
column of df_counts
d = df_words.melt().groupby('value')['variable'].agg(list)
df_counts['matching_cols'] = df_counts['title'].map(d)
Result
title counts matching_cols
0 nike 3 [0, 0, 2]
1 adidas 2 [1, 1]
2 rebok 2 [1, 2]