Home > OS >  Check if string in df1 exists anywhere in df2 and return the matching column name(s) in df1
Check if string in df1 exists anywhere in df2 and return the matching column name(s) in df1

Time:09-26

I'm trying to check for string matches between two dataframes and return the name of the column the match was found in.

I need to include all matches found in an exact match to be returned as a list in df1.

My Dataframes

df1 (Data to lookup is in the title column)

    title  counts
0    nike       3
1  adidas       2
2   rebok       2

df2 (Dataframe to search for matches in)

          0       1         2
0      nike  adidas     rebok
1      nike   rebok    hitech
2  converse    puma  converse
3    hitech  adidas      nike

Desired Output

    title  counts  matching_cols
0    nike       3  [0,0,2]
1  adidas       2  [1,1]
2   rebok       2  [1,2]

Minimum Reproducible Example

# make df1
data = [['nike', 3], ['adidas', 2], ['rebok', 2]]
df_counts = pd.DataFrame(data, columns=['title', 'counts'])

# make df2
lu_data = [{0: 'nike', 1: 'adidas', 2: 'rebok'},
        {0: 'nike', 1: 'rebok', 2: 'hitech'},
        {0: 'converse', 1: 'puma', 2: 'converse'},
        {0: 'hitech', 1: 'adidas', 2: 'nike'}
        ]

df_words = pd.DataFrame(lu_data)

What I've tried: I'm a little bit stuck reading a lot of similar threads on stackoverflow / google. The issue I'm having is most solutions are looking to match the name of the column, whereas I'm not looking to return the name of a column when a string matches.

CodePudding user response:

Melt the dataframe then groupby and agg with list to create a mapping dict that maps title to list of column names, then use this dictionary to substitute values in title column of df_counts

d = df_words.melt().groupby('value')['variable'].agg(list)
df_counts['matching_cols'] = df_counts['title'].map(d)

Result

    title  counts matching_cols
0    nike       3     [0, 0, 2]
1  adidas       2        [1, 1]
2   rebok       2        [1, 2]
  • Related