Pandas - check for string matches in different columns with column values being comma separated-CodePudding

Hi I have a df like the following:

Col1         Col2

SM_          SM_
SM_          N_
EX_,SM_      EX_,CO_
SL_,N_       PD_,SL_

I want to compare both columns, and see if a value in Col1 is present in Col2 or not. Multiple values in both columns are comma separated. So, ideally the result of the row wise comparison shown above should be:

True
False
True
True

I have the following code right now:

finaldf['C'] = finaldf.apply(lambda x: x.Extracted_pillars in x.min_pillar_score, axis=1)

This works for single code values, but when each column has multiple values, this does not work properly.

CodePudding user response：

You could use a function that for each cell constructs a set of the comma-separated values and returns whether the intersection of those sets in a row is not empty:

import pandas as pd

def columns_overlap(row):
    sets = [set(s.split(',')) for s in row]
    intersection = set.intersection(*sets)
    return intersection != set()

df = pd.DataFrame({'Col1': ['SM_', 'SM_', 'EX_,SM_', 'SL_,N_'],
                   'Col2': ['SM_', 'N_', 'EX_,CO_', 'PD_,SL_']})

df['C'] = df.apply(columns_overlap, axis=1)

This results in the following dataframe:

    Col1     Col2     C
0   SM_      SM_      True
1   SM_      N_       False
2   EX_,SM_  EX_,CO_  True
3   SL_,N_   PD_,SL_  True

CodePudding user response：

Your solution is good, but you need to split each string by comma and use any.

EDIT:

It turns out there are cases when there is a space before or after comma. Then you can use:

finaldf['C'] = finaldf.apply(lambda x: any(i in x.Col1.replace(' ', '').split(',')
                                           for i in x.Col2.replace(' ', '').split(',')),
                                           axis=1)

Output:

0     True
1    False
2     True
3     True