Hi I have a df like the following:
Col1 Col2
SM_ SM_
SM_ N_
EX_,SM_ EX_,CO_
SL_,N_ PD_,SL_
I want to compare both columns, and see if a value in Col1 is present in Col2 or not. Multiple values in both columns are comma separated. So, ideally the result of the row wise comparison shown above should be:
True
False
True
True
I have the following code right now:
finaldf['C'] = finaldf.apply(lambda x: x.Extracted_pillars in x.min_pillar_score, axis=1)
This works for single code values, but when each column has multiple values, this does not work properly.
CodePudding user response:
You could use a function that for each cell constructs a set of the comma-separated values and returns whether the intersection of those sets in a row is not empty:
import pandas as pd
def columns_overlap(row):
sets = [set(s.split(',')) for s in row]
intersection = set.intersection(*sets)
return intersection != set()
df = pd.DataFrame({'Col1': ['SM_', 'SM_', 'EX_,SM_', 'SL_,N_'],
'Col2': ['SM_', 'N_', 'EX_,CO_', 'PD_,SL_']})
df['C'] = df.apply(columns_overlap, axis=1)
This results in the following dataframe:
Col1 Col2 C
0 SM_ SM_ True
1 SM_ N_ False
2 EX_,SM_ EX_,CO_ True
3 SL_,N_ PD_,SL_ True
CodePudding user response:
Your solution is good, but you need to split each string by comma and use any
.
EDIT:
It turns out there are cases when there is a space before or after comma. Then you can use:
finaldf['C'] = finaldf.apply(lambda x: any(i in x.Col1.replace(' ', '').split(',')
for i in x.Col2.replace(' ', '').split(',')),
axis=1)
Output:
0 True
1 False
2 True
3 True