Basically there are two columns: Customer_column and Company_column. Customer_column is taking input from customer also there is a rule to insert data in (OR,AND) condition.
I want check one column value into the another column. The tricky part is like there is an OR , AND condition values available in column which I want to check. The pipe (|) symbol stands for OR condition, and the comma (,) stands for AND condition.
For example: if my one column (Customer_column) has (Oil|Leak) that means I want to check 'oil' or 'leak'. One of them should be available in respective column (Company_column). If it contains (Oil|Leak,engine) that means I want to check whether 'oil' word should be available or 'leak' and 'engine' word should be available in the respective column.
Below is my data frame:
import pandas as pd
data = {'Customer_column': ['(Oil|Leak)', '(Oil|Leak,engine)', '(Oil|Leak,engine)', '(Oil|Leak,radiator)', '(Leak|Mark,water)'],
'Company_column': ['(leak is present in radiator)', '(oil is leaking)', '(there is a water mark at engine)', '(water is available in radiator)', '(there is a water mark at engine)']
}
df = pd.DataFrame(data)
print (df)
Below is my expected output:
data = {'Customer_column': ['(Oil|Leak)', '(Oil|Leak,engine)', '(Oil|Leak,engine)', '(Oil|Leak,radiator)', '(Leak|Mark,water)'],
'Company_column': ['(leak is present in radiator)', '(oil is leaking)', '(there is a water mark at engine)', '(water is available in radiator)', '(there is a water mark at engine)'],
'Result': ['Leak', 'Oil', 'None', 'None', 'Water,Mark'],
}
df = pd.DataFrame(data)
print (df)
I tried regex and contain method to solve this. For the OR condition I got my result but I am getting wrong output where the AND condition is written.
import re
df['match'] = [m.group() if (m:=re.search(fr'\b{re.escape(b)}\b', a, flags=re.I)) else None
for a,b in zip(df['Customer_column'], df['Company_column'])]
The second code I tried:
df['match'] = [b.casefold() in a.casefold().split(',')for a,b in zip(df['Customer_column'], df['Company_column'])]
CodePudding user response:
I think a regex is not best suited here. You could rather use set
operations to determine if all words of a kind are present:
def matches(cust, comp):
words_comp = set(comp[1:-1].casefold().split())
return ' '.join([x for x in cust[1:-1].split('|')
if set(x.casefold().split(','))
.issubset(words_comp)
])
df['match'] = [matches(cust, comp) for cust, comp in
zip(df['Customer_column'], df['Company_column'])]
# or
# df['match'] = df.apply(lambda r: matches(r['Customer_column'],
# r['Company_column']),
# axis=1)
output:
NB. I added one more line to show what happens on multiple matches.
Customer_column Company_column match
0 (Oil|Leak) (leak is present in radiator) Leak
1 (Oil|Leak,engine) (oil is leaking) Oil
2 (Oil|Leak,engine) (there is a water mark at engine)
3 (Oil|Leak,radiator) (water is available in radiator)
4 (Leak|Mark,water) (there is a water mark at engine) Mark,water
5 (Leak|Mark,water) (there is a water leak mark at engine) Leak Mark,water