I am stuck in writing one python code. Actually there is a one little tricky condition-CodePudding

Basically there are two columns: Customer_column and Company_column. Customer_column is taking input from customer also there is a rule to insert data in (OR,AND) condition.

I want check one column value into the another column. The tricky part is like there is an OR , AND condition values available in column which I want to check. The pipe (|) symbol stands for OR condition, and the comma (,) stands for AND condition.

For example: if my one column (Customer_column) has (Oil|Leak) that means I want to check 'oil' or 'leak'. One of them should be available in respective column (Company_column). If it contains (Oil|Leak,engine) that means I want to check whether 'oil' word should be available or 'leak' and 'engine' word should be available in the respective column.

Below is my data frame:

import pandas as pd

data = {'Customer_column': ['(Oil|Leak)', '(Oil|Leak,engine)', '(Oil|Leak,engine)', '(Oil|Leak,radiator)', '(Leak|Mark,water)'],
        'Company_column': ['(leak is present in radiator)', '(oil is leaking)', '(there is a water mark at engine)', '(water is available in radiator)', '(there is a water mark at engine)']
        }

df = pd.DataFrame(data)

print (df)

Below is my expected output:

data = {'Customer_column': ['(Oil|Leak)', '(Oil|Leak,engine)', '(Oil|Leak,engine)', '(Oil|Leak,radiator)', '(Leak|Mark,water)'],
        'Company_column': ['(leak is present in radiator)', '(oil is leaking)', '(there is a water mark at engine)', '(water is available in radiator)', '(there is a water mark at engine)'],
        'Result': ['Leak', 'Oil', 'None', 'None', 'Water,Mark'],
        }

df = pd.DataFrame(data)

print (df)

I tried regex and contain method to solve this. For the OR condition I got my result but I am getting wrong output where the AND condition is written.

import re
df['match'] = [m.group() if (m:=re.search(fr'\b{re.escape(b)}\b', a, flags=re.I)) else None
               for a,b in zip(df['Customer_column'], df['Company_column'])]

The second code I tried:

df['match'] = [b.casefold() in a.casefold().split(',')for a,b in zip(df['Customer_column'], df['Company_column'])]

CodePudding user response：

I think a regex is not best suited here. You could rather use set operations to determine if all words of a kind are present:

def matches(cust, comp):
    words_comp = set(comp[1:-1].casefold().split())
    return ' '.join([x for x in cust[1:-1].split('|')
                     if set(x.casefold().split(','))
                           .issubset(words_comp)
                    ])

df['match'] = [matches(cust, comp) for cust, comp in
               zip(df['Customer_column'], df['Company_column'])]

# or
# df['match'] = df.apply(lambda r: matches(r['Customer_column'],
#                                          r['Company_column']),
#                        axis=1)

output:

NB. I added one more line to show what happens on multiple matches.

       Customer_column                          Company_column            match
0           (Oil|Leak)           (leak is present in radiator)             Leak
1    (Oil|Leak,engine)                        (oil is leaking)              Oil
2    (Oil|Leak,engine)       (there is a water mark at engine)                 
3  (Oil|Leak,radiator)        (water is available in radiator)                 
4    (Leak|Mark,water)       (there is a water mark at engine)       Mark,water
5    (Leak|Mark,water)  (there is a water leak mark at engine)  Leak Mark,water