I need to do a substring search in a string by condition in the second column. I have 2 dataframes:
df1 = {'Descr': ["VALVE, PRESSURE", "pump ttf", "Valve, electrical", "Geeku, electrical","VALVE, OVERBOARD, BUTTERFLY"],
'N_Product': ["VALVE", "PUMP", "VALVE", "GEEKU","VALVE"],
}
df2 = {'N_Product': ["VALVE", "VALVE","VALVE", "PUMP", "GEEKU"],
'M_Product': ["PRESSURE", "ELECTRICAL","", "", "ELECTRICAL"],
}
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)
you need to compare the values from the N_Product columns from the two dataframes, if they match, then search for the corresponding values from the M_Product column (for example, for Valve - there are 3 matches: "PRESSURE", "ELECTRICAL" and "" (empty value) in the Descr column, if it is there, then add it to the Result column. To search for substrings in a string, I use the following code:
c = df2['M_Product'].astype(str).to_list()
def matcher(x):
for i in c:
if i.lower() in x.lower():
return i
else:
return np.nan
df1['Res'] = df1['Descr'].apply(matcher)
but I don't know how to cycle through the values of only the corresponding M_Product for N_Product.
Desired result:
df1 = {'Descr': ["VALVE, PRESSURE", "pump ttf", "Valve, electrical", "Geeku, electrical","VALVE, OVERBOARD, BUTTERFLY"],
'N_Product': ["VALVE", "PUMP", "VALVE", "GEEKU","VALVE"],
},
'Result': ["VALVE: PRESSURE;", "PUMP", "VALVE: ELECTRICAL;", "GEEKU: ELECTRICAL;","VALVE;"],
}
I would be grateful for any help)
CodePudding user response:
(UPDATED)
Based on the updated question, my understanding of what's being asked is this:
- Create a new
Result
column - If the
N_Product
columns in df1 and df2 for a given row match, then append to the value from theN_Product
column in df1 the first match in columnM_Product
of df2 of a string found in the given row'sDescr
column in df1 (with an intervening:
character). - Otherwise, put
N_Product
from df1 in theResult
column. - Also append a
;
character to what is put inResult
.
Here is a way to do that:
def foo(x):
descr = x['Descr'].upper()
match = None
for mStr in df2['M_Product'].str.upper():
if mStr in descr:
match = mStr
break
if match is None:
return x['N_Product'] ';'
else:
return x['N_Product'] ': ' match ';'
mask = df1['N_Product'] == df2['N_Product']
df1.loc[mask, 'Result'] = df1.apply(foo, axis = 1)
df1.loc[~mask, 'Result'] = df1['N_Product'] ';'
Explanation:
- Create a boolean Series
mask
that is True for rows of df1 withN_Product
matching the corresponding value in df2. - For rows in df1 where
mask
is True, useapply
to callfoo
which performs the logic of identifying the first value (if any) in theM_Product
column of df2 that is found in a given row'sDescr
column and packaging it in a string of the formN_Product: M_Product;
if found, otherwise justN_Product;
. - For rows in df1 where
mask
is False (namely:~mask
), set theResult
column to beN_Product;
.
Input:
df1:
Descr N_Product
0 VALVE, PRESSURE VALVE
1 pump ttf PUMP
2 Valve, electrical VALVE
3 Geeku, electrical GEEKU
4 VALVE, OVERBOARD, BUTTERFLY VALVE
df2:
N_Product M_Product
0 VALVE PRESSURE
1 VALVE ELECTRICAL
2 VALVE
3 PUMP
4 GEEKU MBA
Output:
Descr N_Product Result
0 VALVE, PRESSURE VALVE VALVE: PRESSURE;
1 pump ttf PUMP PUMP;
2 Valve, electrical VALVE VALVE: ELECTRICAL;
3 Geeku, electrical GEEKU GEEKU;
4 VALVE, OVERBOARD, BUTTERFLY VALVE VALVE;