I need to do a substring search in a string by condition in the second column. I have 2 dataframes: df1 df2
(Step 1) For the first row in df1, the N_Product column is VALVE.
(Step 2) Look for VALVE in the N_Product column of each row df2 and find 3 matches with the following pairs (
df2 ['N_Product'] (VALVE) - df2 ['M_Product'] (DONE),
df2 ['N_Product'] (VALVE) - df2 ['M_Product'] (PRESSURE),
df2 ['N_Product'] (VALVE) - df2 ['M_Product'] ('').
(Step 3) Then you need to check if M_Product contains the following values:
df2 ['N_Product'] (VALVE) - df2 ['M_Product'] (DONE),
df2 ['N_Product'] (VALVE) - df2 ['M_Product'] (PRESSURE),
df2 ['N_Product'] (VALVE) - df2 ['M_Product'] ('')
in df1 ['Descr'], if contained, you need to write N_Product ":" M_Product ";", if not , then just N_Product ';' .For 'VALVE' you need to look for df2 ["M_Product"] in df1 ['Descr'] only "DONE", "PRESSURE" and "", others are not needed, for N_Product ('GEEKU') - only "ELECTRICAL", "OVERBOARD" and "" (values), etc., depending on the values ('M_Product') that correspond to ('N_Product'), the values ('M_Product') that correspond to other values ('N_Product') to look for in Df1 ['Descr '] - do not need
df1 = {'Descr': ["VALVE, DONE", "pump ttf", "pump electrical", "Valve, ww","Geeku MBA , electrical","valve PRESSURE, OVERBOARD","VALVE, Electrical DONE","Geeku electrical OVERBOARD","Geeku OVERBOARD , electrical"],
'N_Product': ["VALVE", "PUMP", "PUMP", "VALVE", "GEEKU","VALVE","VALVE", "GEEKU", "GEEKU"],
}
df2 = {'N_Product': ["GEEKU","GEEKU","GEEKU", "PUMP", "PUMP","VALVE", "VALVE","VALVE"],
'M_Product': ["ELECTRICAL", "OVERBOARD","", "TTF","", "DONE","PRESSURE",""],
}
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)
I use this code, but it searches for df2 ['M_Product'] for all values, but only for those where df1 ['N_product'] == df2 ['N_Product'] / I would appreciate any help in resolving this problems
def foo(x):
descr = x['Descr'].upper()
match = None
if x['N_Product'].upper() in list(df2['N_Product']):
for mStr in df2['M_Product'].str.upper():
if mStr in descr:
match = mStr
break
if match is None:
return x['N_Product'] ';'
else:
return x['N_Product'] ': ' match ';'
df1['Result'] = df1.apply(foo, axis = 1)
I add a picture that visualizes what needs to be done, for example the value of df1 ['N_Product'] "Valve") similarly everything needs to be done for all values:
CodePudding user response:
Based on the result you've described using the picture in your question, here is my understanding of what you're trying to do:
- Each N_Product value has an associated list of M_Product values in df2.
- Each N_Product value in df1 has a Descr value that is a csv list of the following format: N_Product followed by one or more compatible values of M_Product for this row.
- Objective: Append a Result column to df1 containing the N_Product value n for each row with the first of the corresponding Descr M_Product values m such that (n, m) is found in df2.
Here is some code that I believe will do what you've asked:
import pandas as pd
df1 = {'Descr': ["VALVE, DONE", "pump ttf", "pump electrical", "Valve, ww","Geeku MBA , electrical","valve PRESSURE, OVERBOARD","VALVE, Electrical DONE","Geeku electrical OVERBOARD","Geeku OVERBOARD , electrical"],
'N_Product': ["VALVE", "PUMP", "PUMP", "VALVE", "GEEKU","VALVE","VALVE", "GEEKU", "GEEKU"],
}
df2 = {'N_Product': ["GEEKU","GEEKU","GEEKU", "PUMP", "PUMP","VALVE", "VALVE","VALVE"],
'M_Product': ["ELECTRICAL", "OVERBOARD","", "TTF","", "DONE","PRESSURE",""],
}
df1 = pd.DataFrame(df1).apply(lambda x: x.astype(str).str.upper())
df2 = pd.DataFrame(df2).apply(lambda x: x.astype(str).str.upper())
print('df1:'); print(df1)
print('df2:'); print(df2)
df1['M_Product'] = df1['Descr'].apply(lambda x: [val.strip(',') for val in x.split() if val.strip(',')]).str.slice(start=1)
df1['df1_row'] = df1.index
df3 = df1[['df1_row', 'N_Product', 'M_Product']].explode('M_Product')
df5 = df3.merge(df2, on=['N_Product', 'M_Product']).groupby('df1_row').nth(0)
df1['M_Product'] = df5['M_Product']
df1['Result'] = df1['N_Product'] (~df1['M_Product'].isna()) * (': ' df1['M_Product'].astype(str).str.strip()) ';'
df1 = df1.drop(columns=['M_Product', 'df1_row'])
print('result:'); print(df1)
Output:
df1:
Descr N_Product
0 VALVE, DONE VALVE
1 PUMP TTF PUMP
2 PUMP ELECTRICAL PUMP
3 VALVE, WW VALVE
4 GEEKU MBA , ELECTRICAL GEEKU
5 VALVE PRESSURE, OVERBOARD VALVE
6 VALVE, ELECTRICAL DONE VALVE
7 GEEKU ELECTRICAL OVERBOARD GEEKU
8 GEEKU OVERBOARD , ELECTRICAL GEEKU
df2:
N_Product M_Product
0 GEEKU ELECTRICAL
1 GEEKU OVERBOARD
2 GEEKU
3 PUMP TTF
4 PUMP
5 VALVE DONE
6 VALVE PRESSURE
7 VALVE
result:
Descr N_Product Result
0 VALVE, DONE VALVE VALVE: DONE;
1 PUMP TTF PUMP PUMP: TTF;
2 PUMP ELECTRICAL PUMP PUMP;
3 VALVE, WW VALVE VALVE;
4 GEEKU MBA , ELECTRICAL GEEKU GEEKU: ELECTRICAL;
5 VALVE PRESSURE, OVERBOARD VALVE VALVE: PRESSURE;
6 VALVE, ELECTRICAL DONE VALVE VALVE: DONE;
7 GEEKU ELECTRICAL OVERBOARD GEEKU GEEKU: ELECTRICAL;
8 GEEKU OVERBOARD , ELECTRICAL GEEKU GEEKU: ELECTRICAL;
Explanation:
- make everything in df1 and df2 uppercase to simplify matching
- split Descr into tokens and except for the first one (which is just a duplicate of N_Product), put them into a list in df1 in a new column named
M_Product
- record the index of the original df1 rows in a column named
df1_row
- use
explode()
to create a dataframe df3 with one row for each value in the aboveM_Product
column in df1 - use
merge()
to select rows in df3 that match rows in df2 on (N_Product, M_Product) - use
groupby()
ondf1_row
andnth(0)
to take the 0'th match for each such (N_Product, M_Product) pair - add the M_Product column from this new dataframe back into df1
- use
apply()
to fill a newResult
column in df1 with either (1) N_Product;
if the M_Product column is empty (isna()
) or (2) N_Product ':' M_Product ';' if there was an M_Product match. - drop the intermediate columns we no longer need (M_Product, df1_row).