You need to map values from another data frame in 2 conditions-CodePudding

I need to do a substring search in a string by condition in the second column. I have 2 dataframes: df1 df2

(Step 1) For the first row in df1, the N_Product column is VALVE.

(Step 2) Look for VALVE in the N_Product column of each row df2 and find 3 matches with the following pairs (

df2 ['N_Product'] (VALVE) - df2 ['M_Product'] (DONE),
df2 ['N_Product'] (VALVE) - df2 ['M_Product'] (PRESSURE),
df2 ['N_Product'] (VALVE) - df2 ['M_Product'] ('').

(Step 3) Then you need to check if M_Product contains the following values:

df2 ['N_Product'] (VALVE) - df2 ['M_Product'] (DONE),
df2 ['N_Product'] (VALVE) - df2 ['M_Product'] (PRESSURE),
df2 ['N_Product'] (VALVE) - df2 ['M_Product'] ('')

in df1 ['Descr'], if contained, you need to write N_Product ":" M_Product ";", if not , then just N_Product ';' .For 'VALVE' you need to look for df2 ["M_Product"] in df1 ['Descr'] only "DONE", "PRESSURE" and "", others are not needed, for N_Product ('GEEKU') - only "ELECTRICAL", "OVERBOARD" and "" (values), etc., depending on the values ('M_Product') that correspond to ('N_Product'), the values ('M_Product') that correspond to other values ('N_Product') to look for in Df1 ['Descr '] - do not need


    df1 = {'Descr': ["VALVE, DONE", "pump ttf", "pump electrical", "Valve, ww","Geeku MBA , electrical","valve PRESSURE, OVERBOARD","VALVE, Electrical DONE","Geeku electrical OVERBOARD","Geeku OVERBOARD , electrical"],
            'N_Product': ["VALVE", "PUMP", "PUMP", "VALVE", "GEEKU","VALVE","VALVE", "GEEKU", "GEEKU"],
            }
    df2 = {'N_Product': ["GEEKU","GEEKU","GEEKU", "PUMP", "PUMP","VALVE", "VALVE","VALVE"],
            'M_Product': ["ELECTRICAL", "OVERBOARD","", "TTF","", "DONE","PRESSURE",""],
            }
    df1 = pd.DataFrame(df1)
    df2 = pd.DataFrame(df2)

desired result

I use this code, but it searches for df2 ['M_Product'] for all values, but only for those where df1 ['N_product'] == df2 ['N_Product'] / I would appreciate any help in resolving this problems

def foo(x):
    descr = x['Descr'].upper()
    match = None
    if x['N_Product'].upper() in list(df2['N_Product']):
        for mStr in df2['M_Product'].str.upper():
            if mStr in descr:
                match = mStr
                break
    if match is None:
        return x['N_Product']   ';'
    else:
        return x['N_Product']   ': '   match   ';'
df1['Result'] = df1.apply(foo, axis = 1)

I add a picture that visualizes what needs to be done, for example the value of df1 ['N_Product'] "Valve") similarly everything needs to be done for all values:

picture

CodePudding user response：

Based on the result you've described using the picture in your question, here is my understanding of what you're trying to do:

Each N_Product value has an associated list of M_Product values in df2.
Each N_Product value in df1 has a Descr value that is a csv list of the following format: N_Product followed by one or more compatible values of M_Product for this row.
Objective: Append a Result column to df1 containing the N_Product value n for each row with the first of the corresponding Descr M_Product values m such that (n, m) is found in df2.

Here is some code that I believe will do what you've asked:

import pandas as pd
df1 = {'Descr': ["VALVE, DONE", "pump ttf", "pump electrical", "Valve, ww","Geeku MBA , electrical","valve PRESSURE, OVERBOARD","VALVE, Electrical DONE","Geeku electrical OVERBOARD","Geeku OVERBOARD , electrical"],
            'N_Product': ["VALVE", "PUMP", "PUMP", "VALVE", "GEEKU","VALVE","VALVE", "GEEKU", "GEEKU"],
            }
df2 = {'N_Product': ["GEEKU","GEEKU","GEEKU", "PUMP", "PUMP","VALVE", "VALVE","VALVE"],
            'M_Product': ["ELECTRICAL", "OVERBOARD","", "TTF","", "DONE","PRESSURE",""],
            }
df1 = pd.DataFrame(df1).apply(lambda x: x.astype(str).str.upper())
df2 = pd.DataFrame(df2).apply(lambda x: x.astype(str).str.upper())

print('df1:'); print(df1)
print('df2:'); print(df2)

df1['M_Product'] = df1['Descr'].apply(lambda x: [val.strip(',') for val in x.split() if val.strip(',')]).str.slice(start=1)
df1['df1_row'] = df1.index

df3 = df1[['df1_row', 'N_Product', 'M_Product']].explode('M_Product')
df5 = df3.merge(df2, on=['N_Product', 'M_Product']).groupby('df1_row').nth(0)
df1['M_Product'] = df5['M_Product']

df1['Result'] = df1['N_Product']   (~df1['M_Product'].isna()) * (': '   df1['M_Product'].astype(str).str.strip())   ';'
df1 = df1.drop(columns=['M_Product', 'df1_row'])
print('result:'); print(df1)

Output:

df1:
                          Descr N_Product
0                   VALVE, DONE     VALVE
1                      PUMP TTF      PUMP
2               PUMP ELECTRICAL      PUMP
3                     VALVE, WW     VALVE
4        GEEKU MBA , ELECTRICAL     GEEKU
5     VALVE PRESSURE, OVERBOARD     VALVE
6        VALVE, ELECTRICAL DONE     VALVE
7    GEEKU ELECTRICAL OVERBOARD     GEEKU
8  GEEKU OVERBOARD , ELECTRICAL     GEEKU
df2:
  N_Product   M_Product
0     GEEKU  ELECTRICAL
1     GEEKU   OVERBOARD
2     GEEKU
3      PUMP         TTF
4      PUMP
5     VALVE        DONE
6     VALVE    PRESSURE
7     VALVE
result:
                          Descr N_Product              Result
0                   VALVE, DONE     VALVE        VALVE: DONE;
1                      PUMP TTF      PUMP          PUMP: TTF;
2               PUMP ELECTRICAL      PUMP               PUMP;
3                     VALVE, WW     VALVE              VALVE;
4        GEEKU MBA , ELECTRICAL     GEEKU  GEEKU: ELECTRICAL;
5     VALVE PRESSURE, OVERBOARD     VALVE    VALVE: PRESSURE;
6        VALVE, ELECTRICAL DONE     VALVE        VALVE: DONE;
7    GEEKU ELECTRICAL OVERBOARD     GEEKU  GEEKU: ELECTRICAL;
8  GEEKU OVERBOARD , ELECTRICAL     GEEKU  GEEKU: ELECTRICAL;

Explanation:

make everything in df1 and df2 uppercase to simplify matching
split Descr into tokens and except for the first one (which is just a duplicate of N_Product), put them into a list in df1 in a new column named M_Product
record the index of the original df1 rows in a column named df1_row
use explode() to create a dataframe df3 with one row for each value in the above M_Product column in df1
use merge() to select rows in df3 that match rows in df2 on (N_Product, M_Product)
use groupby() on df1_row and nth(0) to take the 0'th match for each such (N_Product, M_Product) pair
add the M_Product column from this new dataframe back into df1
use apply() to fill a new Result column in df1 with either (1) N_Product ; if the M_Product column is empty (isna()) or (2) N_Product ':' M_Product ';' if there was an M_Product match.
drop the intermediate columns we no longer need (M_Product, df1_row).