how to conditionally (value in column) search for substrings in another column in python loop-CodePudding

I need to do a substring search in a string by condition in the second column. I have 2 dataframes:

df1 = {'Descr': ["VALVE, PRESSURE", "pump ttf", "Valve, electrical", "Geeku, electrical","VALVE, OVERBOARD, BUTTERFLY"],
        'N_Product': ["VALVE", "PUMP", "VALVE", "GEEKU","VALVE"],
        }
df2 = {'N_Product': ["VALVE", "VALVE","VALVE", "PUMP", "GEEKU"],
        'M_Product': ["PRESSURE", "ELECTRICAL","", "", "ELECTRICAL"],
        }
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)

you need to compare the values from the N_Product columns from the two dataframes, if they match, then search for the corresponding values from the M_Product column (for example, for Valve - there are 3 matches: "PRESSURE", "ELECTRICAL" and "" (empty value) in the Descr column, if it is there, then add it to the Result column. To search for substrings in a string, I use the following code:

c = df2['M_Product'].astype(str).to_list()
def matcher(x):
    for i in c:
        if i.lower() in x.lower():
            return i
    else:
        return np.nan
df1['Res'] = df1['Descr'].apply(matcher)

but I don't know how to cycle through the values of only the corresponding M_Product for N_Product.

Desired result:

df1 = {'Descr': ["VALVE, PRESSURE", "pump ttf", "Valve, electrical", "Geeku, electrical","VALVE, OVERBOARD, BUTTERFLY"],
        'N_Product': ["VALVE", "PUMP", "VALVE", "GEEKU","VALVE"],
        },
'Result': ["VALVE: PRESSURE;", "PUMP", "VALVE: ELECTRICAL;", "GEEKU: ELECTRICAL;","VALVE;"],
        }

I would be grateful for any help)

CodePudding user response：

(UPDATED)

Based on the updated question, my understanding of what's being asked is this:

Create a new Result column
If the N_Product columns in df1 and df2 for a given row match, then append to the value from the N_Product column in df1 the first match in column M_Product of df2 of a string found in the given row's Descr column in df1 (with an intervening : character).
Otherwise, put N_Product from df1 in the Result column.
Also append a ; character to what is put in Result.

Here is a way to do that:

def foo(x):
    descr = x['Descr'].upper()
    match = None
    for mStr in df2['M_Product'].str.upper():
        if mStr in descr:
            match = mStr
            break
    if match is None:
        return x['N_Product']   ';'
    else:
        return x['N_Product']   ': '   match   ';'
mask = df1['N_Product'] == df2['N_Product']
df1.loc[mask, 'Result'] = df1.apply(foo, axis = 1)
df1.loc[~mask, 'Result'] = df1['N_Product']   ';'

Explanation:

Create a boolean Series mask that is True for rows of df1 with N_Product matching the corresponding value in df2.
For rows in df1 where mask is True, use apply to call foo which performs the logic of identifying the first value (if any) in the M_Product column of df2 that is found in a given row's Descr column and packaging it in a string of the form N_Product: M_Product; if found, otherwise just N_Product;.
For rows in df1 where mask is False (namely: ~mask), set the Result column to be N_Product;.

Input:

df1:
                         Descr N_Product
0              VALVE, PRESSURE     VALVE
1                     pump ttf      PUMP
2            Valve, electrical     VALVE
3            Geeku, electrical     GEEKU
4  VALVE, OVERBOARD, BUTTERFLY     VALVE

df2:
  N_Product   M_Product
0     VALVE    PRESSURE
1     VALVE  ELECTRICAL
2     VALVE
3      PUMP
4     GEEKU         MBA

Output:

                         Descr N_Product              Result
0              VALVE, PRESSURE     VALVE    VALVE: PRESSURE;
1                     pump ttf      PUMP               PUMP;
2            Valve, electrical     VALVE  VALVE: ELECTRICAL;
3            Geeku, electrical     GEEKU              GEEKU;
4  VALVE, OVERBOARD, BUTTERFLY     VALVE              VALVE;