Map two dataframe base on a column and create a new column. Also match partial matching-CodePudding

I have two dataframe

One with codes and values need to map to other dataframe

B = pd.DataFrame({'Code': ['a', 'b', 'c', 'a', 'e','b','b','c'],
                  'Value': ["House with indoor pool", "House with Gray_C_Door", "Big Chandelier",
                            "Window Glass", "Frame Window",'High Column','Wood Raling', 'Window Glass trim']})

Other datframe content lots of data with values and need to make a new column base on dataframe "B" column "Code".

A = pd.DataFrame({'Test': [2,34,12,45,np.nan,34,56,23,56,87,23,67,89,123,np.nan],
                  'Name': [ "House with indoor pool","House with Gray_C_Door",'House with indoor pool and Porch',"Wood Raling",
                           'Window Glass Tinted',"Windows Glass_with",'Big Chandelier', "Frame Window",np.nan,"Window glass","House with indoor pool",'High column with',
                           "Window Glass trim",'Frame Window',"glass Window"],
                 'Value': ["50", "100", "70", "20", "15",'75','50',"10", "10", "34", "5", "56",'12','83',np.nan]})
A.loc[:,'NewName'] = A['Name']

So I'm using the below code to replace A['NewName'].

A['NewName']= A['NewName'].replace(B.set_index('Value')['Code'])

    Test    Name                                Value   NewName
0   2.0000  House with indoor pool              50      a
1   34.0000 House with Gray_C_Door              100     b
2   12.0000 House with indoor pool and Porch    70      House with indoor pool and Porch
3   45.0000 Wood Raling                         20      b
4   NaN     Window Glass Tinted                 15      Window Glass Tinted
5   34.0000 Windows Glass_with                  75      Windows Glass_with
6   56.0000 Big Chandelier                      50      c
7   23.0000 Frame Window                        10      e
8   56.0000 NaN                                 10      NaN
9   87.0000 Window glass                        34      Window glass
10  23.0000 House with indoor pool              5       a
11  67.0000 High column with                    56      High column with
12  89.0000 Window Glass trim                   12      c
13  123.000 Frame Window                        83      e
14  NaN     glass Window                        NaN     glass Window

However, some A['NewName'] are not matching with B['Value'] and doesn't give an exact expected outcome.

Is there a way, I can match those values when It has partial matching with A['NewName'] and give the correct code? I mean for instance when A['NewName'] has "House with indoor pool and Porch" I want to match it with B['Value'] = 'House with indoor pool' and replace it with correct B['Code] = 'a'. I couldn't add that to the data frame B Values column because there are several ways it could change after "House with indoor pool" (for ex: "House with indoor pool_ with big glass door", "House with indoor pool and High railings" etc.)

Is this possible to do it in a map/replace function or any other method?

Thanks in advacne!

CodePudding user response：

Here's a way to do it using difflib. We effectively do the outer product of comparisons of Name and Value for which there's no exact match, to find the best fit (according to difflib) before making a selection.

To flag really poor matches, we can set a threshold for the difflib match ratio below which we return NaN. I have selected 0.5 in the code below, which catches the sample input Name "I don't match anything".

import pandas as pd
import numpy as np
B = pd.DataFrame({'Code': ['a', 'b', 'c', 'a', 'e','b','b','c'],
                  'Value': ["House with indoor pool", "House with Gray_C_Door", "Big Chandelier",
                            "Window Glass", "Frame Window",'High Column','Wood Raling', 'Window Glass trim']})
A = pd.DataFrame({'Test': [2,34,12,45,np.nan,34,56,23,56,87,23,67,89,123,np.nan,333],
                  'Name': [ "House with indoor pool","House with Gray_C_Door",'House with indoor pool and Porch',"Wood Raling",
                           'Window Glass Tinted',"Windows Glass_with",'Big Chandelier', "Frame Window",np.nan,"Window glass","House with indoor pool",'High column with',
                           "Window Glass trim",'Frame Window',"glass Window","I don't match anything"],
                 'Value': ["50", "100", "70", "20", "15",'75','50',"10", "10", "34", "5", "56",'12','83',np.nan,'999']})

from difflib import SequenceMatcher
bValues = B['Value'].to_list()
bValuesDict = {value : idx for idx, value in enumerate(bValues)}
columnCode = B.columns.to_list().index('Code')
def getCode(x):
    name = x['Name']
    if not isinstance(name, str):
        return np.nan
    if name in bValuesDict: 
        idx = bValuesDict[name]
    else:
        bMatches = [SequenceMatcher(None, name, value).ratio() for value in bValues]
        maxRatio = max(bMatches)
        if maxRatio < 0.5:
            return np.nan
        idx = bMatches.index(maxRatio)
    code = B.iloc[idx, columnCode]
    return code
A['NewName'] = A.apply(getCode, axis=1)

print(A)

Output:

     Test                              Name Value NewName
0     2.0            House with indoor pool    50       a
1    34.0            House with Gray_C_Door   100       b
2    12.0  House with indoor pool and Porch    70       a
3    45.0                       Wood Raling    20       b
4     NaN               Window Glass Tinted    15       c
5    34.0                Windows Glass_with    75       a
6    56.0                    Big Chandelier    50       c
7    23.0                      Frame Window    10       e
8    56.0                               NaN    10     NaN
9    87.0                      Window glass    34       a
10   23.0            House with indoor pool     5       a
11   67.0                  High column with    56       b
12   89.0                 Window Glass trim    12       c
13  123.0                      Frame Window    83       e
14    NaN                      glass Window   NaN       e
15  333.0            I don't match anything   999     NaN

CodePudding user response：

You can use re (regular expression) in your own function and apply this function to A['Name'] (btw initializing 'Newname' is useless here):

import re
import pandas as pd
import numpy as np

B = pd.DataFrame({'Code': ['a', 'b', 'c', 'a', 'e','b','b','c'],
                  'Value': ["House with indoor pool", "House with Gray_C_Door", "Big Chandelier",
                            "Window Glass", "Frame Window",'High Column','Wood Raling', 'Window Glass trim']})

A = pd.DataFrame({'Test': [2,34,12,45,np.nan,34,56,23,56,87,23,67,89,123,np.nan],
                  'Name': [ "House with indoor pool","House with Gray_C_Door",'House with indoor pool and Porch',"Wood Raling",
                           'Window Glass Tinted',"Windows Glass_with",'Big Chandelier', "Frame Window",np.nan,"Window glass","House with indoor pool",'High column with',
                           "Window Glass trim",'Frame Window',"glass Window"],
                 'Value': ["50", "100", "70", "20", "15",'75','50',"10", "10", "34", "5", "56",'12','83',np.nan]})

B_series = B.set_index('Value')

def get_code_for_matching_value(val):
    if type(val)==str:
        if val in B_series.index:
            return B_series.at[val, 'Code']
        else:
            for i in B_series.index:
                if re.match(i, val):
                    return B_series.at[i, 'Code']
    return val
        
A['NewName']= A['Name'].apply(get_code_for_matching_value)
print(A)

Output:

     Test                              Name Value             NewName
0     2.0            House with indoor pool    50                   a
1    34.0            House with Gray_C_Door   100                   b
2    12.0  House with indoor pool and Porch    70                   a
3    45.0                       Wood Raling    20                   b
4     NaN               Window Glass Tinted    15                   a
5    34.0                Windows Glass_with    75  Windows Glass_with
6    56.0                    Big Chandelier    50                   c
7    23.0                      Frame Window    10                   e
8    56.0                               NaN    10                 NaN
9    87.0                      Window glass    34        Window glass
10   23.0            House with indoor pool     5                   a
11   67.0                  High column with    56    High column with
12   89.0                 Window Glass trim    12                   c
13  123.0                      Frame Window    83                   e
14    NaN                      glass Window   NaN        glass Window

Note: you can improve matching by making it case insensitive (for example): if re.match(i.lower(), val.lower()):