I have two dataframe
One with codes and values need to map to other dataframe
B = pd.DataFrame({'Code': ['a', 'b', 'c', 'a', 'e','b','b','c'],
'Value': ["House with indoor pool", "House with Gray_C_Door", "Big Chandelier",
"Window Glass", "Frame Window",'High Column','Wood Raling', 'Window Glass trim']})
Other datframe content lots of data with values and need to make a new column base on dataframe "B" column "Code".
A = pd.DataFrame({'Test': [2,34,12,45,np.nan,34,56,23,56,87,23,67,89,123,np.nan],
'Name': [ "House with indoor pool","House with Gray_C_Door",'House with indoor pool and Porch',"Wood Raling",
'Window Glass Tinted',"Windows Glass_with",'Big Chandelier', "Frame Window",np.nan,"Window glass","House with indoor pool",'High column with',
"Window Glass trim",'Frame Window',"glass Window"],
'Value': ["50", "100", "70", "20", "15",'75','50',"10", "10", "34", "5", "56",'12','83',np.nan]})
A.loc[:,'NewName'] = A['Name']
So I'm using the below code to replace A['NewName'].
A['NewName']= A['NewName'].replace(B.set_index('Value')['Code'])
Test Name Value NewName
0 2.0000 House with indoor pool 50 a
1 34.0000 House with Gray_C_Door 100 b
2 12.0000 House with indoor pool and Porch 70 House with indoor pool and Porch
3 45.0000 Wood Raling 20 b
4 NaN Window Glass Tinted 15 Window Glass Tinted
5 34.0000 Windows Glass_with 75 Windows Glass_with
6 56.0000 Big Chandelier 50 c
7 23.0000 Frame Window 10 e
8 56.0000 NaN 10 NaN
9 87.0000 Window glass 34 Window glass
10 23.0000 House with indoor pool 5 a
11 67.0000 High column with 56 High column with
12 89.0000 Window Glass trim 12 c
13 123.000 Frame Window 83 e
14 NaN glass Window NaN glass Window
However, some A['NewName'] are not matching with B['Value'] and doesn't give an exact expected outcome.
Is there a way, I can match those values when It has partial matching with A['NewName'] and give the correct code? I mean for instance when A['NewName'] has "House with indoor pool and Porch" I want to match it with B['Value'] = 'House with indoor pool' and replace it with correct B['Code] = 'a'. I couldn't add that to the data frame B Values column because there are several ways it could change after "House with indoor pool" (for ex: "House with indoor pool_ with big glass door", "House with indoor pool and High railings" etc.)
Is this possible to do it in a map/replace function or any other method?
Thanks in advacne!
CodePudding user response:
Here's a way to do it using difflib
. We effectively do the outer product of comparisons of Name
and Value
for which there's no exact match, to find the best fit (according to difflib
) before making a selection.
To flag really poor matches, we can set a threshold for the difflib
match ratio below which we return NaN. I have selected 0.5 in the code below, which catches the sample input Name
"I don't match anything".
import pandas as pd
import numpy as np
B = pd.DataFrame({'Code': ['a', 'b', 'c', 'a', 'e','b','b','c'],
'Value': ["House with indoor pool", "House with Gray_C_Door", "Big Chandelier",
"Window Glass", "Frame Window",'High Column','Wood Raling', 'Window Glass trim']})
A = pd.DataFrame({'Test': [2,34,12,45,np.nan,34,56,23,56,87,23,67,89,123,np.nan,333],
'Name': [ "House with indoor pool","House with Gray_C_Door",'House with indoor pool and Porch',"Wood Raling",
'Window Glass Tinted',"Windows Glass_with",'Big Chandelier', "Frame Window",np.nan,"Window glass","House with indoor pool",'High column with',
"Window Glass trim",'Frame Window',"glass Window","I don't match anything"],
'Value': ["50", "100", "70", "20", "15",'75','50',"10", "10", "34", "5", "56",'12','83',np.nan,'999']})
from difflib import SequenceMatcher
bValues = B['Value'].to_list()
bValuesDict = {value : idx for idx, value in enumerate(bValues)}
columnCode = B.columns.to_list().index('Code')
def getCode(x):
name = x['Name']
if not isinstance(name, str):
return np.nan
if name in bValuesDict:
idx = bValuesDict[name]
else:
bMatches = [SequenceMatcher(None, name, value).ratio() for value in bValues]
maxRatio = max(bMatches)
if maxRatio < 0.5:
return np.nan
idx = bMatches.index(maxRatio)
code = B.iloc[idx, columnCode]
return code
A['NewName'] = A.apply(getCode, axis=1)
print(A)
Output:
Test Name Value NewName
0 2.0 House with indoor pool 50 a
1 34.0 House with Gray_C_Door 100 b
2 12.0 House with indoor pool and Porch 70 a
3 45.0 Wood Raling 20 b
4 NaN Window Glass Tinted 15 c
5 34.0 Windows Glass_with 75 a
6 56.0 Big Chandelier 50 c
7 23.0 Frame Window 10 e
8 56.0 NaN 10 NaN
9 87.0 Window glass 34 a
10 23.0 House with indoor pool 5 a
11 67.0 High column with 56 b
12 89.0 Window Glass trim 12 c
13 123.0 Frame Window 83 e
14 NaN glass Window NaN e
15 333.0 I don't match anything 999 NaN
CodePudding user response:
You can use re
(regular expression) in your own function and apply this function to A['Name']
(btw initializing 'Newname' is useless here):
import re
import pandas as pd
import numpy as np
B = pd.DataFrame({'Code': ['a', 'b', 'c', 'a', 'e','b','b','c'],
'Value': ["House with indoor pool", "House with Gray_C_Door", "Big Chandelier",
"Window Glass", "Frame Window",'High Column','Wood Raling', 'Window Glass trim']})
A = pd.DataFrame({'Test': [2,34,12,45,np.nan,34,56,23,56,87,23,67,89,123,np.nan],
'Name': [ "House with indoor pool","House with Gray_C_Door",'House with indoor pool and Porch',"Wood Raling",
'Window Glass Tinted',"Windows Glass_with",'Big Chandelier', "Frame Window",np.nan,"Window glass","House with indoor pool",'High column with',
"Window Glass trim",'Frame Window',"glass Window"],
'Value': ["50", "100", "70", "20", "15",'75','50',"10", "10", "34", "5", "56",'12','83',np.nan]})
B_series = B.set_index('Value')
def get_code_for_matching_value(val):
if type(val)==str:
if val in B_series.index:
return B_series.at[val, 'Code']
else:
for i in B_series.index:
if re.match(i, val):
return B_series.at[i, 'Code']
return val
A['NewName']= A['Name'].apply(get_code_for_matching_value)
print(A)
Output:
Test Name Value NewName
0 2.0 House with indoor pool 50 a
1 34.0 House with Gray_C_Door 100 b
2 12.0 House with indoor pool and Porch 70 a
3 45.0 Wood Raling 20 b
4 NaN Window Glass Tinted 15 a
5 34.0 Windows Glass_with 75 Windows Glass_with
6 56.0 Big Chandelier 50 c
7 23.0 Frame Window 10 e
8 56.0 NaN 10 NaN
9 87.0 Window glass 34 Window glass
10 23.0 House with indoor pool 5 a
11 67.0 High column with 56 High column with
12 89.0 Window Glass trim 12 c
13 123.0 Frame Window 83 e
14 NaN glass Window NaN glass Window
Note: you can improve matching by making it case insensitive (for example): if re.match(i.lower(), val.lower()):