Home > OS >  Pandas: Merge values from one dataframe to another based on condition
Pandas: Merge values from one dataframe to another based on condition

Time:12-17

using fuzzy logic and fuzzywuzzy module I am able to match Names(from one dataframe) with Short Names(from another Dataframe). Both these Dataframes also contain a table ISIN.

This is the dataframe I get after logic is applied.

ISIN                                      Name Currency         Value  % Weight  Asset Type Comments/ Assumptions          matches
236   NaN            Partnerre Ltd 4.875% Perp Sr:J      USD  1.684069e 05    0.0004         NaN                   NaN
237   NaN  Berkley (Wr) Corporation 5.700% 03/30/58      USD  6.955837e 04    0.0002         NaN                   NaN
238   NaN             Tc Energy Corp Flt Perp Sr:11      USD  6.380262e 04    0.0001         NaN                   NaN   TC ENERGY CORP
239   NaN                      Cash and Equivalents      USD  2.166579e 07    0.0499         NaN                   NaN
240   NaN                                       AUM      NaN  4.338766e 08    0.9999         NaN                   NaN  AUM IND BARC US

A new column 'matches' is created which basically implies that Short name from 2nd dataframe matches Name from the first dataframe.

ISIN from dataframe1 is empty and ISIN from dataframe2 is present. Upon a subsequent Match(Name from 1st Dataframe and Short Name from 2nd Dataframe), I want to add the relevant ISIN from 2nd dataframe to 1st dataframe.

How do I get the ISIN from 2nd dataframe to the 1st dataframe so that my final output would look like this?

ISIN                                      Name Currency         Value  % Weight  Asset Type Comments/ Assumptions          matches
236   NaN            Partnerre Ltd 4.875% Perp Sr:J      USD  1.684069e 05    0.0004         NaN                   NaN
237   NaN  Berkley (Wr) Corporation 5.700% 03/30/58      USD  6.955837e 04    0.0002         NaN                   NaN
238   78s9             Tc Energy Corp Flt Perp Sr:11      USD  6.380262e 04    0.0001         NaN                   NaN   TC ENERGY CORP
239   NaN                      Cash and Equivalents      USD  2.166579e 07    0.0499         NaN                   NaN
240   123e                                       AUM      NaN  4.338766e 08    0.9999         NaN                   NaN  AUM IND BARC US

EDIT: dataframes and their in their original form df1

ISIN                                 Name Currency       Value  % Weight  Asset Type                              Comments/ Assumptions
0   NaN     Transcanada Trust 5.875 08/15/76      USD  7616765.00    0.0176         NaN  https://assets.cohenandsteers.com/assets/conte...
1   NaN      Bp Capital Markets Plc Flt Perp      USD  7348570.50    0.0169         NaN  Holding value for each constituent is derived ...
2   NaN       Transcanada Trust Flt 09/15/79      USD  7341250.00    0.0169         NaN                                                NaN
3   NaN      Bp Capital Markets Plc Flt Perp      USD  6734022.32    0.0155         NaN                                                NaN
4   NaN  Prudential Financial 5.375% 5/15/45      USD  6508290.68    0.0150         NaN                                                NaN
(241, 7)

df2

Short Name          ISIN
0  ABU DHABI COMMER  AEA000201011
1  ABU DHABI NATION  AEA002401015
2  ABU DHABI NATION  AEA006101017
3  ADNOC DRILLING C  AEA007301012
4  ALPHA DHABI HOLD  AEA007601015
(66987, 2)

EDIT 2: the fuzzy logic to get matches from the dataframes

df1 = pd.read_excel('file.xlsx', sheet_name=1, usecols=[1, 2, 3, 4, 5, 6, 8], header=1)
df2 = pd.read_excel("Excel files/file2.xlsx", sheet_name=0, usecols=[1, 2], header=1)

# empty lists for storing the matches
# later
mat1 = []
mat2 = []
p = []

# converting dataframe column
# to list of elements
# to do fuzzy matching
list1 = df1['Name'].tolist()
list2 = df2['Short Name'].tolist()

# taking the threshold as 80
threshold = 93

# iterating through list1 to extract
# it's closest match from list2
for i in list1:
    mat1.append(process.extractOne(i, list2, scorer=fuzz.token_set_ratio))
df1['matches'] = mat1

# iterating through the closest matches
# to filter out the maximum closest match
for j in df1['matches']:
    if j[1] >= threshold:
        p.append(j[0])
    mat2.append(",".join(p))
    p = []

# storing the resultant matches back
# to df1
df1['matches'] = mat2
print("\nDataFrame after Fuzzy matching using token_set_ratio():")
#print(df1.to_csv('todays-result1.csv'))
print(df1.head(20))

CodePudding user response:

Assuming your first dataframe has ISINs filled out to null, then a simple merge will do what you need. If you need the non-null ISINs in the first dataframe to be preserved, then you need to use a boolean mask:-

df1 = pd.DataFrame(
  [[None, "Apple", "appl"], 
  [None, "Google", "ggl"], 
  [None, "Amazon", 'amzn']], 
  columns=["ISIN", "Name", "matches"]
)

df2 = pd.DataFrame(
  [["ISIN1", "appl"], 
  ["ISIN2", "ggl"]], 
  columns= ["ISIN", "Short Name"]
)

missing_isin = df1['ISIN'].isnull()

df1.loc[missing_isin, 'ISIN'] = df1.loc[missing_isin][['matches']].merge(
    df2[['ISIN', 'Short Name']], 
    how='left', 
    left_on='matches', 
    right_on='Short Name'
)['ISIN']

left_on / right_on :- Column names to match the dataframes on

how='left' :- (In simple terms) Preserves the order/index of the leftmost dataframe, check out the docs for more info

  • Related