Home > database >  Pandas: Match values from two dataframes - Many to One
Pandas: Match values from two dataframes - Many to One

Time:04-23

I have two dataframes I need to match by row. Where a match occurs I need to increment the value 1 in of a field in df1. df2 has mulitple matches to df1. I don't want to merge the dataframes, just update df1 based off a match to df2.

The basic logic in my head is read the first row of df1, then try to match TRANID to each row of df2. When a match occurs, add 1 to the NUMINSTS value. Then loop back and do the same for the next row on df1. I'm just not sure how to approach this in Python/Pandas.

I'm an old COBOL programmer and am just learning Python/Pandas so any help is greatly appreciated.

    Input Data
    
    df1:
    
    TRANID      NUMINSTS
    60000022    22
    60000333    6
    70000001    15
    70000233    60
    
    df2:

    TRANID
    60000333
    70000233
    70000233

    Output

    df3:

    TRANID      NUMINSTS
    60000022    22
    60000333    7   #incremented by 1
    70000001    15
    70000233    62  #incremented by 2

CodePudding user response:

We can filter based on the values in df2 and keep adding or changing values in df1.

import pandas as pd

df1 = pd.DataFrame({"TRAINID":["60000022", "60000333", "70000001", "70000233"], "NUMINSTS":[22,6,15,60]})
df2 = pd.DataFrame({"TRAINID":[ "60000333", "70000233", "70000233"]})

def add_num(df1,df2):
    for id in list(df2["TRAINID"]):
        df1.loc[df1["TRAINID"] == id, "NUMINSTS"]  = 1
    return df1

df3 = add_num(df1,df2)
print(df3)

CodePudding user response:

You want two cases:

  1. Tranid exists in df1
  2. Tranid doesn't exist in df1.

Here is your code:

import pandas as pd
df1=pd.DataFrame({'tranid':[1,2,3],'numinst':[2,4,6]})
df2=pd.DataFrame({'tranid':[1,2,4]})
tranvalues=df1['tranid']
for i in range(len(df2)):
    if df2['tranid'][i] in tranvalues:
        df1['numinst'][df1['tranid']==df2['tranid'][i]]=df1['numinst'] 1
    else:
        df1.loc[len(df1.index)]=[df2['tranid'][i],1]

CodePudding user response:

You may try:

    df1 = pd.DataFrame({'TRANID':[60000022, 60000333, 70000001, 70000233],
'NUMINSTS':[22,6,15,60]})

df1:

    TRANID  NUMINSTS
0   60000022          22
1   60000333           6
2   70000001          15
3   70000233          60

df2 = pd.DataFrame({'TRANID':[60000333, 70000233, 70000233]})

df2:

    TRANID
0   60000333
1   70000233
2   70000233

Build a dictionary of counts of TRANID values from df2:

d = df2['TRANID'].value_counts().to_dict()

Copy df3 from df1 and update the NUMINSTS column like if the TRANID is in the above dictionary , increment by the value count otherwise keep it the same:

df3 = df1.copy()


df3['NUMINSTS'] = df3.apply(
        lambda row: 
    row['NUMINSTS'] d[row['TRANID']] if row['TRANID'] in d else row['NUMINSTS'], axis=1)

If you don't want the rows that don't match, you could replace None like below and then drop those with None values:

df3['NUMINSTS'] = df3.apply(
            lambda row: 
        row['NUMINSTS'] d[row['TRANID']] if row['TRANID'] in d else None, axis=1)
df3.dropna(subset=['NUMINSTS'], inplace=True)
df3['NUMINSTS'] = df3['NUMINSTS'].astype(int)
df3.reset_index(inplace=True,drop=True)

Output df3:

    TRANID    NUMINSTS
0   60000333         7
1   70000233        62
  • Related