Home > Software engineering >  Merging two Dataframes based on subject id
Merging two Dataframes based on subject id

Time:06-18

I have two dataframes defined as:

dataframe1 = pd.DataFrame( [[123, "ABC", "DEF", "GHI"],
                            [123, "ABC", "DEF", "GHI"],
                            [123, "ABC", "DEF", "GHI"],
                            [456, "JKL", "MNO", "PQR"],
                            [456, "JKL", "MNO", "PQR"],
                            [456, "JKL", "MNO", "PQR"]],
    columns=["ID",
            "Attr1", "Attr2", "Attr3"],
    )

dataframe2 = pd.DataFrame( [["1", "A", 123],
                            ["2", "A", 123],
                            ["1", "C", 456],
                            ["1", "B", 456],
                            ["2", "B", 123],
                            ["2", "C", 456]],
    columns=["ROW","LOCATION","ID"],
    )

I want to merge these two dataframe based on ID number. I have tried merge function as

pd.merge(dataframe1,dataframe2, on = 'ID')

However, that does not generate my desired result. I want an output like

    ID ROW LOCATION Attr1 Attr2 Attr3
0  123   1        A   ABC   DEF   GHI
1  123   2        A   ABC   DEF   GHI
2  123   2        B   ABC   DEF   GHI
3  456   1        B   JKL   MNO   PQR
4  456   1        C   JKL   MNO   PQR
5  456   2        C   JKL   MNO   PQR

CodePudding user response:

Is this what you are trying to get?

df_merge = pd.merge(dataframe1, dataframe2, on = ['ID']).drop_duplicates(ignore_index = True)
df_merge[['ID', 'ROW', 'LOCATION', 'Attr1', 'Attr2', 'Attr3']]

CodePudding user response:

To get your desired result, you can sort each dataframe (both on ID and dataframe2 on LOCATION and ROW as well) and then use concat to join them horizontally on their indexes, dropping the ID column from dataframe1:

df1 = dataframe1.sort_values(['ID']).reset_index(drop=True)
df2 = dataframe2.sort_values(['ID', 'LOCATION', 'ROW']).reset_index(drop=True)
pd.concat([df2,df1.drop('ID', axis=1)], axis=1)

Output:

  ROW LOCATION   ID Attr1 Attr2 Attr3
0   1        A  123   ABC   DEF   GHI
1   2        A  123   ABC   DEF   GHI
2   2        B  123   ABC   DEF   GHI
3   1        B  456   JKL   MNO   PQR
4   1        C  456   JKL   MNO   PQR
5   2        C  456   JKL   MNO   PQR

CodePudding user response:

Although the exact logic is unclear, it looks like you want a merge in order of the duplicates.

For this you can add a secondary key:

(dataframe1
 .assign(ID2=lambda d: d.groupby('ID').cumcount())
 .merge(dataframe2.assign(ID2=lambda d: d.groupby('ID').cumcount()),
       on=['ID', 'ID2'])
)

Output:

    ID Attr1 Attr2 Attr3  ID2 ROW LOCATION
0  123   ABC   DEF   GHI    0   1        A
1  123   ABC   DEF   GHI    1   2        A
2  123   ABC   DEF   GHI    2   2        B
3  456   JKL   MNO   PQR    0   1        C
4  456   JKL   MNO   PQR    1   1        B
5  456   JKL   MNO   PQR    2   2        C
  • Related