Home > Enterprise >  pandas, how to merge tow dataframe with multiple columns IDs
pandas, how to merge tow dataframe with multiple columns IDs

Time:11-07

Sorry, I am new with pandas and I'm struggling. Basically, I have two dataset df1:

sites nb samples
A 3
B 2
C 1

df2:

sites1 sites2 nb links
A B 3
A C 1

What I want to get is output:

sites1 sites2 nb links nb samples sites1 nb samples sites2
A B 3 3 2
A C 1 3 1

I tried several things like:

print(pd.merge(df2,df1 ,left_on='sites1',right_on=['sites']))

But iam far away from the result I want to get.

Many thanks in advance for some helps.

CodePudding user response:

A one-liner using pandas.DataFrame.merge and .rename()

df_new = df2.merge(df1, left_on='sites1', right_on='sites', how='left').merge(df1, left_on='sites2', right_on='sites', how='left')[['sites1', 'sites2', 'nb links', 'nb samples_x', 'nb samples_y']].rename(columns={'nb samples_x': 'nb samples sites1', 'nb samples_y': 'nb samples sites2'})

[Out]:

  sites1 sites2  nb links  nb samples sites1  nb samples sites2
0      A      B         3                  3                  2
1      A      C         1                  3                  1

Notes

  • Let us break down to make it easier to understand:

    1. One starts by merging both dataframes

      df_new = df2.merge(df1, left_on='sites1', right_on='sites', how='left').merge(df1, left_on='sites2', right_on='sites', how='left')
      
      [Out]:
      
        sites1 sites2  nb links sites_x  nb samples_x sites_y  nb samples_y
      0      A      B         3       A             3       B             2
      1      A      C         1       A             3       C             1
      
    2. Selecting only the columns one wants to consider

      df_new = df_new[['sites1', 'sites2', 'nb links', 'nb samples_x', 'nb samples_y']]
      
      [Out]:
      
        sites1 sites2  nb links  nb samples_x  nb samples_y
      0      A      B         3             3             2
      1      A      C         1             3             1
      
    3. Renaming the columns

      df_new.columns = ['sites1', 'sites2', 'nb links', 'nb samples sites1', 'nb samples sites2']
      
      [Out]:
      
        sites1 sites2  nb links  nb samples sites1  nb samples sites2
      0      A      B         3                  3                  2
      1      A      C         1                  3                  1
      
  • Related