Join/Merge two or more pandas dataframes which have 4 columns in common

Time:04-15

I know this question might seem repetitive at first, but the truth is that it is not, since I cannot find another answer or similar question whose solution works for me.

I am working with pandas DataFrames in Python.

Suppose we have 3 datasets: A, B and C.

  1. B and C are sub-datasets of A (dataframe A was split in two according to a binary column value)
  2. The datasets are of different lengths (A is the biggest, B and C are smaller since they are sub-datasets of A)
  3. Each dataset has 5 columns: a, b, c, d and e (they have the same column names)
  4. Columns a, b, c and d (taken all together) do not have any repetitions (they could be a composed key of a database)
  5. Column e is different for each dataset
  6. Each dataset has a different index from the others (they were NOT generated using pandas.loc, I am dealing with them "already built and taken from outside python")

My question is: how can I put all of these together without losing any rows, pairing them correctly without using the index?

Here's an example:

* Content of A *:
a   b   c   d   e
"x" "y" 0   1   0.99        # 0
"x" "y" 1   1   0.43        # 1
"x" "z" 0   0   0.90        # 2
"y" "z" 0   1   0.11        # 3
"x" "z" 0   1   0.78        # 4

* Content of B *:
a   b   c   d   e
"x" "y" 0   1   0.12        # 0 of dataframe A
"x" "z" 0   0   0.01        # 2 of dataframe A
"y" "z" 0   1   0.45        # 3 of dataframe A

* Content of C *:
a   b   c   d   e
"x" "y" 1   1   0.06        # 1 of dataframe A
"x" "z" 0   0   0.65        # 2 of dataframe A
"x" "z" 0   1   0.20        # 4 of dataframe A

I would like to obtain this output:

* Content of new_df *:
a   b   c   d   e_A   e_B   e_C
"x" "y" 0   1   0.99  0.12  NaN
"x" "y" 1   1   0.43  NaN   0.06
"x" "z" 0   0   0.90  0.01  0.65
"y" "z" 0   1   0.11  0.45  NaN
"x" "z" 0   1   0.78  NaN   0.20

My first attempt was the following code, but it deleted the rows which did not have all three values (instead, I need NaN to be inserted there).

new_df1 = pd.merge(A, B, how='left', on=["a", "b", "c", "d"])
new_df2 = pd.merge(new_df1, C, how='left', on=["a", "b", "c", "d"])

How can I achieve my objective of getting a full dataset made of all columns (5) and all rows (A contains the maximum amount of rows)?

reproducible input:

A = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'x'],
                  'b': ['y', 'y', 'z', 'z', 'z'],
                  'c': [0, 1, 0, 0, 0],
                  'd': [1, 1, 0, 1, 1],
                  'e': [0.99, 0.43, 0.9, 0.11, 0.78]})

B = pd.DataFrame({'a': ['x', 'x', 'y'],
                  'b': ['y', 'z', 'z'],
                  'c': [0, 0, 0],
                  'd': [1, 0, 1],
                  'e': [0.12, 0.01, 0.45]})

C = pd.DataFrame({'a': ['x', 'x', 'x'],
                  'b': ['y', 'z', 'z'],
                  'c': [1, 0, 0],
                  'd': [1, 0, 1],
                  'e': [0.06, 0.65, 0.2]})

CodePudding user response:

This seems to work for me.

import pandas as pd

a = pd.read_csv('a.csv')
b = pd.read_csv('b.csv')
c = pd.read_csv('c.csv')

e = a.merge(b, how='left', on=['a', 'b', 'c', 'd'], suffixes=('_A', '_B'))
e = e.merge(c, how='left', on=['a', 'b', 'c', 'd'])

e = e.rename(columns={'e': 'e_C'})

print(e.head())

A bit of explanation: pd.DataFrame.merge() lets you state the joining columns explicitly (see its documentation), so we can merge on a, b, c and d directly.

The suffixes parameter renames overlapping non-key columns by appending the given suffix on each side (left_df_suffix, right_df_suffix).

The rest is as you tried. After the first merge the left frame no longer has a column named e, so the e column from c.csv keeps its name, and I rename it to e_C after merging. Hope this helps.
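As a self-contained check of this approach, here is a minimal sketch that uses the reproducible input from the question instead of CSV files. Passing on= is equivalent to giving the same list to both left_on and right_on. The validate='one_to_one' argument is my own addition, not part of the answer above; it assumes that (a, b, c, d) really is unique in every frame, as the question states, and makes pandas raise an error otherwise.

```python
import pandas as pd

A = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'x'],
                  'b': ['y', 'y', 'z', 'z', 'z'],
                  'c': [0, 1, 0, 0, 0],
                  'd': [1, 1, 0, 1, 1],
                  'e': [0.99, 0.43, 0.9, 0.11, 0.78]})
B = pd.DataFrame({'a': ['x', 'x', 'y'],
                  'b': ['y', 'z', 'z'],
                  'c': [0, 0, 0],
                  'd': [1, 0, 1],
                  'e': [0.12, 0.01, 0.45]})
C = pd.DataFrame({'a': ['x', 'x', 'x'],
                  'b': ['y', 'z', 'z'],
                  'c': [1, 0, 0],
                  'd': [1, 0, 1],
                  'e': [0.06, 0.65, 0.2]})

keys = ['a', 'b', 'c', 'd']

# Left join keeps every row of A; the overlapping 'e' columns
# become e_A (from A) and e_B (from B).
out = A.merge(B, how='left', on=keys, suffixes=('_A', '_B'),
              validate='one_to_one')

# Rename C's 'e' up front so no rename is needed after the merge.
out = out.merge(C.rename(columns={'e': 'e_C'}), how='left', on=keys,
                validate='one_to_one')

print(out)
```

Rows of A without a partner in B or C get NaN in e_B or e_C rather than being dropped, because how='left' always keeps the left frame's rows.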

CodePudding user response:

Concatenate the common key columns into a single string column, as shown in the code below.

    A['pKey'] = A.apply(lambda row: row['a'] + "_" + row['b'] + "_" + str(row['c']) + "_" + str(row['d']), axis=1)
    B['pKey'] = B.apply(lambda row: row['a'] + "_" + row['b'] + "_" + str(row['c']) + "_" + str(row['d']), axis=1)
    C['pKey'] = C.apply(lambda row: row['a'] + "_" + row['b'] + "_" + str(row['c']) + "_" + str(row['d']), axis=1)

Then combine the tables using this new column:

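# Take only pKey and e from B and C, so a, b, c and d are not duplicated in the result.
merge_ab = A.merge(B[['pKey', 'e']], on='pKey', how='left', suffixes=('_A', '_B'))
merge_abc = merge_ab.merge(C[['pKey', 'e']].rename(columns={'e': 'e_C'}), on='pKey', how='left')

Now drop the helper column, e.g. with merge_abc.drop(columns='pKey'), which leaves: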

    a   b   c   d   e_A     e_B     e_C
0   x   y   0   1   0.99    0.12    NaN
1   x   y   1   1   0.43    NaN     0.06
2   x   z   0   0   0.90    0.01    0.65
3   y   z   0   1   0.11    0.45    NaN
4   x   z   0   1   0.78    NaN     0.20
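The same string-key idea can be sketched without a row-wise apply by concatenating whole columns. The helper name add_key is hypothetical, and the sketch assumes the key values never contain the "_" separator; if they could, two different key tuples might collapse into the same string, and merging on the list of columns directly would be the safer choice.

```python
import pandas as pd

A = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'x'],
                  'b': ['y', 'y', 'z', 'z', 'z'],
                  'c': [0, 1, 0, 0, 0],
                  'd': [1, 1, 0, 1, 1],
                  'e': [0.99, 0.43, 0.9, 0.11, 0.78]})
B = pd.DataFrame({'a': ['x', 'x', 'y'],
                  'b': ['y', 'z', 'z'],
                  'c': [0, 0, 0],
                  'd': [1, 0, 1],
                  'e': [0.12, 0.01, 0.45]})
C = pd.DataFrame({'a': ['x', 'x', 'x'],
                  'b': ['y', 'z', 'z'],
                  'c': [1, 0, 0],
                  'd': [1, 0, 1],
                  'e': [0.06, 0.65, 0.2]})


def add_key(df):
    """Return a copy of df with a string column pKey built from a, b, c, d.

    Hypothetical helper for illustration; works column-wise, not row by row.
    """
    out = df.copy()
    out['pKey'] = (out['a'] + '_' + out['b'] + '_'
                   + out['c'].astype(str) + '_' + out['d'].astype(str))
    return out


A_k, B_k, C_k = add_key(A), add_key(B), add_key(C)

merged = (
    A_k.rename(columns={'e': 'e_A'})
    # Bring over only the key and the value column from B and C.
    .merge(B_k[['pKey', 'e']].rename(columns={'e': 'e_B'}), on='pKey', how='left')
    .merge(C_k[['pKey', 'e']].rename(columns={'e': 'e_C'}), on='pKey', how='left')
    # The helper column is no longer needed once the frames are paired.
    .drop(columns='pKey')
)

print(merged)
```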

CodePudding user response:

You can use a dictionary to hold your dataframes and use functools.reduce:

dfs = {'A': A, 'B': B, 'C': C}

from functools import reduce

out = reduce(lambda a, b: a.merge(b, how='left', on=["a", "b", "c", "d"]),
             [d.rename(columns={'e': f'e_{k}'}) for k,d in dfs.items()])

Or, if you have non-merge columns other than "e":

dfs = {'A': A, 'B': B, 'C': C}

from functools import reduce

out = (reduce(lambda a, b: a.join(b, how='left'),
              [d.set_index(["a", "b", "c", "d"]).add_suffix(f'_{k}') 
               for k,d in dfs.items()])
       .reset_index()
      )

output:

   a  b  c  d   e_A   e_B   e_C
0  x  y  0  1  0.99  0.12   NaN
1  x  y  1  1  0.43   NaN  0.06
2  x  z  0  0  0.90  0.01  0.65
3  y  z  0  1  0.11  0.45   NaN
4  x  z  0  1  0.78   NaN  0.20
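To confirm that the reduce approach reproduces the table requested in the question, here is a small sketch that wraps it in a function and compares the result against the expected frame. The names merge_all and expected are hypothetical and only used for illustration; the sketch assumes the first entry of the dictionary is the full dataset A, since a left join keeps only the rows of its left operand.

```python
from functools import reduce

import numpy as np
import pandas as pd

A = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'x'],
                  'b': ['y', 'y', 'z', 'z', 'z'],
                  'c': [0, 1, 0, 0, 0],
                  'd': [1, 1, 0, 1, 1],
                  'e': [0.99, 0.43, 0.9, 0.11, 0.78]})
B = pd.DataFrame({'a': ['x', 'x', 'y'],
                  'b': ['y', 'z', 'z'],
                  'c': [0, 0, 0],
                  'd': [1, 0, 1],
                  'e': [0.12, 0.01, 0.45]})
C = pd.DataFrame({'a': ['x', 'x', 'x'],
                  'b': ['y', 'z', 'z'],
                  'c': [1, 0, 0],
                  'd': [1, 0, 1],
                  'e': [0.06, 0.65, 0.2]})

keys = ['a', 'b', 'c', 'd']


def merge_all(dfs, keys):
    """Left-merge the frames in dfs on keys, suffixing non-key columns.

    Hypothetical helper: dfs maps a label to a frame, and the first
    frame is assumed to contain every key combination.
    """
    renamed = []
    for label, df in dfs.items():
        # Suffix every column that is not part of the composite key.
        mapping = {col: f'{col}_{label}' for col in df.columns if col not in keys}
        renamed.append(df.rename(columns=mapping))
    return reduce(lambda left, right: left.merge(right, how='left', on=keys),
                  renamed)


out = merge_all({'A': A, 'B': B, 'C': C}, keys)

# The table the question asks for, written out by hand.
expected = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'x'],
                         'b': ['y', 'y', 'z', 'z', 'z'],
                         'c': [0, 1, 0, 0, 0],
                         'd': [1, 1, 0, 1, 1],
                         'e_A': [0.99, 0.43, 0.9, 0.11, 0.78],
                         'e_B': [0.12, np.nan, 0.01, 0.45, np.nan],
                         'e_C': [np.nan, 0.06, 0.65, np.nan, 0.2]})

# Raises AssertionError if the two frames differ.
pd.testing.assert_frame_equal(out, expected)
```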