Home > Software design >  Joining dataframes based on the comparison of two columns
Joining dataframes based on the comparison of two columns

Time:12-14

my starting point looks like this

import pandas as pd

dx = {
    'bezeichnung': ['Erlöse' , 'Aufwand' , 'Material_19' , 'Gewinn'] ,
    '2020': ['1001' , '900' , '90' , '11']
}

dx = {
    'bezeichnung': ['Aufwand' , 'Erlöse' , 'Material_16' , 'Gewinn'] ,
    '2019': ['1900' , '2001' , '80' , '21']
}

df1 = pd.DataFrame(dx)
df2 = pd.DataFrame(dy)

I want basically the following:

  • Compare the columns named 'bezeichnung'. If the elements in both columns are equal to another, add the respective value of '2019' in a new column '2019' which should be either added to df1 or a new df3.
  • If an element of 'bezeichnung' in df2 is not found in df1 add the element at the end of column 'bezeichnung' in df1 and put the corresponding value in df2 '2019' to the added column '2019' (see above).
  • It is important that the order within column 'bezeichnung' in df1 is maintained.

The result should look like this:

df1 = pd.DataFrame('bezeichnung': ['Erlöse' , 'Aufwand' , 'Material_19' , 'Gewinn', 'Material_16'] ,
    '2020': ['1001' , '900' , '90' , '11', '0'] ,
    '2019': ['2001' , '1900' , '0' , '21', '80'])

Thank you very much!

CodePudding user response:

You can use outer merge which combines all the rows for left and right dataframes with NaN when there are no matched values in the rows. In your case, I change that to string '0'. Change indicator=True to see the difference between the two dataframes in an additional column names _merge

import pandas as pd
import numpy as np

dx = {
    'bezeichnung': ['Erlöse' , 'Aufwand' , 'Material_19' , 'Gewinn'] ,
    '2020': ['1001' , '900' , '90' , '11']
}

dy = {
    'bezeichnung': ['Aufwand' , 'Erlöse' , 'Material_16' , 'Gewinn'] ,
    '2019': ['1900' , '2001' , '80' , '21']
}

# your expected results
df = pd.DataFrame({'bezeichnung': ['Erlöse' , 'Aufwand' , 'Material_19' , 'Gewinn', 'Material_16'] ,
    '2020': ['1001' , '900' , '90' , '11', '0'] ,
    '2019': ['2001' , '1900' , '0' , '21', '80']})

df1 = pd.DataFrame(dx)
df2 = pd.DataFrame(dy)
col = 'bezeichnung'
temp = pd.merge(df1, df2, on=col, how='outer', indicator=False)
temp = temp.replace(np.nan, '0')
# you can check if two dfs are equal
if temp.equals(df):
    print('Two dfs are equal.\n')
print(temp)

output:

Two dfs are equal.

   bezeichnung  2020  2019
0       Erlöse  1001  2001
1      Aufwand   900  1900
2  Material_19    90     0
3       Gewinn    11    21
4  Material_16     0    80

  • Related