Home > OS >  Join two dataframes on multiple conditions in python
Join two dataframes on multiple conditions in python

Time:08-29

I have the following problem: i am trying to join df1 = ['ID, 'Earnings', 'WC, 'Year'] and df2 = ['ID', 'F1_Earnings', 'df2_year']. So for example: the 'F1_Earnings' of a particular company, e.g. with ID = 1 and year = 1996, in df2 (aka. the Forward Earnings) should get joined on df1 in a way that they show up in df1 under ID = 1 and year = 1995. I have no clue how to specify a join on two conditions, of course they need to join on "ID", but how do I add a second condition which specifies that they also join on "df1_year = df2_year - 1"?

d1 = {'ID': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], 'Earnings': [100, 200, 400, 250, 300, 350, 400, 550, 700, 259, 300, 350], 'WC': [20, 40, 35, 55, 60, 65, 30, 28, 32, 45, 60, 52], 'Year': [1995, 1996, 1997, 1996, 1997, 1998, 1995, 1997, 1998, 1996, 1997, 1998]}

df1 = pd.DataFrame(data=d1)

d2 = {'ID': [1, 2, 3, 4], 'F1_Earnings': [120, 220, 420, 280], 'WC': [23, 37, 40, 52], 'Year': [1996, 1997, 1998, 1999]}

df2 = pd.DataFrame(data=d2)

I did the following, but I guess there miust be a smarter way? I am afraid it wont work for larger datasets...:

    df3 = pd.merge(df1, df2, how='left', on = 'ID')
    df3.loc[df3['Year_x'] == df3['Year_y'] - 1]

CodePudding user response:

You can use a Series as key in merge:

df1.merge(df2, how='left',
          left_on=['ID', 'Year'],
          right_on=['ID', df2['Year'].sub(1)])

output:

    ID  Year  Earnings  WC_x  Year_x  F1_Earnings  WC_y  Year_y
0    1  1995       100    20    1995        120.0  23.0  1996.0
1    1  1996       200    40    1996          NaN   NaN     NaN
2    1  1997       400    35    1997          NaN   NaN     NaN
3    2  1996       250    55    1996        220.0  37.0  1997.0
4    2  1997       300    60    1997          NaN   NaN     NaN
5    2  1998       350    65    1998          NaN   NaN     NaN
6    3  1995       400    30    1995          NaN   NaN     NaN
7    3  1997       550    28    1997        420.0  40.0  1998.0
8    3  1998       700    32    1998          NaN   NaN     NaN
9    4  1996       259    45    1996          NaN   NaN     NaN
10   4  1997       300    60    1997          NaN   NaN     NaN
11   4  1998       350    52    1998        280.0  52.0  1999.0

Or change the Year to Year-1, before the merge:

df1.merge(df2.assign(Year=df2['Year'].sub(1)),
          how='left', on=['ID', 'Year'])

output:

    ID  Earnings  WC_x  Year  F1_Earnings  WC_y
0    1       100    20  1995        120.0  23.0
1    1       200    40  1996          NaN   NaN
2    1       400    35  1997          NaN   NaN
3    2       250    55  1996        220.0  37.0
4    2       300    60  1997          NaN   NaN
5    2       350    65  1998          NaN   NaN
6    3       400    30  1995          NaN   NaN
7    3       550    28  1997        420.0  40.0
8    3       700    32  1998          NaN   NaN
9    4       259    45  1996          NaN   NaN
10   4       300    60  1997          NaN   NaN
11   4       350    52  1998        280.0  52.0
  • Related