Home > other >  Merge data between dataframes where the column to merge is row dependent
Merge data between dataframes where the column to merge is row dependent

Time:09-29

I've got two dataframes and need to merge some data from the second one into the first but the column to pull data from changes from row to row.

df1 contains columns 'IDs' 'dates' 'choices'

  ID       dates  choices
0  1  2022-07-05  choice1
1  2  2022-07-05  choice2
2  3  2022-07-05  choice2
3  4  2022-01-01  choice2
4  5  2022-07-10  choice7


df1 = pd.DataFrame({'ID': ['1', '2', '3','4','5'],
                    'dates': ['2022-07-05', '2022-07-05', '2022-07-05','2022-01-01','2022-07-10'],
                    'choices': ['choice1', 'choice2', 'choice2','choice2','choice7']})

df2 contains columns 'dates' 'value of choice1' 'value of choice2' 'value of choice3' etc

        dates choice1 choice2 choice3 choice7
0  2022-01-01   122.4      34      20      33
1  2022-07-05   111.0   202.1    31.1    31.1
2  2022-07-10   115.6  1000.2    34.2    51.4


df2 = pd.DataFrame({'dates': ['2022-01-01', '2022-07-05', '2022-07-10'],
                    'choice1': ['122.4', '111.0', '115.6'],
                    'choice2': ['34', '202.1', '1000.2'],
                    'choice3': ['20', '31.1', '34.2'],
                    'choice7': ['33', '31.1', '51.4']})

I need to create a column in df1 that contains the value of the choice in which df1'dates'==df2'dates'

eg, at the end df1 would look like this

  ID       dates  choices values
0  1  2022-07-05  choice1  111.0
1  2  2022-07-05  choice2  202.1
2  3  2022-07-05  choice2  202.1
3  4  2022-01-01  choice2   34.0
4  5  2022-07-10  choice7   51.4


df1_final = pd.DataFrame({'ID': ['1', '2', '3','4','5'],
                          'dates': ['2022-07-05', '2022-07-05', '2022-07-05','2022-01-01','2022-07-10'],
                          'choices': ['choice1', 'choice2', 'choice2','choice2','choice7'],
                          'values': ['111.0', '202.1', '202.1','34.0','51.4']})

Currently I can merge them with a series of masks for each of the choices, Im hoping to find a cleaner/more efficient way to do this (there are a lot of rows and more than 10 choices)

edit: used a different method to display df1 output, added in df1 and df2 tables/code

CodePudding user response:

you can use melt to unpivot your df2 and then join with df1:

df2 = df2.melt(id_vars='dates', var_name='choices')
out = df1.merge(df2 , on=['dates','choices'])

output:

>>
  ID       dates  choices  value
0  1  2022-07-05  choice1  111.0
1  2  2022-07-05  choice2  202.1
2  3  2022-07-05  choice2  202.1
3  4  2022-01-01  choice2     34
4  5  2022-07-10  choice7   51.4
  • Related