Home > Enterprise >  how to merge two dataframe with different times and sizes
how to merge two dataframe with different times and sizes

Time:02-04

I am trying to merge these two dataframe together and preserve all the rows and columns. They have different times under the column 'time', so i want them to merge in a way that is time sequential.

df1:

time run_id weight
0 H1 500
24 H1 400
48 H1 300
0 H2 900
24 H2 800
48 H2 700

df2:

time run_id totalizer
0.5 H1 100
10 H1 200
40 H1 300
60 H1 400
0.5 H2 900
5 H2 1000
35 H2 1100
70 H2 1200

How do I merge these two tables into:

time run_id weight totalizer
0 H1 500
0.5 H1 100
10 H1 200
24 H1 400
40 H1 300
48 H1 300
60 H1 400
0 H2 900
0.5 H2 900
5 H2 1000
24 H2 800
35 H2 1100
48 H2 700
70 H2 1200

I tried:

mergedf = df1.merge(df2, how='outer')

but it stacked df1 on top of df2.

CodePudding user response:

One option is to use combine_first :

cols = ["run_id", "time"]
​
out = (
        df1.set_index(cols)
              .combine_first(df2.set_index(cols))
              .reset_index().sort_values(by=cols)
           [["time", "run_id", "weight", "totalizer"]]
       )

​Output :

print(out)

    time run_id  weight  totalizer
0    0.0     H1   500.0        NaN
1    0.5     H1     NaN      100.0
2   10.0     H1     NaN      200.0
3   24.0     H1   400.0        NaN
4   40.0     H1     NaN      300.0
5   48.0     H1   300.0        NaN
6   60.0     H1     NaN      400.0
7    0.0     H2   900.0        NaN
8    0.5     H2     NaN      900.0
9    5.0     H2     NaN     1000.0
10  24.0     H2   800.0        NaN
11  35.0     H2     NaN     1100.0
12  48.0     H2   700.0        NaN
13  70.0     H2     NaN     1200.0

CodePudding user response:

You could simply add line after what you have already done:

mergedf = df1.merge(df2, how='outer') # your current code
mergedf.sort_values(['run_id', 'time']) # add this

Read more here: https://stackoverflow.com/a/17141755/2650341

CodePudding user response:

You can use panda's merge_ordered

df_merged=pd.merge_ordered(df1,df2, on=['run_id','time'])
  • Related