Home > Back-end >  Merging two dataframes on timestamp while preserving all data
Merging two dataframes on timestamp while preserving all data

Time:05-20

I want to merge two dataframes to create a single time-series with two variables.

I have a function that does this by iterating over each dataframe using itterows()... which is terribly slow and doesn't take advantage of the vectorization that pandas and numpy provide...

Would you be able to help?

This code illustrates what I am trying to do:

a = pd.DataFrame(data={'timestamp':[1,2,5,6,10],'x':[2,6,3,4,2]})
b = pd.DataFrame(data={'timestamp':[2,3,4,10],'y':[3,1,2,1]})

#z = Magical line of code/function call here
#z output: {'timestamp':[1,2,3,4,5,6,10],'x':[2,6,6,6,3,4,2], 'y':[NaN,3,1,2,2,2,1] }

CodePudding user response:

This can be broken down into 2 steps:

  • The first step is the equivalent of an outer join in SQL, where create a table containing keys of both source tables. This is done with merge(..., how="outer")
  • The second is filling the NaN with the previous non-NaN values, which can done with ffill
z = a.merge(b, on="timestamp", how="outer").sort_values("timestamp").ffill()
  • Related