Home > Mobile >  How to search for time series data across two data frames
How to search for time series data across two data frames

Time:03-12

I have two pandas data frames df1 and df2 like the following:

df1 containing all the data of type a in increasing time order:

   type   Date
0  a      1970-01-01
1  a      2008-08-01
2  a      2009-07-24
3  a      2010-09-30
4  a      2011-09-29
5  a      2013-06-11
6  a      2013-12-17
7  a      2015-06-02
8  a      2016-06-14
9  a      2017-06-21
10 a      2018-11-26
11 a      2019-06-03
12 a      2019-12-16

df2 containing all the data of type b in increasing time order:

   type   Date
0  b      2017-11-29
1  b      2018-05-30
2  b      2018-11-26
3  b      2019-06-03
4  b      2019-12-16
5  b      2020-06-18
6  b      2020-12-17
7  b      2021-06-28

A type a entry and a type b entry are determined as matching if the date difference between them is within one year. One type a entry can only match with one other type b entry, and vice versa. Time efficiently, how can I find the maximum amount of matching pairs in increasing time order like the following?

   type1  Date1        type2   Date2
0  a      2017-06-21   b       2017-11-29
1  a      2018-11-26   b       2018-05-30
2  a      2019-06-03   b       2018-11-26
3  a      2019-12-16   b       2019-06-03

CodePudding user response:

Use merge_asof:

df3 = pd.merge_asof(df1.rename(columns={'Date':'Date2', 'type':'type1'}), 
                    df2.rename(columns={'Date':'Date1', 'type':'type2'}), 
                    left_on='Date2', 
                    right_on='Date1', 
                    direction='nearest', 
                    allow_exact_matches=False,
                    tolerance=pd.Timedelta('365 days')).dropna(subset=['Date1'])
print (df3)
   type1      Date2 type2      Date1
9      a 2017-06-21     b 2017-11-29
10     a 2018-11-26     b 2018-05-30
11     a 2019-06-03     b 2018-11-26
12     a 2019-12-16     b 2020-06-18
  • Related