Home > Software design >  Compare and match range of timestamps in pandas two different dataframes
Compare and match range of timestamps in pandas two different dataframes

Time:11-28

How to compare and match beginning and end of two ranges of timestamps in two different dataframes, when the frequency of timestamps varies, and it is not known which range starts earlies and finishes later. Then discard unmatched beginning and end, so the two ranges are the same. Easy to do it manually in a txt file, how to do it in python and pandas dataframes?

Sample first dataframe:

                         0                          1
0      2022-10-30 14:11:57
1      2022-10-30 14:11:57
2      2022-10-30 14:11:57
3      2022-10-30 14:11:58
4      2022-10-30 14:11:59
                   ...                        ...
149801 2022-10-30 15:22:11
149802 2022-10-30 15:22:11
149803 2022-10-30 15:22:11
149804 2022-10-30 15:22:11
149805 2022-10-30 15:22:11

\[149806 rows x 2 columns\]

Sample second dataframe:

                        0                          1
0     2022-10-30 14:11:59
1     2022-10-30 14:11:59
2     2022-10-30 14:12:00
3     2022-10-30 14:12:00
4     2022-10-30 14:12:00
                  ...                        ...
21065 2022-10-30 15:22:11
21066 2022-10-30 15:22:11
21067 2022-10-30 15:22:12
21068 2022-10-30 15:22:13
21069 2022-10-30 15:22:13

Column 1 filled with data

Comparing two timestamps in a specific row would look like:

if first_df[0].iloc[0] == second_df[0].iloc[0]:
    print('hit')
else:
    print('miss')

How to do it over full range, so it would be possible to discard unmatched beginning and end while preserving what's inside?

Sample match of those two ranges: First dataframe:

                         0                          1
4      2022-10-30 14:11:59
                   ...                        ...
149801 2022-10-30 15:22:11
149802 2022-10-30 15:22:11
149803 2022-10-30 15:22:11
149804 2022-10-30 15:22:11
149805 2022-10-30 15:22:11

Second dataframe:

                        0                          1
0     2022-10-30 14:11:59
1     2022-10-30 14:11:59
2     2022-10-30 14:12:00
3     2022-10-30 14:12:00
4     2022-10-30 14:12:00
                  ...                        ...
21065 2022-10-30 15:22:11
21066 2022-10-30 15:22:11

Edit:

Consider this code (note that frequency of timestamps in each dataframe is different):

import pandas as pd
from datetime import datetime

df1 = pd.DataFrame({'val_1' : [10,11,12,13,14,15]}, 
                   index = [pd.DatetimeIndex([datetime.strptime(s, '%Y-%m-%d %H:%M:%S')])[0] 
                            for s in ['2022-11-12 09:03:59',
                                      '2022-11-12 09:03:59',
                                      '2022-11-12 09:03:59',
                                      '2022-11-12 09:04:00',
                                      '2022-11-12 09:04:01',
                                      '2022-11-12 09:04:02' 
                                      ] ])

df2 = pd.DataFrame({'val_2': [11,22,33,44]},
                   index = [pd.DatetimeIndex([datetime.strptime(s, '%Y-%m-%d %H:%M:%S')])[0] 
                            for s in ['2022-11-12 09:03:58',
                                      '2022-11-12 09:03:59',
                                      '2022-11-12 09:03:59',
                                      '2022-11-12 09:04:00',
                                      ] ])

What I would like as result is this:

                     val_1  val_2
2022-11-12 09:03:59     10    NaN
2022-11-12 09:03:59     11     22
2022-11-12 09:03:59     12     33
2022-11-12 09:04:00     13     44

or: df1:

2022-11-12 09:03:59     10
2022-11-12 09:03:59     11
2022-11-12 09:03:59     12
2022-11-12 09:04:00     13

and df2:

2022-11-12 09:03:59     22
2022-11-12 09:03:59     33
2022-11-12 09:04:00     44

Tried both join and merge with probably every combination of options and can't do that.

CodePudding user response:

New answer on the new example data:

The problem with merging here is that you have duplicated index Dates, so there can't be unambigous assignment done.

But you could do it seperately as you suggested in the beginning. You said you don't know which of both df's have start earlier or end later. Find the min value of both indexes and get the max value of these two. Same for the upper bound, get both max values and take the min value of these two values. Then you slice your df's with the lower and upper bound.

lower, upper = max(df1.index.min(), df2.index.min()), min(df1.index.max(), df2.index.max())

df1 = df1.loc[lower:upper]
print(df1)

                     val_1
2022-11-12 09:03:59     10
2022-11-12 09:03:59     11
2022-11-12 09:03:59     12
2022-11-12 09:04:00     13

df2 = df2.loc[lower:upper]
print(df2)

                     val_2
2022-11-12 09:03:59     22
2022-11-12 09:03:59     33
2022-11-12 09:04:00     44

OLD:
Since you didn't provide usable data, here my own example input data:

np.random.seed(42)
df1 = pd.DataFrame(
    {
        'A' : np.random.randint(0,10, size=10)
    },
    index= pd.date_range('2022-11-26 08:00', periods=10, freq='10T')
)

df2 = pd.DataFrame(
    {
        'B' : np.random.randint(0,10, size=10)
    },
    index= pd.date_range('2022-11-26 08:30', periods=10, freq='10T')
)

which creates this data:

#df1
                     A
2022-11-26 08:00:00  6
2022-11-26 08:10:00  3
2022-11-26 08:20:00  7
2022-11-26 08:30:00  4
2022-11-26 08:40:00  6
2022-11-26 08:50:00  9
2022-11-26 09:00:00  2
2022-11-26 09:10:00  6
2022-11-26 09:20:00  7
2022-11-26 09:30:00  4

#df2
                     B
2022-11-26 08:30:00  3
2022-11-26 08:40:00  7
2022-11-26 08:50:00  7
2022-11-26 09:00:00  2
2022-11-26 09:10:00  5
2022-11-26 09:20:00  4
2022-11-26 09:30:00  1
2022-11-26 09:40:00  7
2022-11-26 09:50:00  5
2022-11-26 10:00:00  1

I think a decent approach still would be to merge the data to find out the edges that are off. Just a offer, if you leave them merged, you could compare them directly like this:

combined = df1.merge(df2, how='inner', left_index=True, right_index=True)
combined['compare'] = np.where(combined['A']==combined['B'], 'hit', 'miss')
print(combined)

Output of combined:

                     A  B compare
2022-11-26 08:30:00  4  3    miss
2022-11-26 08:40:00  6  7    miss
2022-11-26 08:50:00  9  7    miss
2022-11-26 09:00:00  2  2     hit
2022-11-26 09:10:00  6  5    miss
2022-11-26 09:20:00  7  4    miss
2022-11-26 09:30:00  4  1    miss

If you really need them to stay seperated, just add:

df1_new = combined[['A']]
df2_new = combined[['B']]
  • Related