Pandas: compare two DataFrames on a datetime column and return a flag if the timestamps fall within the same minute

I have two very large DataFrames (several million rows each), and both have a timestamp column.
df_a is bigger than df_b.
df_b has been resampled by minute, so it has one timestamp for every minute of the period considered.
For each row of df_a, I need to find the corresponding timestamp in df_b. The two are not necessarily exactly equal; I want to match timestamps that fall within the same minute.

Example:
df_a might have a timestamp like:
2022-04-04 00:00:22
while df_b will have:
2022-04-04 00:00:00
In a case like this I consider the two to correspond, and the flag should be returned.
What I tried was:

for index, row in df_a.iterrows():
    if row['created_at'] in df_b.created_at:
        #do something

But since the timestamps are not exactly equal because of the seconds, it finds nothing.
Is there another way to do it? I do not want to resample df_a, because it has several rows per minute and I don't want to aggregate them.

EDIT: for clarity, here are samples of the DataFrames:
df_a (anomaly is all 0):

                 created_at  anomaly
830589  2022-05-12 00:00:13        0
830590  2022-05-12 00:00:16        0
830591  2022-05-12 00:00:18        0
830592  2022-05-12 00:00:18        0
830593  2022-05-12 00:00:30        0
830594  2022-05-12 00:00:30        0
830595  2022-05-12 00:00:32        0
830596  2022-05-12 00:00:32        0
830597  2022-05-12 00:00:33        0
830598  2022-05-12 00:00:33        0
830599  2022-05-12 00:00:43        0
830600  2022-05-12 00:00:43        0
830601  2022-05-12 00:00:43        0
830602  2022-05-12 00:00:43        0
830603  2022-05-12 00:00:43        0

df_b:

            created_at  anomaly
0  2022-05-12 00:00:00    False
1  2022-05-12 00:01:00    False
2  2022-05-12 00:02:00    False
3  2022-05-12 00:03:00    False
4  2022-05-12 00:04:00    False
5  2022-05-12 00:05:00    False
6  2022-05-12 00:06:00    False
7  2022-05-12 00:07:00     True

I want to set the correct value of anomaly in df_a, taking it from df_b.
Both DataFrames actually have many more columns, which is why I don't want to merge them.

CodePudding user response：

Sort both DataFrames by the time column and use merge_asof:

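import pandas as pd
# merge_asof requires both frames sorted on the key; the result holds anomaly_x (from df_a) and anomaly_y (from df_b)
pd.merge_asof(df_a.sort_values('created_at'), df_b.sort_values('created_at'), on='created_at', tolerance=pd.Timedelta(minutes=1))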

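Pay attention to the "direction" parameter: the default, 'backward', matches each df_a row to the last df_b timestamp at or before it, which is what you want when df_b holds the start of each minute.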
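
Since the question mentions that both frames have many columns, here is a minimal sketch of how the as-of merge could bring over only the flag. The tiny frames and the names lookup, anomaly_b and matched are made up for illustration; only created_at and anomaly come from the question.

```python
import pandas as pd

# Hypothetical miniature versions of df_a and df_b from the question.
df_a = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2022-05-12 00:00:13",
        "2022-05-12 00:00:43",
        "2022-05-12 00:07:05",
    ]),
    "anomaly": [0, 0, 0],
})
df_b = pd.DataFrame({
    "created_at": pd.to_datetime([f"2022-05-12 00:0{m}:00" for m in range(8)]),
    "anomaly": [False] * 7 + [True],
})

# Keep only the key and the flag from df_b so none of its other columns
# end up in the result; rename the flag to avoid _x/_y suffixes.
lookup = df_b[["created_at", "anomaly"]].rename(columns={"anomaly": "anomaly_b"})

matched = pd.merge_asof(
    df_a.sort_values("created_at"),
    lookup.sort_values("created_at"),
    on="created_at",
    direction="backward",               # latest df_b row at or before each df_a row
    tolerance=pd.Timedelta(minutes=1),  # ignore df_b rows further back than this
)

# Unmatched rows hold NaN in anomaly_b; eq(True) turns those into False.
matched["anomaly"] = matched["anomaly_b"].eq(True)
matched = matched.drop(columns="anomaly_b")
print(matched)
```

The result has one row per df_a row, in sorted order, so it can be assigned back or used in place of df_a if the original frame was already sorted by created_at.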

CodePudding user response：

import io
import pandas as pd

csv_1 = """created_at,anomaly
2022-05-12 00:00:13,0
2022-05-12 00:00:16,0
2022-05-12 00:00:18,0
2022-05-12 00:00:18,0
2022-05-12 00:00:30,0
2022-05-12 00:00:30,0
2022-05-12 00:00:32,0
2022-05-12 00:00:32,0
2022-05-12 00:00:33,0
2022-05-12 00:00:33,0
2022-05-12 00:00:43,0
2022-05-12 00:00:43,0
2022-05-12 00:00:43,0
2022-05-12 00:00:43,0
2022-05-12 00:00:43,0
"""

csv_2 = """created_at,anomaly
2022-05-12 00:00:00,False
2022-05-12 00:01:00,False
2022-05-12 00:02:00,False
2022-05-12 00:03:00,False
2022-05-12 00:04:00,False
2022-05-12 00:05:00,False
2022-05-12 00:06:00,False
2022-05-12 00:07:00,True
"""



df_1 = pd.read_csv(io.StringIO(csv_1), parse_dates=['created_at'])
df_2 = pd.read_csv(io.StringIO(csv_2), parse_dates=['created_at'])

df_1.set_index(df_1.created_at.dt.strftime('%Y%m%d%H%M'), inplace=True)
df_2.set_index(df_2.created_at.dt.strftime('%Y%m%d%H%M'), inplace=True)

df_1.anomaly = df_2.anomaly

df_1.reset_index(drop=True, inplace=True)
df_2.reset_index(drop=True, inplace=True)

print(df_1)


            created_at  anomaly
0  2022-05-12 00:00:13    False
1  2022-05-12 00:00:16    False
2  2022-05-12 00:00:18    False
3  2022-05-12 00:00:18    False
4  2022-05-12 00:00:30    False
5  2022-05-12 00:00:30    False
6  2022-05-12 00:00:32    False
7  2022-05-12 00:00:32    False
8  2022-05-12 00:00:33    False
9  2022-05-12 00:00:33    False
10 2022-05-12 00:00:43    False
11 2022-05-12 00:00:43    False
12 2022-05-12 00:00:43    False
13 2022-05-12 00:00:43    False
14 2022-05-12 00:00:43    False

If a minute has no corresponding row in df_2, the anomaly values for those rows become NaN.

fillna returns a new Series, so assign the result back to replace them:

df_1.anomaly = df_1.anomaly.fillna(0)
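
A variant of the same idea that leaves the index alone, offered as a sketch rather than as part of the answer above: truncate each timestamp to its minute with dt.floor and look the flag up with map. This swaps the strftime string key for a floored timestamp. The sample rows and the names flag_by_minute and minute are made up for illustration, and the sketch assumes df_2 has exactly one row per minute.

```python
import pandas as pd

# Hypothetical miniature data; the last df_1 row has no matching minute in df_2.
df_1 = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2022-05-12 00:00:13",
        "2022-05-12 00:00:43",
        "2022-05-12 00:07:05",
        "2022-05-12 00:09:30",
    ]),
    "anomaly": [0, 0, 0, 0],
})
df_2 = pd.DataFrame({
    "created_at": pd.to_datetime([f"2022-05-12 00:0{m}:00" for m in range(8)]),
    "anomaly": [False] * 7 + [True],
})

# Lookup table: minute timestamp -> anomaly flag (index must be unique).
flag_by_minute = df_2.set_index("created_at")["anomaly"]

# Truncate each df_1 timestamp to the start of its minute, then look it up.
minute = df_1["created_at"].dt.floor("min")

# Minutes missing from df_2 map to NaN; eq(True) turns those into False.
df_1["anomaly"] = minute.map(flag_by_minute).eq(True)
print(df_1)
```

Because nothing is merged, the other columns of both frames are untouched and df_1 keeps its original row order and index.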