I have two very large dataframes (several million rows each), and both have a timestamp column.
df_a is bigger than df_b.
df_b has been resampled by minute, so it has one timestamp for every minute of the period considered.
What I need to do is take df_a and, for each row, find the corresponding timestamp in df_b. The timestamps are not necessarily exactly equal, though: I want to find the df_b timestamp that falls within the same minute.
Example:
df_a might have a timestamp like:
2022-04-04 00:00:22
while df_b will have:
2022-04-04 00:00:00
In a case like this I would consider the two timestamps to correspond, and I want the flag from df_b to be returned.
What I tried was:
for index, row in df_a.iterrows():
    if row['created_at'] in df_b.created_at:
        pass  # do something
but obviously the timestamps are not exactly equal because of the seconds, so it finds nothing.
Is there another way to do it? I do not want to resample df_a, because it has several rows per minute and I don't want to aggregate them.
EDIT: for clarity, here are samples of the dataframes:
df_a (anomaly is all 0):
created_at anomaly
830589 2022-05-12 00:00:13 0
830590 2022-05-12 00:00:16 0
830591 2022-05-12 00:00:18 0
830592 2022-05-12 00:00:18 0
830593 2022-05-12 00:00:30 0
830594 2022-05-12 00:00:30 0
830595 2022-05-12 00:00:32 0
830596 2022-05-12 00:00:32 0
830597 2022-05-12 00:00:33 0
830598 2022-05-12 00:00:33 0
830599 2022-05-12 00:00:43 0
830600 2022-05-12 00:00:43 0
830601 2022-05-12 00:00:43 0
830602 2022-05-12 00:00:43 0
830603 2022-05-12 00:00:43 0
df_b:
created_at anomaly
0 2022-05-12 00:00:00 False
1 2022-05-12 00:01:00 False
2 2022-05-12 00:02:00 False
3 2022-05-12 00:03:00 False
4 2022-05-12 00:04:00 False
5 2022-05-12 00:05:00 False
6 2022-05-12 00:06:00 False
7 2022-05-12 00:07:00 True
I want to set the correct value for anomaly in df_a, taking the value from df_b.
Both dataframes actually have a lot of columns, which is why I don't want to merge them.
CodePudding user response:
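Sort both dataframes by the time column and use an as-of merge:
from datetime import timedelta
import pandas as pd
# merge_asof requires both frames to be sorted by the key column
df_a = df_a.sort_values('created_at')
df_b = df_b.sort_values('created_at')
pd.merge_asof(df_a, df_b, on='created_at', tolerance=timedelta(minutes=1))
You need to pay attention to the "direction" parameter: the default, 'backward', matches each df_a row to the last df_b timestamp at or before it.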
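To make the as-of merge concrete, here is a minimal sketch on a few rows. The first two df_a timestamps come from the question's sample; the 00:07:05 row is a hypothetical addition so that a True flag shows up. It assumes you only need the anomaly column from df_b, so the old anomaly column is dropped from df_a first to avoid the _x/_y suffixes.

```python
import pandas as pd

# Small stand-ins for the real frames.
df_a = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2022-05-12 00:00:13",
        "2022-05-12 00:00:43",
        "2022-05-12 00:07:05",  # hypothetical row, not in the question's sample
    ]),
    "anomaly": 0,
})
df_b = pd.DataFrame({
    "created_at": pd.to_datetime([f"2022-05-12 00:0{m}:00" for m in range(8)]),
    "anomaly": [False] * 7 + [True],
})

matched = pd.merge_asof(
    # drop the placeholder column so only df_b's anomaly ends up in the result
    df_a.drop(columns="anomaly").sort_values("created_at"),
    # take only the columns needed from df_b, not all of them
    df_b[["created_at", "anomaly"]].sort_values("created_at"),
    on="created_at",
    direction="backward",             # last df_b timestamp at or before each row
    tolerance=pd.Timedelta(minutes=1),
)
print(matched)
```

Selecting only created_at and anomaly from df_b keeps the result narrow even when df_b has many columns. Note that merge_asof returns a new frame with a fresh index, so if the original df_a index matters you would need to carry it along as a column.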
CodePudding user response:
import io
import pandas as pd
csv_1 = """created_at,anomaly
2022-05-12 00:00:13,0
2022-05-12 00:00:16,0
2022-05-12 00:00:18,0
2022-05-12 00:00:18,0
2022-05-12 00:00:30,0
2022-05-12 00:00:30,0
2022-05-12 00:00:32,0
2022-05-12 00:00:32,0
2022-05-12 00:00:33,0
2022-05-12 00:00:33,0
2022-05-12 00:00:43,0
2022-05-12 00:00:43,0
2022-05-12 00:00:43,0
2022-05-12 00:00:43,0
2022-05-12 00:00:43,0
"""
csv_2 = """created_at,anomaly
2022-05-12 00:00:00,False
2022-05-12 00:01:00,False
2022-05-12 00:02:00,False
2022-05-12 00:03:00,False
2022-05-12 00:04:00,False
2022-05-12 00:05:00,False
2022-05-12 00:06:00,False
2022-05-12 00:07:00,True
"""
df_1 = pd.read_csv(io.StringIO(csv_1), parse_dates=['created_at'])
df_2 = pd.read_csv(io.StringIO(csv_2), parse_dates=['created_at'])
df_1.set_index(df_1.created_at.dt.strftime('%Y%m%d%H%M'), inplace=True)
df_2.set_index(df_2.created_at.dt.strftime('%Y%m%d%H%M'), inplace=True)
df_1.anomaly = df_2.anomaly
df_1.reset_index(drop=True, inplace=True)
df_2.reset_index(drop=True, inplace=True)
print(df_1)
created_at anomaly
0 2022-05-12 00:00:13 False
1 2022-05-12 00:00:16 False
2 2022-05-12 00:00:18 False
3 2022-05-12 00:00:18 False
4 2022-05-12 00:00:30 False
5 2022-05-12 00:00:30 False
6 2022-05-12 00:00:32 False
7 2022-05-12 00:00:32 False
8 2022-05-12 00:00:33 False
9 2022-05-12 00:00:33 False
10 2022-05-12 00:00:43 False
11 2022-05-12 00:00:43 False
12 2022-05-12 00:00:43 False
13 2022-05-12 00:00:43 False
14 2022-05-12 00:00:43 False
If there is no matching minute in df_2, the value in the anomaly column becomes NaN.
You can replace it with (fillna returns a new Series, so assign it back):
df_1.anomaly = df_1.anomaly.fillna(0)
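A possible variation on the same idea, sketched here as an assumption rather than as part of the answer above: instead of building string keys with strftime, floor each timestamp to the minute and look it up in df_2 with reindex. This assumes the timestamps in df_2 are unique (one row per minute, as in the question). The last two df_1 rows are hypothetical and only there to show a True match and a minute with no match.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2022-05-12 00:00:13",
        "2022-05-12 00:07:41",  # hypothetical row, falls in the flagged minute
        "2022-05-12 00:09:02",  # hypothetical row, no matching minute in df_2
    ]),
    "anomaly": 0,
})
df_2 = pd.DataFrame({
    "created_at": pd.to_datetime([f"2022-05-12 00:0{m}:00" for m in range(8)]),
    "anomaly": [False] * 7 + [True],
})

# Lookup table: minute timestamp -> anomaly flag (index must be unique).
lookup = df_2.set_index("created_at")["anomaly"]

# Truncate each df_1 timestamp to the start of its minute.
minute = df_1["created_at"].dt.floor("min")

# reindex tolerates repeated minutes in the target; minutes missing from
# df_2 get fill_value instead of NaN, so the column stays boolean.
df_1["anomaly"] = lookup.reindex(minute, fill_value=False).to_numpy()
print(df_1)
```

Only the anomaly column of df_1 is overwritten; the row order, the index and all other columns stay as they were, and df_2 is not modified.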