I have a dataframe of elements with a start and end datetime. What is the best option to find intersections of the dates? My naive approach right now consists of two nested loops cross-comparing the elements, which obviously is super slow. What would be a better way to achieve that?
dict = {}
start = "start_time"
end = "end_time"
for index1, rowLoop1 in df[{start, end}].head(500).iterrows():
matches = []
dict[(index1, rowLoop1[start])] = 0
for index2, rowLoop2 in df[{start,end}].head(500).iterrows():
if index1 != index2:
if date_intersection(rowLoop1[start], rowLoop1[end], rowLoop2[start], rowLoop2[end]):
dict[(index1, rowLoop1[start])] = 1
Code for date_intersection:
def date_intersection(t1start, t1end, t2start, t2end):
if (t1start <= t2start <= t2end <= t1end): return True
elif (t1start <= t2start <= t1end):return True
elif (t1start <= t2end <= t1end):return True
elif (t2start <= t1start <= t1end <= t2end):return True
else: return False
Sample data:
id,start_date,end_date
41234132,2021-01-10 10:00:05,2021-01-10 10:30:27
64564512,2021-01-10 10:10:00,2021-01-11 10:28:00
21135765,2021-01-12 12:30:00,2021-01-12 12:38:00
87643252,2021-01-12 12:17:00,2021-01-12 12:42:00
87641234,2021-01-12 12:58:00,2021-01-12 13:17:00
CodePudding user response:
You can do something like merging your dataframe with itself to get the cartesian product and comparing columns.
df = df.merge(df, how='cross', suffixes=('','_2'))
df['date_intersection'] = (((df['start_date'].le(df['start_date_2']) & df['start_date_2'].le(df['end_date'])) | # start 2 within start/end
(df['start_date'].le(df['end_date_2']) & df['end_date_2'].le(df['end_date'])) | # end 2 within start/end
(df['start_date_2'].le(df['start_date']) & df['start_date'].le(df['end_date_2'])) | # start within start 2/end 2
(df['start_date_2'].le(df['end_date']) & df['end_date'].le(df['end_date_2']))) & # end within start 2/end 2
df['id'].ne(df['id_2'])) # id not compared to itself
and then to return the ids and if they have a date intersection...
df.groupby('id')['date_intersection'].any()
id
21135765 True
41234132 True
64564512 True
87641234 False
87643252 True
or if you need the ids that were intersected
df.loc[df['date_intersection'], :].groupby(['id'])['id_2'].agg(list).to_frame('intersected_ids')
intersected_ids
id
21135765 [87643252]
41234132 [64564512]
64564512 [41234132]
87643252 [21135765]