Home > front end >  Find intersection in Dataframe with Start and Enddate in Python
Find intersection in Dataframe with Start and Enddate in Python

Time:01-25

I have a dataframe of elements with a start and end datetime. What is the best option to find intersections of the dates? My naive approach right now consists of two nested loops cross-comparing the elements, which obviously is super slow. What would be a better way to achieve that?

dict = {}
start = "start_time"
end = "end_time"

for index1, rowLoop1 in df[{start, end}].head(500).iterrows():
    matches = []
    dict[(index1, rowLoop1[start])] = 0
    for index2, rowLoop2 in df[{start,end}].head(500).iterrows():
        if index1 != index2:
            if date_intersection(rowLoop1[start], rowLoop1[end], rowLoop2[start], rowLoop2[end]):
                dict[(index1, rowLoop1[start])]  = 1

Code for date_intersection:

def date_intersection(t1start, t1end, t2start, t2end):  
    if (t1start <= t2start <= t2end <= t1end): return True
    elif (t1start <= t2start <= t1end):return True
    elif (t1start <= t2end <= t1end):return True
    elif (t2start <= t1start <= t1end <= t2end):return True
    else: return False

Sample data:

id,start_date,end_date
41234132,2021-01-10 10:00:05,2021-01-10 10:30:27
64564512,2021-01-10 10:10:00,2021-01-11 10:28:00
21135765,2021-01-12 12:30:00,2021-01-12 12:38:00
87643252,2021-01-12 12:17:00,2021-01-12 12:42:00
87641234,2021-01-12 12:58:00,2021-01-12 13:17:00

CodePudding user response:

You can do something like merging your dataframe with itself to get the cartesian product and comparing columns.

df = df.merge(df, how='cross', suffixes=('','_2'))

df['date_intersection'] =  (((df['start_date'].le(df['start_date_2']) & df['start_date_2'].le(df['end_date'])) |    # start 2 within start/end
                             (df['start_date'].le(df['end_date_2']) & df['end_date_2'].le(df['end_date'])) |        # end 2 within start/end
                             (df['start_date_2'].le(df['start_date']) & df['start_date'].le(df['end_date_2'])) |    # start within start 2/end 2                      
                             (df['start_date_2'].le(df['end_date']) & df['end_date'].le(df['end_date_2']))) &       # end within start 2/end 2  
                            df['id'].ne(df['id_2'])) # id not compared to itself

and then to return the ids and if they have a date intersection...

df.groupby('id')['date_intersection'].any()

id
21135765     True
41234132     True
64564512     True
87641234    False
87643252     True

or if you need the ids that were intersected

df.loc[df['date_intersection'], :].groupby(['id'])['id_2'].agg(list).to_frame('intersected_ids')

         intersected_ids
id                      
21135765      [87643252]
41234132      [64564512]
64564512      [41234132]
87643252      [21135765]
  •  Tags:  
  • Related