How to find the unique date ranges on a dataframe with overlapping times and minimize processing time


I have a data frame of about 12 million rows. Each unique user has various date ranges during which they had a request open. These requests can overlap, so I want to grab the unique ranges; where they overlap, I want to either break the ranges out into unique rows or take the max range, whichever is faster.

I have two main issues I am running into:

  1. My query takes over 24 hours to run. I believe it is highly inefficient, but I am stumped on how to improve its performance.
  2. The current code does not completely remove overlaps, as the returned arrays can vary, e.g. {1,4,5} and {1,5}, which the groupby will treat as separate unique entries.

Below is an example of the code:

import pandas as pd
import random
import numpy as np

base_d = pd.to_datetime('2021-01-16')
start = [base_d + pd.to_timedelta(i,'days') for i in range(1,2000)]
end = [x + pd.to_timedelta(random.choice([1,2,3]),'days') for x in start]
user = [random.choice(["a","b","c"]) for i in end]


df = pd.DataFrame(data=zip(start,end,user),columns=['start','end','user'])



    
df.loc[:, 'shifted'] = df['start'].gt(df['end'].shift()).groupby(df['user']).cumsum()

new_df = df.sort_values(by=['user', 'start'], ascending=True)

same_grp_msk = new_df['user'] == new_df['user'].shift()

new_df.reset_index(inplace=True)

new_df.loc[same_grp_msk, 'end_shift'] = new_df.loc[same_grp_msk, 'end'].shift(-1)
new_df.loc[~same_grp_msk, 'end_shift'] = np.nan

new_df.loc[same_grp_msk, 'shifted'] = (new_df.loc[same_grp_msk, 'start'] >= new_df.loc[same_grp_msk, 'end_shift']) * 1
new_df.loc[~same_grp_msk, 'shifted'] = 0

new_df.loc[:, 'Interval'] = new_df.apply(lambda x: pd.Interval(left=x['start'], right=x['end']), axis=1)

def overlap_detect(interval_v, interval_array, index):
    # mark every interval of the same user that overlaps the current one
    overlap_msk = interval_array['Interval'].map(lambda x: x.overlaps(interval_v))
    return set([index] + list(interval_array.loc[overlap_msk, 'index']))

new_df.loc[:, 'Overlap key'] = new_df.apply(
    lambda x: overlap_detect(x['Interval'], new_df.loc[new_df['user'] == x['user'], ['Interval', 'index']], x['index']),
    axis=1,
)

The apply function is the piece that takes over a day to run, but I am unsure how to do this calculation otherwise.

CodePudding user response:

First issue:

new_df.loc[:,'Overlap key'] = new_df.apply(lambda x:overlap_detect(x['Interval'],new_df.loc[new_df['user']==x['user'],['Interval','index']],x['index']),axis=1)

Solution:

new_df = (
    new_df.groupby('user')
          .apply(lambda d: pd.arrays.IntervalArray.from_arrays(d["start"], d["end"], closed="left"))
          .reset_index(name="ranges")  # name the interval column so x["ranges"] below resolves
)

I had overcomplicated the issue. This solution takes about 2 minutes versus the 25 hours I was looking at before.

The second issue was identifying the overlaps and getting the max ranges.

Solution using the piso library recommended by user Riley:

new_df.loc[:,"Downtime"] = new_df.apply(lambda x: piso.union(x["ranges"]),axis=1)

This provides me with the proper subset of time ranges without overlap or duplication.
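For illustration, here is a minimal, self-contained sketch of what piso.union does to a single IntervalArray; the dates are made up for the example:

import pandas as pd
import piso

# three intervals for one user; the first two overlap
ranges = pd.arrays.IntervalArray.from_arrays(
    pd.to_datetime(["2021-01-01", "2021-01-03", "2021-01-10"]),
    pd.to_datetime(["2021-01-05", "2021-01-07", "2021-01-12"]),
    closed="left",
)

# the two overlapping intervals collapse into one, leaving
# [2021-01-01, 2021-01-07) and [2021-01-10, 2021-01-12)
print(piso.union(ranges))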

Next I extracted the array into rows, and then into start and end columns:

from operator import attrgetter

new_df = new_df.explode("Downtime")
new_df["Start"] = new_df["Downtime"].map(attrgetter('left'))
new_df["End"] = new_df["Downtime"].map(attrgetter('right'))

CodePudding user response:

Setup

import pandas as pd
import random

base_d = pd.to_datetime('2021-01-16')
start = [base_d + pd.to_timedelta(i,'days') for i in range(1,2000)]
end = [x + pd.to_timedelta(random.choice([1,2,3]),'days') for x in start]
user = [random.choice(["a","b","c"]) for i in end]


df = pd.DataFrame(data=zip(start,end,user),columns=['start','end','user'])

Solution

Using piso (pandas interval set operations):

import piso

# create pandas Series where values are IntervalIndex
intervals_by_user = df.groupby("user").apply(lambda d: pd.IntervalIndex.from_arrays(d["start"], d["end"]))

intervals_by_user looks like this:

user
a    IntervalIndex([(2021-01-18, 2021-01-21], (2021...
b    IntervalIndex([(2021-01-19, 2021-01-21], (2021...
c    IntervalIndex([(2021-01-17, 2021-01-20], (2021...
dtype: object

Apply the piso.union function to each of these IntervalIndex objects, which will combine the overlapping intervals. This gives us a pandas Series again.

disjoint_intervals_by_user = intervals_by_user.apply(piso.union)

Convert the Series back into dataframe format:

new_df = pd.concat(
    [
        pd.DataFrame({"start":ii.left, "end":ii.right}).assign(user=user)
        for user, ii in disjoint_intervals_by_user.items()
    ]
).reset_index(drop=True)

new_df:

         start        end user
0   2021-01-18 2021-01-23    a
1   2021-01-26 2021-01-28    a
2   2021-01-29 2021-02-01    a
3   2021-02-02 2021-02-08    a
4   2021-02-09 2021-02-12    a
..         ...        ...  ...
897 2026-06-04 2026-06-06    c
898 2026-06-07 2026-06-11    c
899 2026-06-19 2026-06-23    c
900 2026-06-25 2026-07-02    c
901 2026-07-07 2026-07-08    c

No more overlaps... This runs over 1000x faster than the code you pasted, so hopefully it's doing what you need.
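If you want to double-check that, pandas can confirm the merged intervals for each user are sorted and pairwise disjoint; a quick sanity check using the is_non_overlapping_monotonic property:

# expect True for every user
check = new_df.groupby("user").apply(
    lambda d: pd.IntervalIndex.from_arrays(d["start"], d["end"]).is_non_overlapping_monotonic
)
print(check)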

You could also create a function (let's call it remove_overlaps) which takes a dataframe with start and end columns, creates the corresponding IntervalIndex, passes it to piso.union, and converts the result back into dataframe format before returning it; then use df.groupby("user").apply(remove_overlaps). A sketch of this is below.
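A minimal sketch of that helper, assuming the same piso-based steps as above (the name remove_overlaps comes from the suggestion; the body is one possible implementation):

import pandas as pd
import piso

def remove_overlaps(d):
    # merge one user's overlapping intervals into a disjoint set
    ii = pd.IntervalIndex.from_arrays(d["start"], d["end"])
    merged = piso.union(ii)
    return pd.DataFrame({"start": merged.left, "end": merged.right})

new_df = (
    df.groupby("user")
      .apply(remove_overlaps)
      .reset_index(level=1, drop=True)  # drop the inner integer level
      .reset_index()                    # turn "user" back into a column
)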
