How to check if date ranges are overlapping in a pandas dataframe according to a categorical column?-CodePudding

Let's take this sample dataframe :

df = pd.DataFrame({'ID':[1,1,2,2,3],'Date_min':["2021-01-01","2021-01-20","2021-01-28","2021-01-01","2021-01-02"],'Date_max':["2021-01-23","2021-12-01","2021-09-01","2021-01-15","2021-01-09"]})
df["Date_min"] = df["Date_min"].astype('datetime64')
df["Date_max"] = df["Date_max"].astype('datetime64')

   ID   Date_min   Date_max
0   1 2021-01-01 2021-01-23
1   1 2021-01-20 2021-12-01
2   2 2021-01-28 2021-09-01
3   2 2021-01-01 2021-01-15
4   3 2021-01-02 2021-01-09

I would like to check for each ID if there are overlapping date ranges. I can use a loopy solution as the following one but it is not efficient and consequently quite slow with a real big dataframe :

L_output = []
for index, row in df.iterrows() :
    if len(df[(df["ID"]==row["ID"]) & (df["Date_min"]<= row["Date_min"]) & 
              (df["Date_max"]>= row["Date_min"])].index)>1:
        print("overlapping date ranges for ID %d" %row["ID"])
        L_output.append(row["ID"])

Output :

overlapping date ranges for ID 1

Would you know please a better way to check that ID 1 has overlapping date ranges ?

Expected output :

[1]

CodePudding user response：

Try:

Create a column "Dates" that contains a list of dates from "Date_min" to "Date_max" for each row
explode the "Dates" columns
get the duplicated rows

df["Dates"] = df.apply(lambda row: pd.date_range(row["Date_min"], row["Date_max"]), axis=1)
df = df.explode("Dates").drop(["Date_min", "Date_max"], axis=1)
output = df[df.duplicated()]

>>> output
   ID      Dates
1   1 2021-01-20
1   1 2021-01-21
1   1 2021-01-22
1   1 2021-01-23

CodePudding user response：

You can transform your datetime objects into timestamps. Then, construct pd.Interval objects and create a generator of all possible intervals combinations for each ID:

from itertools import combinations
import pandas as pd

def group_has_overlap(group):
    timestamps = group[["Date_min", "Date_max"]].values.tolist()
    for t1, t2 in combinations(timestamps, 2):
        i1 = pd.Interval(t1[0], t1[1])
        i2 = pd.Interval(t2[0], t2[1])
        if i1.overlaps(i2):
            return True
    return False

for ID, group in df.groupby("ID"):
    print(ID, group_has_overlap(group))

Output is :

1 True
2 False
3 False