Let's take this sample dataframe :
df = pd.DataFrame({'ID':[1,1,2,2,3],'Date_min':["2021-01-01","2021-01-20","2021-01-28","2021-01-01","2021-01-02"],'Date_max':["2021-01-23","2021-12-01","2021-09-01","2021-01-15","2021-01-09"]})
df["Date_min"] = df["Date_min"].astype('datetime64')
df["Date_max"] = df["Date_max"].astype('datetime64')
ID Date_min Date_max
0 1 2021-01-01 2021-01-23
1 1 2021-01-20 2021-12-01
2 2 2021-01-28 2021-09-01
3 2 2021-01-01 2021-01-15
4 3 2021-01-02 2021-01-09
I would like to check for each ID
if there are overlapping date ranges. I can use a loopy solution as the following one but it is not efficient and consequently quite slow with a real big dataframe :
L_output = []
for index, row in df.iterrows() :
if len(df[(df["ID"]==row["ID"]) & (df["Date_min"]<= row["Date_min"]) &
(df["Date_max"]>= row["Date_min"])].index)>1:
print("overlapping date ranges for ID %d" %row["ID"])
L_output.append(row["ID"])
Output :
overlapping date ranges for ID 1
Would you know please a better way to check that ID 1 has overlapping date ranges ?
Expected output :
[1]
CodePudding user response:
Try:
- Create a column "Dates" that contains a list of dates from "Date_min" to "Date_max" for each row
explode
the "Dates" columns- get the duplicated rows
df["Dates"] = df.apply(lambda row: pd.date_range(row["Date_min"], row["Date_max"]), axis=1)
df = df.explode("Dates").drop(["Date_min", "Date_max"], axis=1)
output = df[df.duplicated()]
>>> output
ID Dates
1 1 2021-01-20
1 1 2021-01-21
1 1 2021-01-22
1 1 2021-01-23
CodePudding user response:
You can transform your datetime objects into timestamps. Then, construct pd.Interval
objects and create a generator of all possible intervals combinations for each ID:
from itertools import combinations
import pandas as pd
def group_has_overlap(group):
timestamps = group[["Date_min", "Date_max"]].values.tolist()
for t1, t2 in combinations(timestamps, 2):
i1 = pd.Interval(t1[0], t1[1])
i2 = pd.Interval(t2[0], t2[1])
if i1.overlaps(i2):
return True
return False
for ID, group in df.groupby("ID"):
print(ID, group_has_overlap(group))
Output is :
1 True
2 False
3 False