I have a pandas dataframe as follows
df_sample = pd.DataFrame({
'machine': [1, 1, 1, 2],
'ts_start': ["2022-01-01 20:00:00", "2022-01-01 20:30:00", "2022-01-02 20:30:00", "2022-01-01 19:00:00"],
'ts_end': ["2022-01-01 21:00:00", "2022-01-01 21:30:00", "2022-01-02 20:35:00", "2022-01-01 23:00:00"]
})
I want to check which of these [ts_start, ts_end] intervals are overlapped, for the same machine. I have seen some questions about finding overlaps, but couldn't find that grouped by another column, in my case considering the overlaps for each machine separately.
I tried using Piso which seems very interesting.
df_sample['ts_start'] = pd.to_datetime(df_sample['ts_start'])
df_sample['ts_end'] = pd.to_datetime(df_sample['ts_end'])
ii = pd.IntervalIndex.from_arrays(df_sample["ts_start"], df_sample["ts_end"])
df_sample["isOverlap"] = piso.adjacency_matrix(ii).any(axis=1).astype(int).values
I obtain something like this:
machine ts_start ts_end isOverlap
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00 1
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00 1
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00 0
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00 1
However, it is considering all machines at the same time. Is there a way (using piso or not) to get the overlapping moments, for each machine, in a single dataframe?
CodePudding user response:
Here's a way to do what your question asks:
import pandas as pd
df_sample = pd.DataFrame({
'machine': [1, 1, 1, 2],
'ts_start': ["2022-01-01 20:00:00", "2022-01-01 20:30:00", "2022-01-02 20:30:00", "2022-01-01 19:00:00"],
'ts_end': ["2022-01-01 21:00:00", "2022-01-01 21:30:00", "2022-01-02 20:35:00", "2022-01-01 23:00:00"]
})
df_sample = df_sample.sort_values(['machine', 'ts_start', 'ts_end'])
print(df_sample)
def foo(x):
if len(x.index) > 1:
iPrev, reachOfPrev = x.index[0], x.loc[x.index[0], 'ts_end'] if len(x.index) else None
x.loc[iPrev, 'isOverlap'] = 0
for i in x.index[1:]:
if x.loc[i,'ts_start'] < reachOfPrev:
x.loc[iPrev, 'isOverlap'] = 1
x.loc[i, 'isOverlap'] = 1
else:
x.loc[i, 'isOverlap'] = 0
if x.loc[i, 'ts_end'] > reachOfPrev:
iPrev, reachOfPrev = i, x.loc[i, 'ts_end']
else:
x['isOverlap'] = 0
x.isOverlap = x.isOverlap.astype(int)
return x
df_sample = df_sample.groupby('machine').apply(foo)
print(df_sample)
Input:
machine ts_start ts_end
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00
Output:
machine ts_start ts_end isOverlap
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00 1
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00 1
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00 0
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00 0
CodePudding user response:
Assuming the overlap is only checked up by minutes, you could try:
#create date ranges by minute frequency
df_sample["times"] = df_sample.apply(lambda row: pd.date_range(row["ts_start"], row["ts_end"], freq="1min"), axis=1)
#explode to get one row per minute
df_sample = df_sample.explode("times")
#check if times overlap by looking for duplicates
df_sample["isOverlap"] = df_sample[["machine","times"]].duplicated(keep=False)
#groupby to get back original data structure
output = df_sample.drop("times", axis=1).groupby(["machine","ts_start","ts_end"]).any().astype(int).reset_index()
>>> output
machine ts_start ts_end isOverlap
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00 1
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00 1
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00 0
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00 0