Get overlapping datetimes from pandas dataframe, considering other field-CodePudding

I have a pandas dataframe as follows

df_sample = pd.DataFrame({
        'machine': [1, 1, 1, 2],
        'ts_start': ["2022-01-01 20:00:00", "2022-01-01 20:30:00", "2022-01-02 20:30:00", "2022-01-01 19:00:00"],
        'ts_end': ["2022-01-01 21:00:00", "2022-01-01 21:30:00", "2022-01-02 20:35:00", "2022-01-01 23:00:00"]
})

I want to check which of these [ts_start, ts_end] intervals are overlapped, for the same machine. I have seen some questions about finding overlaps, but couldn't find that grouped by another column, in my case considering the overlaps for each machine separately.

I tried using Piso which seems very interesting.

df_sample['ts_start'] = pd.to_datetime(df_sample['ts_start'])
df_sample['ts_end'] = pd.to_datetime(df_sample['ts_end'])

ii = pd.IntervalIndex.from_arrays(df_sample["ts_start"], df_sample["ts_end"])
df_sample["isOverlap"] = piso.adjacency_matrix(ii).any(axis=1).astype(int).values

I obtain something like this:

machine    ts_start             ts_end             isOverlap
0    1 2022-01-01 20:00:00 2022-01-01 21:00:00          1
1    1 2022-01-01 20:30:00 2022-01-01 21:30:00          1
2    1 2022-01-02 20:30:00 2022-01-02 20:35:00          0
3    2 2022-01-01 19:00:00 2022-01-01 23:00:00          1

However, it is considering all machines at the same time. Is there a way (using piso or not) to get the overlapping moments, for each machine, in a single dataframe?

CodePudding user response：

Here's a way to do what your question asks:

import pandas as pd
df_sample = pd.DataFrame({
        'machine': [1, 1, 1, 2],
        'ts_start': ["2022-01-01 20:00:00", "2022-01-01 20:30:00", "2022-01-02 20:30:00", "2022-01-01 19:00:00"],
        'ts_end': ["2022-01-01 21:00:00", "2022-01-01 21:30:00", "2022-01-02 20:35:00", "2022-01-01 23:00:00"]
})
df_sample = df_sample.sort_values(['machine', 'ts_start', 'ts_end'])
print(df_sample)
def foo(x):
    if len(x.index) > 1:
        iPrev, reachOfPrev = x.index[0], x.loc[x.index[0], 'ts_end'] if len(x.index) else None
        x.loc[iPrev, 'isOverlap'] = 0
        for i in x.index[1:]:
            if x.loc[i,'ts_start'] < reachOfPrev:
                x.loc[iPrev, 'isOverlap'] = 1
                x.loc[i, 'isOverlap'] = 1
            else:
                x.loc[i, 'isOverlap'] = 0
            if x.loc[i, 'ts_end'] > reachOfPrev:
                iPrev, reachOfPrev = i, x.loc[i, 'ts_end']
    else:
        x['isOverlap'] = 0
    x.isOverlap = x.isOverlap.astype(int)
    return x
    
df_sample = df_sample.groupby('machine').apply(foo)
print(df_sample)

Input:

   machine             ts_start               ts_end
0        1  2022-01-01 20:00:00  2022-01-01 21:00:00
1        1  2022-01-01 20:30:00  2022-01-01 21:30:00
2        1  2022-01-02 20:30:00  2022-01-02 20:35:00
3        2  2022-01-01 19:00:00  2022-01-01 23:00:00

Output:

   machine             ts_start               ts_end  isOverlap
0        1  2022-01-01 20:00:00  2022-01-01 21:00:00          1
1        1  2022-01-01 20:30:00  2022-01-01 21:30:00          1
2        1  2022-01-02 20:30:00  2022-01-02 20:35:00          0
3        2  2022-01-01 19:00:00  2022-01-01 23:00:00          0

CodePudding user response：

Assuming the overlap is only checked up by minutes, you could try:

#create date ranges by minute frequency
df_sample["times"] = df_sample.apply(lambda row: pd.date_range(row["ts_start"], row["ts_end"], freq="1min"), axis=1)

#explode to get one row per minute
df_sample = df_sample.explode("times")

#check if times overlap by looking for duplicates
df_sample["isOverlap"] = df_sample[["machine","times"]].duplicated(keep=False)

#groupby to get back original data structure
output = df_sample.drop("times", axis=1).groupby(["machine","ts_start","ts_end"]).any().astype(int).reset_index()

>>> output
   machine             ts_start               ts_end  isOverlap
0        1  2022-01-01 20:00:00  2022-01-01 21:00:00          1
1        1  2022-01-01 20:30:00  2022-01-01 21:30:00          1
2        1  2022-01-02 20:30:00  2022-01-02 20:35:00          0
3        2  2022-01-01 19:00:00  2022-01-01 23:00:00          0