How to remove datetime values in a row that are within a certain time relative to another row?

If I have a DataFrame as below:

	Letter	Time
0	x	2021-01-01 14:00:00
1	y	2021-01-01 18:00:00
2	y	2021-01-03 14:00:00

How would I delete a row if its value in the Time column (datetime) is within, say, 14 hours of the time in the row above?

I've tried using:

from datetime import datetime, timedelta
for i, row in enumerate(df):
    if i > 0:
        if df.at[i, 'Time'] - df.at[i-1, 'Time'] < timedelta(hours=14):
            df = df.drop(i)
        else:
            pass
    else:
        pass

but I get KeyError: 1 on the line

if df.at[i, 'Time'] - df.at[i-1, 'Time'] < timedelta(hours=14):

CodePudding user response：

You could create a boolean mask using shift + rsub (the difference between consecutive times) + div (conversion to hours) and filter with it:

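import numpy as np

# hours elapsed since the previous row; NaN for the first row, so that row is filtered out too
msk = df['Time'].shift().rsub(df['Time']).div(np.timedelta64(1, 'h')) > 14
out = df[msk]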

Output:

  Letter                Time
2      y 2021-01-03 14:00:00
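
If the first row should survive (it has no row above to be compared with), a variant of the same idea can be sketched with Series.diff. This is only a sketch under assumptions: it mirrors the `< timedelta(hours=14)` test from the question, so a gap of exactly 14 hours is kept, and the frame is rebuilt here from the question's sample data.

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    "Letter": ["x", "y", "y"],
    "Time": pd.to_datetime([
        "2021-01-01 14:00:00",
        "2021-01-01 18:00:00",
        "2021-01-03 14:00:00",
    ]),
})

# gap to the previous row; NaT for the first row
gap = df["Time"].diff()

# NaT < Timedelta evaluates to False, so the first row is never flagged
too_close = gap < pd.Timedelta(hours=14)

out = df[~too_close]
print(out)
```

Like the mask above, this compares each row with the row directly above it in the original frame, whether or not that row is itself removed.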

CodePudding user response：

If a timestamp is within 14 hours of an earlier timestamp, does its removal depend on whether that earlier timestamp is itself removed? This answer covers the case where it does, i.e. a timestamp is only compared against timestamps that are kept. (If it does not, the solution for the test data below would be the first timestamp only.)

setup

test data:

import pandas as pd

timestamps = pd.Series([0, 6, 10, 14, 16, 29, 33, 45, 46]).apply(pd.Timedelta, unit="hours") + pd.Timestamp("2022")

timestamps looks like this:

0   2022-01-01 00:00:00
1   2022-01-01 06:00:00
2   2022-01-01 10:00:00
3   2022-01-01 14:00:00
4   2022-01-01 16:00:00
5   2022-01-02 05:00:00
6   2022-01-02 09:00:00
7   2022-01-02 21:00:00
8   2022-01-02 22:00:00
dtype: datetime64[ns]

The solution we are aiming for consists of the 1st, 4th, 6th and 8th timestamps.
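
For comparison, the same target can be reached without any extra package by walking the sorted timestamps and remembering the last one kept. This is a minimal sketch, not part of the piso approach below; the names window, kept and last_kept are illustrative, and a gap of exactly 14 hours is treated as far enough apart (matching left-closed windows).

```python
import pandas as pd

hours = [0, 6, 10, 14, 16, 29, 33, 45, 46]
timestamps = pd.Series(pd.to_timedelta(hours, unit="h") + pd.Timestamp("2022"))

window = pd.Timedelta(hours=14)  # assumed threshold from the question
kept = []                        # index labels of the timestamps that survive
last_kept = None                 # most recent timestamp that was kept

for idx, ts in timestamps.sort_values().items():
    # compare only against the last kept timestamp, not against removed ones
    if last_kept is None or ts - last_kept >= window:
        kept.append(idx)
        last_kept = ts

result = timestamps.loc[kept]
print(result)
```

The loop is linear in the number of timestamps, so it may be preferable for long series, where a pairwise matrix grows quadratically.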

solution

This solution uses the piso (pandas interval set operations) package. The idea is to create a 14-hour window, i.e. an interval, for each of your timestamps and iteratively remove the timestamps that fall inside a window starting earlier.

import piso

# sort timestamps if not already sorted
timestamps = timestamps.sort_values()

# create 14 hour windows for each timestamp.  Can be left-closed or right-closed, but not both
intervals = pd.IntervalIndex.from_arrays(timestamps, timestamps + pd.Timedelta("14h"))

# create the "disjoint adjacency matrix", which indicates pairwise if intervals are disjoint
mat = piso.adjacency_matrix(intervals, edges="disjoint")

mat will be a dataframe whose index and columns are the intervals. mat.values looks like this:

array([[False, False, False,  True,  True,  True,  True,  True,  True],
       [False, False, False, False, False,  True,  True,  True,  True],
       [False, False, False, False, False,  True,  True,  True,  True],
       [ True, False, False, False, False,  True,  True,  True,  True],
       [ True, False, False, False, False, False,  True,  True,  True],
       [ True,  True,  True,  True, False, False, False,  True,  True],
       [ True,  True,  True,  True,  True, False, False, False, False],
       [ True,  True,  True,  True,  True,  True, False, False, False],
       [ True,  True,  True,  True,  True,  True, False, False, False]])

Set the diagonal of this matrix to True, so that each interval is kept when filtering on its own row:

mat.iloc[range(len(mat)),range(len(mat))] = True

We will start with the first interval. From the first row of mat you can deduce that the second and third intervals need to be dropped. So we filter out the rows and columns corresponding to these intervals, then move to the next interval (row), and so on until we reach the last row. Note that we do not need to check any intersections for the last row.

i = 0
while i < len(mat) - 1:
    # keep only the rows/columns that are True in row i; mat shrinks as we go
    mat = mat.loc[mat.iloc[i], mat.iloc[i]]
    i += 1

The result will be a dataframe whose values are all True. More importantly, the index (and columns) will be intervals whose left endpoints are the timestamps remaining after removing those within 14hrs.

i.e. pd.Series(mat.index.left) gives

0   2022-01-01 00:00:00
1   2022-01-01 14:00:00
2   2022-01-02 05:00:00
3   2022-01-02 21:00:00
dtype: datetime64[ns]

You can use this to filter your original dataframe with pandas.Series.isin.
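
A minimal sketch of that last filtering step, assuming the original dataframe has a Time column holding the test timestamps. The kept series is hard-coded here as a stand-in for pd.Series(mat.index.left), and the names df, kept and filtered are illustrative.

```python
import pandas as pd

hours = [0, 6, 10, 14, 16, 29, 33, 45, 46]
df = pd.DataFrame({"Time": pd.to_timedelta(hours, unit="h") + pd.Timestamp("2022")})

# stand-in for pd.Series(mat.index.left): the timestamps that remain
kept = pd.Series(pd.to_datetime([
    "2022-01-01 00:00:00",
    "2022-01-01 14:00:00",
    "2022-01-02 05:00:00",
    "2022-01-02 21:00:00",
]))

# keep only the rows whose Time is among the remaining timestamps
filtered = df[df["Time"].isin(kept)]
print(filtered)
```

If the Time column contains duplicate timestamps, isin keeps every matching row, so duplicates may need to be dropped separately.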