If I have a DataFrame as below:
| | Letter | Time |
|---|---|---|
| 0 | x | 2021-01-01 14:00:00 |
| 1 | y | 2021-01-01 18:00:00 |
| 2 | y | 2021-01-03 14:00:00 |
How would I delete a row if the value in its Time column (datetime) is within, say, 14 hours of the time in the row above?
I've tried using:
from datetime import datetime, timedelta
for i, row in enumerate(df):
    if i > 0:
        if df.at[i, 'Time'] - df.at[i-1, 'Time'] < timedelta(hours=14):
            df = df.drop(i)
        else:
            pass
    else:
        pass
but I get KeyError: 1 on the line
if df.at[i, 'Time'] - df.at[i-1, 'Time'] < timedelta(hours=14):
CodePudding user response:
You could create a boolean mask by using shift, rsub (finds the difference between consecutive times) and div (converts to hours), and filter with it:
import numpy as np
msk = df['Time'].shift().rsub(df['Time']).div(np.timedelta64(1, 'h')) > 14
out = df[msk]
Output:
Letter Time
2 y 2021-01-03 14:00:00
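Note that the first row has no row above it, so its difference is missing and the mask is False for it, which is why row 0 is absent from the output. If the first row should be kept, here is a minimal sketch using diff, assuming the question's three-row frame (rebuilt here for illustration) and assuming a gap of exactly 14 hours counts as far enough apart, as in the question's `<` test:

```python
import pandas as pd

# Rebuild the frame from the question (illustrative data only)
df = pd.DataFrame({
    "Letter": ["x", "y", "y"],
    "Time": pd.to_datetime([
        "2021-01-01 14:00:00",
        "2021-01-01 18:00:00",
        "2021-01-03 14:00:00",
    ]),
})

# Gap between each row and the row above; NaT for the first row
gap = df["Time"].diff()

# Keep the first row (no row above) and rows at least 14 hours after the previous one
keep = gap.isna() | (gap >= pd.Timedelta(hours=14))
out = df[keep]
print(out)
```

Like the mask above, this compares each row with the original row above it, whether or not that row is itself dropped.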
CodePudding user response:
If a timestamp is within 14 hours of an earlier timestamp, does its removal depend on whether the earlier timestamp is itself removed? This answer covers the case where the answer to that question is "yes". (If the answer is "no", the result for the test data below would be the first timestamp only.)
setup
test data:
import pandas as pd
timestamps = pd.Series([0, 6, 10, 14, 16, 29, 33, 45, 46]).apply(pd.Timedelta, unit="hours") + pd.Timestamp("2022")
timestamps looks like this:
0 2022-01-01 00:00:00
1 2022-01-01 06:00:00
2 2022-01-01 10:00:00
3 2022-01-01 14:00:00
4 2022-01-01 16:00:00
5 2022-01-02 05:00:00
6 2022-01-02 09:00:00
7 2022-01-02 21:00:00
8 2022-01-02 22:00:00
dtype: datetime64[ns]
The solution we are aiming for consists of the 1st, 4th, 6th and 8th timestamps.
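To make the target concrete, the same rule can be sketched as a plain greedy pass: keep a timestamp only if it is at least 14 hours after the last timestamp that was kept. This is only an illustration of the expected result, not the method used in this answer. The names `window` and `kept` are made up for the sketch, and it assumes a gap of exactly 14 hours counts as outside the window.

```python
import pandas as pd

# Same test data as above, built from hour offsets
hours = [0, 6, 10, 14, 16, 29, 33, 45, 46]
timestamps = pd.Series(pd.to_timedelta(hours, unit="h") + pd.Timestamp("2022"))

window = pd.Timedelta(hours=14)
kept = []
for ts in timestamps.sort_values():
    # Compare against the last kept timestamp, not simply the previous row
    if not kept or ts - kept[-1] >= window:
        kept.append(ts)

result = pd.Series(kept)
print(result)
```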
solution
This solution uses the piso (pandas interval set operations) package. The idea is to create a 14-hour window, i.e. an interval, for each of your timestamps and iteratively remove timestamps which belong to intervals starting earlier.
import piso
# sort timestamps if not already sorted
timestamps = timestamps.sort_values()
# create 14 hour windows for each timestamp. Can be left-closed or right-closed, but not both
intervals = pd.IntervalIndex.from_arrays(timestamps, timestamps + pd.Timedelta("14h"))
# create the "disjoint adjacency matrix", which indicates pairwise if intervals are disjoint
mat = piso.adjacency_matrix(intervals, edges="disjoint")
mat will be a dataframe whose index and columns are the intervals built from timestamps. mat.values looks like this:
array([[False, False, False, True, True, True, True, True, True],
[False, False, False, False, False, True, True, True, True],
[False, False, False, False, False, True, True, True, True],
[ True, False, False, False, False, True, True, True, True],
[ True, False, False, False, False, False, True, True, True],
[ True, True, True, True, False, False, False, True, True],
[ True, True, True, True, True, False, False, False, False],
[ True, True, True, True, True, True, False, False, False],
[ True, True, True, True, True, True, False, False, False]])
Set the diagonal of this matrix to True:
mat.iloc[range(len(mat)),range(len(mat))] = True
We will start with the first interval. From the first row of mat you can deduce that the second and third intervals need to be dropped. So we filter out the rows and columns corresponding to these intervals, then move to the next interval (row), and so on until we reach the last row. Note that we do not need to check any intersections for the last row.
i = 0
while i < len(mat) - 1:
    # keep only the rows and columns that row i marks as True
    mat = mat.loc[mat.iloc[i], mat.iloc[i]]
    i += 1
The result will be a dataframe whose values are all True. More importantly, the index (and columns) will be intervals whose left endpoints are the timestamps remaining after removing those within 14hrs.
i.e. pd.Series(mat.index.left)
gives
0 2022-01-01 00:00:00
1 2022-01-01 14:00:00
2 2022-01-02 05:00:00
3 2022-01-02 21:00:00
dtype: datetime64[ns]
You can use this to filter your original dataframe using pandas.Series.isin
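As a rough sketch of that last step, assume the original dataframe has a "Time" column holding the nine test timestamps. The frame and its "Letter" values are hypothetical, and the kept timestamps are typed in from the output above rather than recomputed.

```python
import pandas as pd

# Hypothetical frame holding the nine test timestamps in a "Time" column
hours = [0, 6, 10, 14, 16, 29, 33, 45, 46]
df = pd.DataFrame({
    "Letter": list("abcdefghi"),
    "Time": pd.to_timedelta(hours, unit="h") + pd.Timestamp("2022"),
})

# The four timestamps that remain, copied from the output above
kept = pd.Series(pd.to_datetime([
    "2022-01-01 00:00:00",
    "2022-01-01 14:00:00",
    "2022-01-02 05:00:00",
    "2022-01-02 21:00:00",
]))

# Keep only the rows whose Time is one of the remaining timestamps
filtered = df[df["Time"].isin(kept)]
print(filtered)
```

In practice, kept would be pd.Series(mat.index.left) from the steps above.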