I have a dataset that looks like this:
| | id1 | id2 | time |
---|---|---|---|
0 | 56 | 99 | 2007-04-06 15:49:21 |
1 | 56 | 104 | 2007-04-06 18:26:13 |
2 | 56 | 104 | 2007-04-13 11:27:52 |
3 | 56 | 104 | 2007-04-13 11:28:41 |
4 | 56 | 104 | 2007-04-13 11:28:52 |
5 | 56 | 104 | 2007-04-13 11:33:25 |
6 | 56 | 104 | 2007-04-13 14:35:52 |
7 | 104 | 56 | 2007-04-13 11:28:23 |
8 | 104 | 56 | 2007-04-13 11:29:46 |
9 | 128 | 105 | 2007-03-27 18:39:45 |
10 | 217 | 256 | 2007-03-29 14:55:57 |
I would like to drop all observations where, for the same pair of IDs, the time value is within 5 minutes of the previous row. It should also be "rolling": if there are three observations where the second is 4 minutes after the first and the third is 4 minutes after the second, I keep only the first row. It also doesn't matter whether an ID is in the id1 or the id2 column.
So the output of the dataframe above should be:
| | id1 | id2 | time |
---|---|---|---|
0 | 56 | 99 | 2007-04-06 15:49:21 |
1 | 56 | 104 | 2007-04-06 18:26:13 |
2 | 56 | 104 | 2007-04-13 11:27:52 |
3 | 56 | 104 | 2007-04-13 14:35:52 |
4 | 128 | 105 | 2007-03-27 18:39:45 |
5 | 217 | 256 | 2007-03-29 14:55:57 |
The best I could come up with is:
for i in range(1, len(df)):
    if df['time'].iloc[i] <= df['time'].iloc[i-1] + pd.Timedelta(minutes=5):
        df = df.drop(i)
        df = df.reset_index(drop=True)
    else:
        continue
But: 1. It raises an "indexer is out-of-bounds" error. 2. It doesn't "roll". 3. It distinguishes whether an ID is in the id1 or the id2 column.
Thank you in advance for your help!
CodePudding user response:
m = df.groupby(['id1', 'id2'], as_index=False)['time'].transform(lambda g: g.diff()).le(pd.Timedelta(minutes=5))['time']
df = df.loc[~(m | m.shift(-1))]
Step-by-step explanation:
You can group by the id1 and id2 columns and diff the time column:
diff = df.groupby(['id1', 'id2'], as_index=False)['time'].transform(lambda g: g.diff())
print(diff)
time
0 NaT
1 NaT
2 6 days 16:55:39
3 0 days 00:06:49
4 0 days 00:00:11
5 0 days 00:05:33
6 0 days 03:01:27
7 NaT
8 0 days 00:01:23
9 NaT
10 NaT
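As a smaller illustration of this step, the sketch below computes the same kind of per-pair gaps with `groupby(...)['time'].diff()`, which returns a plain Series. The toy frame is an assumption built from a few rows of the question's data, and the names `toy` and `gaps` are illustrative only.

```python
import pandas as pd

# Toy frame (assumption): a few rows taken from the question's data.
toy = pd.DataFrame({
    'id1': [56, 56, 56, 104, 104],
    'id2': [104, 104, 104, 56, 56],
    'time': pd.to_datetime([
        '2007-04-06 18:26:13', '2007-04-13 11:27:52', '2007-04-13 11:28:41',
        '2007-04-13 11:28:23', '2007-04-13 11:29:46',
    ]),
})

# Gap to the previous row of the same (id1, id2) pair.
# The first row of each pair has no predecessor, so it gets NaT.
gaps = toy.groupby(['id1', 'id2'])['time'].diff()
print(gaps)
```

Note that (56, 104) and (104, 56) end up in different groups here, so row 3 gets NaT even though its timestamp falls between those of rows 1 and 2.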
Then compare it with 5 minutes
m = diff.le(pd.Timedelta(minutes=5))
print(m)
time
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 True
9 False
10 False
Then also set the row before each True to True:
m = m | m.shift(-1)
print(m)
time
0 False
1 False
2 False
3 True
4 True
5 False
6 False
7 True
8 True
9 False
10 False
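To see what the shift does in isolation, here is a minimal sketch on a made-up boolean mask (the values are an assumption chosen for illustration, not taken from the data above). It uses `fill_value=False` so the last row, which has no successor, stays boolean.

```python
import pandas as pd

# Toy mask (assumption): True marks a row that is within 5 minutes of the row before it.
m = pd.Series([False, False, True, True, False])

# shift(-1) moves every value one position up, so each row sees the flag of the row after it.
next_flag = m.shift(-1, fill_value=False)

# A row is flagged if it is True itself or if the row after it is True.
combined = m | next_flag
print(combined.tolist())  # [False, True, True, True, False]
```

Position 1 is the first row of the run and is flagged as well, so selecting `~combined` removes the whole run, including its first row.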
Finally, use boolean indexing to select the rows where the mask is False:
df = df.loc[~m['time']]
print(df)
id1 id2 time
0 56 99 2007-04-06 15:49:21
1 56 104 2007-04-06 18:26:13
2 56 104 2007-04-13 11:21:52
5 56 104 2007-04-13 11:34:25
6 56 104 2007-04-13 14:35:52
9 128 105 2007-03-27 18:39:45
10 217 256 2007-03-29 14:55:57
CodePudding user response:
I think there is still a mistake in your expected output. Entry with index 5 is also valid, so index 6 is not.
Here is a solution using groupby and diff:
import numpy as np
import pandas as pd
MIN_LIMIT = 5  # minutes
def remove_duplicated_entries(df):
    time_diff = df['time'].diff()
    # Compare the full Timedelta; .dt.seconds would ignore the days component of the gap.
    return df[(time_diff > pd.Timedelta(minutes=MIN_LIMIT)) | (time_diff.isna())]
# Created to sort ids to ignore the order in the groupby. You can reuse id1 and id2 instead if you don't care.
df[['id_min', 'id_max']] = np.sort(df[['id1', 'id2']], axis=1)
clean_df = df.sort_values('time').groupby(['id_min', 'id_max'], as_index=False).apply(remove_duplicated_entries).reset_index(drop=True).drop(columns=['id_min', 'id_max'])
Which results in:
id1 id2 time
0 56 99 2007-04-06 15:49:21
1 56 104 2007-04-06 18:26:13
2 56 104 2007-04-13 11:27:52
3 56 104 2007-04-13 11:34:25
4 56 104 2007-04-13 14:35:52
5 128 105 2007-03-27 18:39:45
6 217 256 2007-03-29 14:55:57
Please note that using diff assumes the time column is sorted in ascending order (hence the sort_values to make sure it is the case).
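A related detail when thresholding the result of diff: `Series.dt.seconds` returns only the seconds component of each Timedelta (0 to 86399) and ignores whole days, whereas comparing against a `pd.Timedelta` uses the full duration. The sketch below uses made-up gaps (an assumption, for illustration only) to show the difference.

```python
import pandas as pd

# Made-up gaps (assumption): 4 minutes, 5 min 33 s, and 1 day + 1 minute.
gaps = pd.Series(pd.to_timedelta(['0 days 00:04:00', '0 days 00:05:33', '1 days 00:01:00']))

# Seconds component only: the 1-day gap looks like 60 seconds.
by_seconds_component = gaps.dt.seconds > 5 * 60

# Full duration: the 1-day gap is correctly seen as longer than 5 minutes.
by_timedelta = gaps > pd.Timedelta(minutes=5)

print(by_seconds_component.tolist())  # [False, True, False]
print(by_timedelta.tolist())          # [False, True, True]
```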
EDIT after discussion (see comments): The following test set breaks the answer from @Ynjxsjmh too (note the new items at the end):
df = pd.DataFrame({'id1':[56, 56, 56, 56, 56, 56, 56, 128, 217, 104, 104],
'id2':[99, 104, 104, 104, 104, 104, 104, 105, 256, 56, 56],
'time': pd.to_datetime(['2007-04-06 15:49:21', '2007-04-06 18:26:13', '2007-04-13 11:27:52', '2007-04-13 11:28:41',
'2007-04-13 11:28:52', '2007-04-13 11:34:25', '2007-04-13 14:35:52',
'2007-03-27 18:39:45', '2007-03-29 14:55:57', '2007-04-13 11:34:35', '2007-04-13 14:36:35'])})
The result with their answer is:
id1 id2 time
0 56 99 2007-04-06 15:49:21
1 56 104 2007-04-06 18:26:13
5 56 104 2007-04-13 11:34:25
6 56 104 2007-04-13 14:35:52
7 128 105 2007-03-27 18:39:45
8 217 256 2007-03-29 14:55:57
9 104 56 2007-04-13 11:34:35
10 104 56 2007-04-13 14:36:35
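For comparison, here is a minimal sketch (not either answer verbatim) that combines an order-insensitive pair key with a per-pair diff on time-sorted rows, run on the test set above. It assumes the rule is "drop a row when it is at most 5 minutes after the previous row of the same pair"; the helper column names id_min and id_max are borrowed from the code above, and the other names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id1': [56, 56, 56, 56, 56, 56, 56, 128, 217, 104, 104],
    'id2': [99, 104, 104, 104, 104, 104, 104, 105, 256, 56, 56],
    'time': pd.to_datetime([
        '2007-04-06 15:49:21', '2007-04-06 18:26:13', '2007-04-13 11:27:52',
        '2007-04-13 11:28:41', '2007-04-13 11:28:52', '2007-04-13 11:34:25',
        '2007-04-13 14:35:52', '2007-03-27 18:39:45', '2007-03-29 14:55:57',
        '2007-04-13 11:34:35', '2007-04-13 14:36:35',
    ]),
})

# Order-insensitive pair key: (104, 56) and (56, 104) both become (56, 104).
pair = np.sort(df[['id1', 'id2']].to_numpy(), axis=1)
keyed = df.assign(id_min=pair[:, 0], id_max=pair[:, 1]).sort_values('time')

# Gap to the previous row of the same pair, in time order (NaT for the first row of a pair).
gap = keyed.groupby(['id_min', 'id_max'])['time'].diff()

# Keep the first row of each pair and every row more than 5 minutes after its predecessor.
keep = gap.isna() | (gap > pd.Timedelta(minutes=5))
result = keyed.loc[keep, ['id1', 'id2', 'time']].sort_index()
print(result)
```

On this test set the sketch keeps the rows with original index 0, 1, 2, 5, 6, 7 and 8: row 2 (the first of the 11:27 to 11:28 run) survives, and rows 9 and 10 are dropped because they follow rows 5 and 6 of the same pair by less than a minute.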
CodePudding user response:
If there is no order (ascending/descending) for the time column, use the code from @Ynjxsjmh as is, and modify the code from @user2246849 by removing sort_values('time').
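When deciding whether to sort, it may help to check how diff behaves on unsorted timestamps: it subtracts the previous row as stored, so gaps can come out negative, and a negative gap passes a "<= 5 minutes" test. The sketch below uses three timestamps from the thread in a deliberately shuffled order (the ordering is an assumption for illustration).

```python
import pandas as pd

# Three timestamps stored out of chronological order (assumption for illustration).
times = pd.Series(pd.to_datetime([
    '2007-04-13 11:34:25', '2007-04-13 11:27:52', '2007-04-13 11:28:41',
]))
limit = pd.Timedelta(minutes=5)

# Row order as stored: the second gap is negative (-6 min 33 s).
unsorted_gaps = times.diff()

# Chronological order, then mapped back to the original positions.
sorted_gaps = times.sort_values().diff().sort_index()

print((unsorted_gaps <= limit).tolist())  # [False, True, True]
print((sorted_gaps <= limit).tolist())    # [False, False, True]
```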
Thank you both for sharing these different approaches.