Pandas drop rows in consecutive time range and with same pair of features

I have a dataset that looks like this:

	id1	id2	time
0	56	99	2007-04-06 15:49:21
1	56	104	2007-04-06 18:26:13
2	56	104	2007-04-13 11:27:52
3	56	104	2007-04-13 11:28:41
4	56	104	2007-04-13 11:28:52
5	56	104	2007-04-13 11:33:25
6	56	104	2007-04-13 14:35:52
7	104	56	2007-04-13 11:28:23
8	104	56	2007-04-13 11:29:46
9	128	105	2007-03-27 18:39:45
10	217	256	2007-03-29 14:55:57

I would like to drop all observations where, for the same pair of IDs, the time value is within 5 minutes of the previous row. It should also be "rolling": if there are three observations where the second is 4 minutes after the first and the third is 4 minutes after the second, I only keep the first row. It also doesn't matter whether an ID is in the id1 or id2 column.

So the output of the dataframe above should be:

	id1	id2	time
0	56	99	2007-04-06 15:49:21
1	56	104	2007-04-06 18:26:13
2	56	104	2007-04-13 11:27:52
3	56	104	2007-04-13 14:35:52
4	128	105	2007-03-27 18:39:45
5	217	256	2007-03-29 14:55:57

The best I could come up with is:

for i in range(1, len(df)):
    if df['time'].iloc[i] <= df['time'].iloc[i-1] + pd.Timedelta(minutes=5):
        df = df.drop(i)
        df = df.reset_index(drop=True)
    else:
        continue

but: 1. It raises an "indexer is out-of-bounds" error. 2. It doesn't "roll". 3. It distinguishes between an ID being in the id1 column and in the id2 column.

Thank you in advance for your help!

CodePudding user response：

m = df.groupby(['id1', 'id2'], as_index=False)['time'].transform(lambda g: g.diff()).le(pd.Timedelta(minutes=5))['time']
df = df.loc[~(m | m.shift(-1))]

Step-by-step explanation:

You can group by id1 and id2 columns and diff the time column

diff = df.groupby(['id1', 'id2'], as_index=False)['time'].transform(lambda g: g.diff())

print(diff)

              time
0              NaT
1              NaT
2  6 days 16:55:39
3  0 days 00:06:49
4  0 days 00:00:11
5  0 days 00:05:33
6  0 days 03:01:27
7              NaT
8  0 days 00:01:23
9              NaT
10             NaT
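The same per-group difference can probably be obtained without the lambda, since a grouped Series exposes diff directly. A minimal sketch on a hypothetical three-row frame (the name toy and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: two rows for the pair (56, 104), one for (56, 99).
toy = pd.DataFrame({
    "id1": [56, 56, 56],
    "id2": [104, 99, 104],
    "time": pd.to_datetime([
        "2007-04-13 11:28:41",
        "2007-04-06 15:49:21",
        "2007-04-13 11:28:52",
    ]),
})

# Difference to the previous row of the same (id1, id2) group.
# The first row of every group has no predecessor and becomes NaT.
toy_diff = toy.groupby(["id1", "id2"])["time"].diff()
print(toy_diff)
```

Here rows 0 and 1 are each the first of their group (NaT), and row 2 is 11 seconds after row 0.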

Then compare it with 5 minutes

m = diff.le(pd.Timedelta(minutes=5))

print(m)

     time
0   False
1   False
2   False
3   False
4    True
5   False
6   False
7   False
8    True
9   False
10  False

Then also mark the row before each True row as True

m = m | m.shift(-1)

print(m)

     time
0   False
1   False
2   False
3    True
4    True
5   False
6   False
7    True
8    True
9   False
10  False
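To see what the shift does in isolation, here is a small sketch on a hypothetical boolean mask (the names close_to_prev and flagged are assumptions for illustration). It passes fill_value=False so that the last position does not end up as a missing value:

```python
import pandas as pd

# Hypothetical mask: only the third row is within the threshold of its predecessor.
close_to_prev = pd.Series([False, False, True, False])

# shift(-1) moves each flag up by one row, so OR-ing it with the original
# also flags the row immediately before every flagged row.
flagged = close_to_prev | close_to_prev.shift(-1, fill_value=False)
print(flagged.tolist())  # [False, True, True, False]
```

Because the shift is applied to the whole column rather than per group, the extra flag lands on whatever row comes before in the frame, which belongs to the same pair only when each pair's rows are contiguous.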

Finally, use boolean indexing to select the False rows

df = df.loc[~m['time']]

print(df)

    id1  id2                time
0    56   99 2007-04-06 15:49:21
1    56  104 2007-04-06 18:26:13
2    56  104 2007-04-13 11:21:52
5    56  104 2007-04-13 11:34:25
6    56  104 2007-04-13 14:35:52
9   128  105 2007-03-27 18:39:45
10  217  256 2007-03-29 14:55:57

CodePudding user response：

I think there is still a mistake in your expected output. Entry with index 5 is also valid, so index 6 is not.

Here is a solution using groupby and diff:

import numpy as np
import pandas as pd

MIN_LIMIT = 5  # minutes

def remove_duplicated_entries(df):
    time_diff = df['time'].diff()
    # Compare Timedelta values directly: .dt.seconds ignores the days component.
    return df[(time_diff > pd.Timedelta(minutes=MIN_LIMIT)) | (time_diff.isna())]

# Created to sort ids to ignore the order in the groupby. You can reuse id1 and id2 instead if you don't care.
df[['id_min', 'id_max']] = np.sort(df[['id1', 'id2']], axis=1)
clean_df = df.sort_values('time').groupby(['id_min', 'id_max'], as_index=False).apply(remove_duplicated_entries).reset_index(drop=True).drop(columns=['id_min', 'id_max'])

Which results in:

   id1  id2                time
0   56   99 2007-04-06 15:49:21
1   56  104 2007-04-06 18:26:13
2   56  104 2007-04-13 11:27:52
3   56  104 2007-04-13 11:34:25
4   56  104 2007-04-13 14:35:52
5  128  105 2007-03-27 18:39:45
6  217  256 2007-03-29 14:55:57

Please note that using diff assumes the time column is sorted in ascending order (hence the sort_values to make sure it is the case).
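A short sketch of why the ordering matters, using three hypothetical timestamps that are deliberately out of order (the variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical timestamps, deliberately out of order.
times = pd.Series(pd.to_datetime([
    "2007-04-13 11:34:25",
    "2007-04-13 11:27:52",
    "2007-04-13 11:28:41",
]))

unsorted_diff = times.diff()              # second entry is negative
sorted_diff = times.sort_values().diff()  # all gaps are non-negative

print(unsorted_diff)
print(sorted_diff)
```

A negative gap compares as smaller than any positive threshold, so an out-of-order row would be treated as too close to its predecessor even when the two are far apart in time.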

EDIT after discussion (see comments): The following test set breaks the answer from @Ynjxsjmh too (note the new items at the end):


df = pd.DataFrame({'id1':[56, 56, 56, 56, 56, 56, 56, 128, 217, 104, 104], 
                   'id2':[99, 104, 104, 104, 104, 104, 104, 105, 256, 56, 56],
                   'time': pd.to_datetime(['2007-04-06 15:49:21', '2007-04-06 18:26:13', '2007-04-13 11:27:52', '2007-04-13 11:28:41',
                            '2007-04-13 11:28:52', '2007-04-13 11:34:25', '2007-04-13 14:35:52', 
                            '2007-03-27 18:39:45', '2007-03-29 14:55:57', '2007-04-13 11:34:35', '2007-04-13 14:36:35'])})

The result with their answer is:

    id1  id2                time
0    56   99 2007-04-06 15:49:21
1    56  104 2007-04-06 18:26:13
5    56  104 2007-04-13 11:34:25
6    56  104 2007-04-13 14:35:52
7   128  105 2007-03-27 18:39:45
8   217  256 2007-03-29 14:55:57
9   104   56 2007-04-13 11:34:35
10  104   56 2007-04-13 14:36:35
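For comparison, here is a sketch of one possible reading of the requirement applied to the same test set: treat (56, 104) and (104, 56) as one pair, order the rows by time, and drop every row that is within five minutes of the previous row of its pair. The helper name drop_close_rows and its limit parameter are assumptions for illustration, not part of either answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id1": [56, 56, 56, 56, 56, 56, 56, 128, 217, 104, 104],
    "id2": [99, 104, 104, 104, 104, 104, 104, 105, 256, 56, 56],
    "time": pd.to_datetime([
        "2007-04-06 15:49:21", "2007-04-06 18:26:13", "2007-04-13 11:27:52",
        "2007-04-13 11:28:41", "2007-04-13 11:28:52", "2007-04-13 11:34:25",
        "2007-04-13 14:35:52", "2007-03-27 18:39:45", "2007-03-29 14:55:57",
        "2007-04-13 11:34:35", "2007-04-13 14:36:35",
    ]),
})


def drop_close_rows(frame, limit=pd.Timedelta(minutes=5)):
    """Hypothetical helper: keep rows more than `limit` after the previous
    row of the same unordered (id1, id2) pair."""
    # Sort each pair so that (104, 56) and (56, 104) share one key.
    ids = np.sort(frame[["id1", "id2"]].to_numpy(), axis=1)
    keyed = frame.assign(id_min=ids[:, 0], id_max=ids[:, 1]).sort_values("time")
    # Gap to the previous row of the same pair, in time order.
    gap = keyed.groupby(["id_min", "id_max"])["time"].diff()
    # Comparing Timedelta values directly keeps the days component in play.
    keep = gap.isna() | (gap > limit)
    # Restore the original row order and drop the helper columns.
    return keyed[keep].sort_index().drop(columns=["id_min", "id_max"])


result = drop_close_rows(df)
print(result)
```

On this test set the sketch keeps the rows with index 0, 1, 2, 5, 6, 7 and 8. Rows 9 and 10 are dropped because each follows a (56, 104) row by less than a minute.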

CodePudding user response：

If the time column is not in any particular order (ascending or descending), use the code from @Ynjxsjmh as is, and modify the code from @user2246849 by removing sort_values('time').

Thank you to both, for sharing these different approaches.