Select sub-df of not more than a time threshold for each id

I have a time series dataframe where the logs of a particular id may run from several minutes to hours. For visualisation purposes, I have to slice the logs of each id to the first 15 minutes of data.

Take the example of the df below:

import pandas as pd

df = pd.DataFrame(
    {'id': [15, 15, 15, 15, 15, 64, 64, 64, 64, 64],
     'timestamp': ['2016-04-01 00:05:00', '2016-04-01 00:10:20', '2016-04-01 00:13:01',
                   '2016-04-01 00:14:00', '2016-04-01 00:16:00', '2016-04-01 21:04:59',
                   '2016-04-01 21:13:05', '2016-04-01 21:20:00', '2016-04-01 21:25:25',
                   '2016-04-01 21:59:59']}
)

df

    id      timestamp
0   15  2016-04-01 00:05:00
1   15  2016-04-01 00:10:20
2   15  2016-04-01 00:13:01
3   15  2016-04-01 00:14:00
4   15  2016-04-01 00:16:00
5   64  2016-04-01 21:04:59
6   64  2016-04-01 21:13:05
7   64  2016-04-01 21:20:00
8   64  2016-04-01 21:25:25
9   64  2016-04-01 21:59:59

By limiting the data of each id to the first 15 minutes from its start, I want to get the following new_df:

new_df
    id     timestamp
0   15  2016-04-01 00:05:00
1   15  2016-04-01 00:10:20
2   15  2016-04-01 00:13:01
3   15  2016-04-01 00:14:00
4   64  2016-04-01 21:04:59
5   64  2016-04-01 21:13:05

How do I achieve this?

CodePudding user response：

Your new_df doesn't match the output that your description would produce. You can use the code below and check the result against your desired output:

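# Parse the strings into datetimes so they can be subtracted
df['timestamp'] = pd.to_datetime(df['timestamp'])
# Elapsed time of each row since the earliest timestamp of its id
d = df.groupby('id').transform(lambda x: (x - x.min()))
# Convert the elapsed time to minutes
d['timestamp'] = d['timestamp'].dt.total_seconds().div(60)
# Keep only the rows within the first 15 minutes of each id
new_df = df[d['timestamp'] <= 15]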

print(new_df)

   id           timestamp
0  15 2016-04-01 00:05:00
1  15 2016-04-01 00:10:20
2  15 2016-04-01 00:13:01
3  15 2016-04-01 00:14:00
4  15 2016-04-01 00:16:00
5  64 2016-04-01 21:04:59
6  64 2016-04-01 21:13:05
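
The same filter can also be sketched without a lambda, by comparing Timedelta values directly instead of converting to minutes. This is only an illustrative variant, not the answerer's code: the names start and within_window are made up here, and the cutoff is assumed to be inclusive (<=), as in the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [15, 15, 15, 15, 15, 64, 64, 64, 64, 64],
    "timestamp": pd.to_datetime([
        "2016-04-01 00:05:00", "2016-04-01 00:10:20", "2016-04-01 00:13:01",
        "2016-04-01 00:14:00", "2016-04-01 00:16:00", "2016-04-01 21:04:59",
        "2016-04-01 21:13:05", "2016-04-01 21:20:00", "2016-04-01 21:25:25",
        "2016-04-01 21:59:59",
    ]),
})

# Earliest timestamp of each id, broadcast back to every row of that id
# ("start" is a hypothetical name used only in this sketch).
start = df.groupby("id")["timestamp"].transform("min")

# True for rows at most 15 minutes after their id's first timestamp.
within_window = (df["timestamp"] - start) <= pd.Timedelta(minutes=15)

new_df = df[within_window]
print(new_df)
```

Because the result is a plain boolean mask over df, the original row index is kept. Changing <= to < would exclude a row that falls exactly on the 15-minute mark.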

CodePudding user response：

First, make sure that the column timestamp is of datetime dtype (using pandas.to_datetime):

df['timestamp'] = pd.to_datetime(df['timestamp'])

Considering that OP wants the timer to start at the first timestamp of a given id, the following will do the work:

new_df = df.groupby('id').apply(lambda x: x[x.timestamp <= x.timestamp.iloc[0] + pd.Timedelta(minutes=15)])

[Out]:
      id           timestamp
id                          
15 0  15 2016-04-01 00:05:00
   1  15 2016-04-01 00:10:20
   2  15 2016-04-01 00:13:01
   3  15 2016-04-01 00:14:00
   4  15 2016-04-01 00:16:00
64 5  64 2016-04-01 21:04:59
   6  64 2016-04-01 21:13:05

If OP wants the timer to start at 0 hours, 0 minutes, and 0 seconds of the first timestamp's day, then the following will do the work:

new_df2 = df.groupby('id').apply(lambda x: x[x.timestamp <= x.timestamp.iloc[0].replace(hour=0, minute=0, second=0) + pd.Timedelta(minutes=15)])

[Out]:
      id           timestamp
id                          
15 0  15 2016-04-01 00:05:00
   1  15 2016-04-01 00:10:20
   2  15 2016-04-01 00:13:01
   3  15 2016-04-01 00:14:00
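
If the two-level index produced by groupby.apply is not wanted, the midnight-anchored cutoff could be computed per row instead. The following is a rough sketch under that assumption, not part of the answer itself: day_start is a hypothetical name, and dt.normalize() is used in place of replace(hour=0, minute=0, second=0) to reset the time of day to midnight.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [15, 15, 15, 15, 15, 64, 64, 64, 64, 64],
    "timestamp": pd.to_datetime([
        "2016-04-01 00:05:00", "2016-04-01 00:10:20", "2016-04-01 00:13:01",
        "2016-04-01 00:14:00", "2016-04-01 00:16:00", "2016-04-01 21:04:59",
        "2016-04-01 21:13:05", "2016-04-01 21:20:00", "2016-04-01 21:25:25",
        "2016-04-01 21:59:59",
    ]),
})

# Midnight of the day on which each id's first row was logged,
# repeated for every row of that id (hypothetical helper name).
day_start = df.groupby("id")["timestamp"].transform("first").dt.normalize()

# Keep rows logged no later than 00:15:00 on that day.
new_df2 = df[df["timestamp"] <= day_start + pd.Timedelta(minutes=15)]
print(new_df2)
```

With the sample data this keeps only rows of id 15, since every log of id 64 is recorded after 21:00.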

Notes:

In both cases, the output differs from the new_df that OP shares in the question.
In the first operation, x.timestamp.iloc[0] selects the first timestamp of each group, which defines the start of the 15-minute window (this assumes the rows of each id are sorted by time). The second operation does the same but, instead of keeping that value, replaces its time of day with 0 hours, 0 minutes, 0 seconds.
.iloc selects rows/columns by integer position. As we want the first row, it is .iloc[0].
pandas.Timedelta represents a duration. In this case it is the 15 minutes added to the start time to get the cutoff.
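
The building blocks described in these notes can be tried on a single Series. This is a small illustration only; the names ts, first, cutoff and midnight are made up for the sketch, and the two timestamps are taken from id 64 in the question.

```python
import pandas as pd

# Two of the timestamps of id 64 from the question.
ts = pd.Series(pd.to_datetime(["2016-04-01 21:04:59", "2016-04-01 21:13:05"]))

# .iloc[0] picks the value at integer position 0, i.e. the first timestamp.
first = ts.iloc[0]

# Adding a Timedelta to a Timestamp gives the end of the 15-minute window.
cutoff = first + pd.Timedelta(minutes=15)

# replace() returns a copy with the given fields overwritten, here midnight.
midnight = first.replace(hour=0, minute=0, second=0)

print(first, cutoff, midnight, sep="\n")
```

Note that replace(hour=0, minute=0, second=0) leaves any sub-second part untouched, so for timestamps with microseconds Timestamp.normalize() may be the safer way to get midnight.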