How to find first and last time for every day for each value


I am trying to figure out a way to find the first and last timestamp for each asset, for each day, within a DataFrame. For example, I have this DataFrame:

import pandas as pd
data = {
    'Date':['2022-01-01','2022-01-01','2022-01-01','2022-01-01','2022-01-01','2022-01-01',
           '2022-01-01' ,'2022-01-02','2022-01-02','2022-01-02','2022-01-02','2022-01-02','2022-01-02',
            '2022-01-02','2022-01-02','2022-01-03','2022-01-03','2022-01-03','2022-01-03','2022-01-03',
            '2022-01-03','2022-01-03','2022-01-03'],
    'Time':['12:01','12:05','14:07','11:01','13:06','17:12','15:15',
           '9:02','8:06','14:06','19:19','10:00','13:01','17:00','10:15',
           '8:00','9:00','7:15','16:04','15:02','17:10','12:06','15:00'],
    'Asset':[111,111,111,222,222,222,222,
             111,111,111,111,111,222,222,222,
            333,333,111,111,111,111,333,111]
}
df = pd.DataFrame(data)
df

Which looks like:

    Date    Time    Asset
0   2022-01-01  12:01   111
1   2022-01-01  12:05   111
2   2022-01-01  14:07   111
3   2022-01-01  11:01   222
4   2022-01-01  13:06   222
5   2022-01-01  17:12   222
6   2022-01-01  15:15   222
7   2022-01-02  9:02    111
8   2022-01-02  8:06    111
9   2022-01-02  14:06   111
10  2022-01-02  19:19   111
11  2022-01-02  10:00   111
12  2022-01-02  13:01   222
13  2022-01-02  17:00   222
14  2022-01-02  10:15   222
15  2022-01-03  8:00    333
16  2022-01-03  9:00    333
17  2022-01-03  7:15    111
18  2022-01-03  16:04   111
19  2022-01-03  15:02   111
20  2022-01-03  17:10   111
21  2022-01-03  12:06   333
22  2022-01-03  15:00   111

I would like to group this data by day and, for each asset within each day, drop everything except the rows with the first and last timestamp. My ideal outcome would look like this:

data1 = {
    'Date':['2022-01-01','2022-01-01','2022-01-01','2022-01-01',
           '2022-01-02','2022-01-02','2022-01-02','2022-01-02',
           '2022-01-03','2022-01-03','2022-01-03','2022-01-03',],
    'Time':['12:01','14:07','11:01','17:12',
           '8:06','19:19','10:15','17:00',
           '8:00','12:06','7:15','17:10'],
    'Asset':[111,111,222,222,
            111,111,222,222,
            333,333,111,111]
}
df1 = pd.DataFrame(data1)
df1

Looking like:

Date    Time    Asset
0   2022-01-01  12:01   111
1   2022-01-01  14:07   111
2   2022-01-01  11:01   222
3   2022-01-01  17:12   222
4   2022-01-02  8:06    111
5   2022-01-02  19:19   111
6   2022-01-02  10:15   222
7   2022-01-02  17:00   222
8   2022-01-03  8:00    333
9   2022-01-03  12:06   333
10  2022-01-03  7:15    111
11  2022-01-03  17:10   111

Ideally, I would like to solve this problem in Python; however, if there is an easier solution in R or SQL, I am able to use those. Any help would be appreciated! Thanks in advance!

CodePudding user response:

import pandas as pd
data = {
    'Date':['2022-01-01','2022-01-01','2022-01-01','2022-01-01','2022-01-01','2022-01-01',
           '2022-01-01' ,'2022-01-02','2022-01-02','2022-01-02','2022-01-02','2022-01-02','2022-01-02',
            '2022-01-02','2022-01-02','2022-01-03','2022-01-03','2022-01-03','2022-01-03','2022-01-03',
            '2022-01-03','2022-01-03','2022-01-03'],
    'Time':['12:01','12:05','14:07','11:01','13:06','17:12','15:15',
           '9:02','8:06','14:06','19:19','10:00','13:01','17:00','10:15',
           '8:00','9:00','7:15','16:04','15:02','17:10','12:06','15:00'],
    'Asset':[111,111,111,222,222,222,222,
             111,111,111,111,111,222,222,222,
            333,333,111,111,111,111,333,111]
}
df = pd.DataFrame(data)
# first and last row of each (Date, Asset) group, in the order the rows appear
df_f = df.groupby(by=['Date', 'Asset']).first().reset_index()
df_l = df.groupby(by=['Date', 'Asset']).last().reset_index()
df_fl = pd.concat([df_f, df_l])[['Date', 'Time', 'Asset']]
df_fl = df_fl.sort_values(by=['Date', 'Asset', 'Time']).reset_index(drop=True)
print(df_fl)

prints

          Date   Time  Asset
0   2022-01-01  12:01    111
1   2022-01-01  14:07    111
2   2022-01-01  11:01    222
3   2022-01-01  15:15    222
4   2022-01-02  10:00    111
5   2022-01-02   9:02    111
6   2022-01-02  10:15    222
7   2022-01-02  13:01    222
8   2022-01-03  15:00    111
9   2022-01-03   7:15    111
10  2022-01-03  12:06    333
11  2022-01-03   8:00    333

Note: the times and dates will only sort correctly (and therefore give the true first and last rows) if they are real date/time values rather than strings; one way to handle that is sketched below.
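For reference, a minimal sketch of that conversion (the helper names tmp and ts are arbitrary choices, not part of the answer above): build a real timestamp, sort on it, and only then take the first and last row of each group.

tmp = df.copy()
# build a proper timestamp so the sort is chronological, not lexicographic
tmp['ts'] = pd.to_datetime(tmp['Date'] + ' ' + tmp['Time'], format='%Y-%m-%d %H:%M')
tmp = tmp.sort_values('ts')
first_last = (pd.concat([tmp.groupby(['Date', 'Asset'], as_index=False).first(),
                         tmp.groupby(['Date', 'Asset'], as_index=False).last()])
              .sort_values(['Date', 'Asset', 'ts'])
              [['Date', 'Time', 'Asset']]
              .reset_index(drop=True))
print(first_last)

Sorting on the helper column first means first()/last() pick the chronologically earliest and latest rows rather than whichever rows happen to appear first and last in the frame.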

CodePudding user response:

Here is one way to do it:

Use groupby to get the max and min of the time for each Date and Asset, then stack and reset the index.

# convert the Time strings to a two-digit hour / two-digit minute format,
# so that the string min/max matches the chronological min/max
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M').dt.strftime('%H:%M')

df.groupby(['Date', 'Asset'])['Time'].agg(['max', 'min']).stack().reset_index().rename(columns={0: 'Time'}).drop(columns='level_2').sort_values(['Date', 'Asset', 'Time'])
          Date  Asset   Time
1   2022-01-01  111     12:01
0   2022-01-01  111     14:07
3   2022-01-01  222     11:01
2   2022-01-01  222     17:12
5   2022-01-02  111     08:06
4   2022-01-02  111     19:19
7   2022-01-02  222     10:15
6   2022-01-02  222     17:00
9   2022-01-03  111     07:15
8   2022-01-03  111     17:10
11  2022-01-03  333     08:00
10  2022-01-03  333     12:06
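If you would rather avoid the stack/level_2 handling, roughly the same result can be reached with named aggregation plus melt (a sketch assuming Time has already been reformatted to two-digit hours as above; the names out, first and last are just illustrative choices):

out = (df.groupby(['Date', 'Asset'], as_index=False)
         .agg(first=('Time', 'min'), last=('Time', 'max'))    # earliest and latest time per group
         .melt(id_vars=['Date', 'Asset'], value_name='Time')  # back to one time per row
         .drop(columns='variable')
         .sort_values(['Date', 'Asset', 'Time'])
         .reset_index(drop=True))
print(out)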