I am trying to figure out a way to find the first and last timestamp for each asset within a dataframe on each day. For example, I have this data frame:
import pandas as pd
data = {
'Date':['2022-01-01','2022-01-01','2022-01-01','2022-01-01','2022-01-01','2022-01-01',
'2022-01-01' ,'2022-01-02','2022-01-02','2022-01-02','2022-01-02','2022-01-02','2022-01-02',
'2022-01-02','2022-01-02','2022-01-03','2022-01-03','2022-01-03','2022-01-03','2022-01-03',
'2022-01-03','2022-01-03','2022-01-03'],
'Time':['12:01','12:05','14:07','11:01','13:06','17:12','15:15',
'9:02','8:06','14:06','19:19','10:00','13:01','17:00','10:15',
'8:00','9:00','7:15','16:04','15:02','17:10','12:06','15:00'],
'Asset':[111,111,111,222,222,222,222,
111,111,111,111,111,222,222,222,
333,333,111,111,111,111,333,111]
}
df = pd.DataFrame(data)
df
Which looks like:
Date Time Asset
0 2022-01-01 12:01 111
1 2022-01-01 12:05 111
2 2022-01-01 14:07 111
3 2022-01-01 11:01 222
4 2022-01-01 13:06 222
5 2022-01-01 17:12 222
6 2022-01-01 15:15 222
7 2022-01-02 9:02 111
8 2022-01-02 8:06 111
9 2022-01-02 14:06 111
10 2022-01-02 19:19 111
11 2022-01-02 10:00 111
12 2022-01-02 13:01 222
13 2022-01-02 17:00 222
14 2022-01-02 10:15 222
15 2022-01-03 8:00 333
16 2022-01-03 9:00 333
17 2022-01-03 7:15 111
18 2022-01-03 16:04 111
19 2022-01-03 15:02 111
20 2022-01-03 17:10 111
21 2022-01-03 12:06 333
22 2022-01-03 15:00 111
I would like to group this data by day and asset and drop everything except the first (earliest) and last (latest) timestamp for each asset within each day. My ideal outcome would look like this:
data1 = {
'Date':['2022-01-01','2022-01-01','2022-01-01','2022-01-01',
'2022-01-02','2022-01-02','2022-01-02','2022-01-02',
'2022-01-03','2022-01-03','2022-01-03','2022-01-03',],
'Time':['12:01','14:07','11:01','17:12',
'8:06','19:19','10:15','17:00',
'8:00','12:06','7:15','17:10'],
'Asset':[111,111,222,222,
111,111,222,222,
333,333,111,111]
}
df1 = pd.DataFrame(data1)
df1
Looking like:
Date Time Asset
0 2022-01-01 12:01 111
1 2022-01-01 14:07 111
2 2022-01-01 11:01 222
3 2022-01-01 17:12 222
4 2022-01-02 8:06 111
5 2022-01-02 19:19 111
6 2022-01-02 10:15 222
7 2022-01-02 17:00 222
8 2022-01-03 8:00 333
9 2022-01-03 12:06 333
10 2022-01-03 7:15 111
11 2022-01-03 17:10 111
Ideally, I would like to solve this problem in Python, however if there is an easier solution in R or SQL I am able to use those. Any help would be appreciated! Thanks in advance!
CodePudding user response:
import pandas as pd
data = {
'Date':['2022-01-01','2022-01-01','2022-01-01','2022-01-01','2022-01-01','2022-01-01',
'2022-01-01' ,'2022-01-02','2022-01-02','2022-01-02','2022-01-02','2022-01-02','2022-01-02',
'2022-01-02','2022-01-02','2022-01-03','2022-01-03','2022-01-03','2022-01-03','2022-01-03',
'2022-01-03','2022-01-03','2022-01-03'],
'Time':['12:01','12:05','14:07','11:01','13:06','17:12','15:15',
'9:02','8:06','14:06','19:19','10:00','13:01','17:00','10:15',
'8:00','9:00','7:15','16:04','15:02','17:10','12:06','15:00'],
'Asset':[111,111,111,222,222,222,222,
111,111,111,111,111,222,222,222,
333,333,111,111,111,111,333,111]
}
df = pd.DataFrame(data)
# first()/last() return the first and last row of each (Date, Asset) group
# in the order the rows appear in the frame
df_f = df.groupby(by=['Date', 'Asset']).first().reset_index()
df_l = df.groupby(by=['Date', 'Asset']).last().reset_index()
# stack the two results and restore the original column order
df_fl = pd.concat([df_f, df_l])[['Date', 'Time', 'Asset']]
df_fl = df_fl.sort_values(by=['Date', 'Asset', 'Time']).reset_index(drop=True)
print(df_fl)
prints
          Date   Time  Asset
0   2022-01-01  12:01    111
1   2022-01-01  14:07    111
2   2022-01-01  11:01    222
3   2022-01-01  15:15    222
4   2022-01-02  10:00    111
5   2022-01-02   9:02    111
6   2022-01-02  10:15    222
7   2022-01-02  13:01    222
8   2022-01-03  15:00    111
9   2022-01-03   7:15    111
10  2022-01-03  12:06    333
11  2022-01-03   8:00    333
Note: first() and last() simply take the first and last row of each group in whatever order the rows appear, and the final sort compares Time as strings (so '9:02' sorts after '10:00'). Times and dates will only be ordered properly if they are real date / time values and not strings, and if the frame is sorted by time within each day; that is why some of the rows above differ from the expected output.
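As an illustration, here is a minimal sketch of that fix (assuming the same df as above; Time_dt is just a throwaway helper column used for sorting): parse Time into real datetimes, sort chronologically, and only then take first()/last().
# parse the time strings so that '9:02' orders before '10:00'
df['Time_dt'] = pd.to_datetime(df['Time'], format='%H:%M')
# sort chronologically within each Date/Asset group before taking first()/last()
df = df.sort_values(['Date', 'Asset', 'Time_dt'])
df_f = df.groupby(['Date', 'Asset'], as_index=False).first()
df_l = df.groupby(['Date', 'Asset'], as_index=False).last()
df_fl = (pd.concat([df_f, df_l])
           .sort_values(['Date', 'Asset', 'Time_dt'])
           .reset_index(drop=True))
print(df_fl[['Date', 'Time', 'Asset']])
This should reproduce the expected output from the question, up to the order of the asset groups within a day.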
CodePudding user response:
Here is one way to do it: use groupby to get the min and max of the time for each Date and Asset, then stack and reset the index.
# convert the time strings to a two-digit hour and minute ('9:02' -> '09:02'),
# so that comparing them as strings gives the right min/max
# (pd.to_datetime is used here because .astype('datetime64') without a unit
#  is rejected by newer pandas versions)
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M').dt.strftime('%H:%M')

(df.groupby(['Date', 'Asset'])['Time']
   .agg(['min', 'max'])
   .stack()
   .reset_index()
   .rename(columns={0: 'Time'})
   .drop(columns='level_2')
   .sort_values(['Date', 'Asset', 'Time']))
          Date  Asset   Time
0   2022-01-01    111  12:01
1   2022-01-01    111  14:07
2   2022-01-01    222  11:01
3   2022-01-01    222  17:12
4   2022-01-02    111  08:06
5   2022-01-02    111  19:19
6   2022-01-02    222  10:15
7   2022-01-02    222  17:00
8   2022-01-03    111  07:15
9   2022-01-03    111  17:10
10  2022-01-03    333  08:00
11  2022-01-03    333  12:06
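Another way to write essentially the same thing, if you want the Date / Time / Asset column layout from the question, is named aggregation plus melt. This is only a sketch: first_time / last_time are throwaway intermediate column names, and it assumes df with Time already zero-padded as above.
out = (df.groupby(['Date', 'Asset'], as_index=False)
         .agg(first_time=('Time', 'min'), last_time=('Time', 'max'))  # earliest / latest per day and asset
         .melt(id_vars=['Date', 'Asset'], value_name='Time')          # one row per first/last value
         .drop(columns='variable')
         .sort_values(['Date', 'Asset', 'Time'])
         .reset_index(drop=True))
print(out[['Date', 'Time', 'Asset']])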