I am trying to select, for each measurement type, the row with the latest timestamp within each 5-minute interval.
import pandas as pd

data = [
["2017-01-03T10:04:45", "A", "35.79"],
["2017-01-03T10:01:18", "B", "98.78"],
["2017-01-03T10:09:07", "A", "35.01"],
["2017-01-03T10:03:34", "B", "96.49"],
["2017-01-03T10:02:01", "A", "35.82"],
["2017-01-03T10:05:00", "B", "97.17"],
["2017-01-03T10:05:01", "B", "95.08"]
]
df = pd.DataFrame(data, columns=["timestamp", "measurement_type", "measurement_value"])
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['measurement_value'] = df['measurement_value'].astype(float)
My DataFrame df:
timestamp | measurement_type | measurement_value |
---|---|---|
2017-01-03 10:04:45 | A | 35.79 |
2017-01-03 10:01:18 | B | 98.78 |
2017-01-03 10:09:07 | A | 35.01 |
2017-01-03 10:03:34 | B | 96.49 |
2017-01-03 10:02:01 | A | 35.82 |
2017-01-03 10:05:00 | B | 97.17 |
2017-01-03 10:05:01 | B | 95.08 |
The desired output is:
timestamp | measurement_type | measurement_value |
---|---|---|
2017-01-03 10:05:00 | A | 35.79 |
2017-01-03 10:10:00 | A | 35.01 |
2017-01-03 10:05:00 | B | 97.17 |
2017-01-03 10:10:00 | B | 95.08 |
So each timestamp should be rounded up to the end of the 5-minute interval it falls within. If a timestamp is exactly on an interval boundary, it should count toward the interval ending at that boundary. For example, | 2017-01-03 10:05:00 | B | 97.17 | belongs to 10:05:00 and not 10:10:00.
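The rounding rule described above matches what `Series.dt.ceil` does: it rounds up to the next multiple of the frequency and leaves values that are already on a boundary unchanged. A minimal sketch on three of the timestamps from the sample data (the variable names `ts` and `rounded` are only illustrative):

```python
import pandas as pd

# Three timestamps from the sample data: just before, exactly on,
# and just after the 10:05:00 boundary.
ts = pd.Series(pd.to_datetime([
    "2017-01-03T10:04:45",
    "2017-01-03T10:05:00",
    "2017-01-03T10:05:01",
]))

# Round each timestamp up to the end of its 5-minute interval.
rounded = ts.dt.ceil("5min")

print(rounded.dt.strftime("%H:%M:%S").tolist())
# ['10:05:00', '10:05:00', '10:10:00']
```

This only illustrates the interval assignment; it does not yet pick the latest row per interval.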
I found this:
df.groupby(["measurement_type", pd.Grouper(key="timestamp", freq="5min", offset="1sec")])["timestamp"].max()
which outputs:
measurement_type timestamp
SPO2 2017-01-03 10:00:01 2017-01-03 10:05:00
2017-01-03 10:05:01 2017-01-03 10:05:01
TEMP 2017-01-03 10:00:01 2017-01-03 10:04:45
2017-01-03 10:05:01 2017-01-03 10:09:07
Name: timestamp, dtype: datetime64[ns]
But I have the following problems with this:
- The datetime is rounded down, but I need it rounded up (I solved this by just adding 5 minutes to each datetime, but maybe there is a better option?).
- I used offset="1sec" so that 10:05:00 is assigned to the 10:00:00 interval and not the 10:05:00 one (this is correct here because the timestamps are rounded down, but I actually need them rounded up). I think I will just subtract one second from the timestamp again to get rid of the :01.
- The output is a Series in which the measurement_value is lost. Is it possible to keep the measurement_value column with the corresponding value for the chosen timestamp, i.e. to get a DataFrame like the desired output?
I hope I formulated my question in an understandable way.
CodePudding user response:
Create a 5min freq grouper with parameters closed='right' and label='right', then sort the DataFrame on timestamp, groupby measurement_type and the grouper, and aggregate with last to select the row with the latest timestamp in each 5-minute interval:
g = pd.Grouper(key='timestamp', freq='5min', label='right', closed='right')
df.sort_values('timestamp').groupby(['measurement_type', g]).last().reset_index()
Result
measurement_type timestamp measurement_value
0 A 2017-01-03 10:05:00 35.79
1 A 2017-01-03 10:10:00 35.01
2 B 2017-01-03 10:05:00 97.17
3 B 2017-01-03 10:10:00 95.08
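For reference, here is a self-contained sketch that runs the approach above on the sample data from the question. The final column selection, which restores the column order of the desired output, and the variable name `result` are additions for illustration and not part of the original answer:

```python
import pandas as pd

data = [
    ["2017-01-03T10:04:45", "A", "35.79"],
    ["2017-01-03T10:01:18", "B", "98.78"],
    ["2017-01-03T10:09:07", "A", "35.01"],
    ["2017-01-03T10:03:34", "B", "96.49"],
    ["2017-01-03T10:02:01", "A", "35.82"],
    ["2017-01-03T10:05:00", "B", "97.17"],
    ["2017-01-03T10:05:01", "B", "95.08"],
]
df = pd.DataFrame(data, columns=["timestamp", "measurement_type", "measurement_value"])
df["timestamp"] = pd.to_datetime(df["timestamp"])
df["measurement_value"] = df["measurement_value"].astype(float)

# closed='right': bins are (10:00:00, 10:05:00], (10:05:00, 10:10:00], ...
# label='right':  each bin is labelled with its right edge (rounded up).
g = pd.Grouper(key="timestamp", freq="5min", label="right", closed="right")

result = (
    df.sort_values("timestamp")            # so that last() is the latest row
      .groupby(["measurement_type", g])
      .last()
      .reset_index()
      # Illustrative extra step: same column order as the desired output.
      [["timestamp", "measurement_type", "measurement_value"]]
)
print(result)
```

Because the bins are closed on the right, the row at exactly 10:05:00 falls into the interval labelled 10:05:00, so no offset or manual one-second correction is needed.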