I have a pandas DataFrame with multiple measurements per day (for example hourly measurements, but that is not necessarily the case), but I want to keep only the hour for which a certain column is the daily minimum.
One day in my data frame looks something like this:
DATE Value Distance
17 1979-1-2T00:00:00.0 15.5669870447436 34.87
18 1979-1-2T01:00:00.0 81.6306803714536 31.342
19 1979-1-2T02:00:00.0 83.1854759740486 33.264
20 1979-1-2T03:00:00.0 23.8659679630303 32.34
21 1979-1-2T04:00:00.0 63.2755504429306 31.973
22 1979-1-2T05:00:00.0 91.2129044773733 34.091
23 1979-1-2T06:00:00.0 76.493130052689 36.837
24 1979-1-2T07:00:00.0 63.5443183375785 34.383
25 1979-1-2T08:00:00.0 40.9255407683688 35.275
26 1979-1-2T09:00:00.0 54.5583051827551 32.152
27 1979-1-2T10:00:00.0 26.2690011881422 35.104
28 1979-1-2T11:00:00.0 71.3059740399097 37.28
29 1979-1-2T12:00:00.0 54.0111262724049 38.963
30 1979-1-2T13:00:00.0 91.3518048568241 36.696
31 1979-1-2T14:00:00.0 81.7651763485069 34.832
32 1979-1-2T15:00:00.0 90.5695814525067 35.473
33 1979-1-2T16:00:00.0 88.4550315358515 30.998
34 1979-1-2T17:00:00.0 41.6276969038137 32.353
35 1979-1-2T18:00:00.0 79.3818377264749 30.15
36 1979-1-2T19:00:00.0 79.1672568582629 37.07
37 1979-1-2T20:00:00.0 1.48337999844262 28.525
38 1979-1-2T21:00:00.0 87.9110385474789 38.323
39 1979-1-2T22:00:00.0 38.6646421460678 23.251
40 1979-1-2T23:00:00.0 88.4920153764757 31.236
I would like to keep all rows that have the minimum "distance" per day, so for the one day shown above, one would have only one row left (the one with index value 39). I know how to collapse the data frame so that I only have the Distance column left. I can do that - if I first set the DATE as index - with
df_short = df.groupby(df.index.floor('D'))["Distance"].min()
But I also want the Value column in my final result. How do I keep all columns? It doesn't seem to work if I do
df_short = df.groupby(df.index.floor('D')).min(["Distance"])
This does keep all the columns in the final result, but it seems like the outcome is wrong, so I'm not sure what this does. Maybe this is already posted somewhere, but I have trouble finding it.
CodePudding user response:
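You can use agg to aggregate each column separately. Note that this computes each statistic independently, so the Value here is the daily maximum, not the Value from the row with the minimum Distance:
df_short = df.groupby(df.index.floor('D')).agg({'Distance': 'min', 'Value': 'max'})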
If you want to keep the Value from the same row as the minimum Distance:
df_short = df.loc[df.groupby(df.index.floor('D'))['Distance'].idxmin(), :]
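For illustration, here is a minimal self-contained sketch of the idxmin approach. The sample frame is made up (two days, three measurements each) and is not the asker's data; it also assumes the datetime index has no duplicate timestamps, since df.loc selects by label.

```python
import pandas as pd

# Hypothetical sample: two days, three measurements each.
df = pd.DataFrame(
    {
        "Value": [15.6, 81.6, 83.2, 23.9, 63.3, 91.2],
        "Distance": [34.87, 31.342, 33.264, 32.34, 23.251, 34.091],
    },
    index=pd.to_datetime(
        [
            "1979-01-02 00:00", "1979-01-02 01:00", "1979-01-02 02:00",
            "1979-01-03 00:00", "1979-01-03 01:00", "1979-01-03 02:00",
        ]
    ),
)

# Per day, idxmin returns the index label (timestamp) of the row
# with the smallest Distance.
min_labels = df.groupby(df.index.floor("D"))["Distance"].idxmin()

# Selecting those labels keeps every column of the matching rows.
df_short = df.loc[min_labels]
print(df_short)
```

Because the selection happens by row label, Value and Distance always come from the same measurement, which is what distinguishes this from aggregating each column on its own.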
CodePudding user response:
Make a datetime Index:
df.DATE = pd.to_datetime(df.DATE) # If not already datetime.
df.set_index('DATE', inplace=True)
Resample by day and look up the index label of each day's minimum Distance:
df.loc[df.resample('D')['Distance'].idxmin()]
Output:
                         Value  Distance
DATE
1979-01-02 22:00:00  38.664642    23.251
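One caveat: idxmin returns only the first occurrence per day, so if several rows share the daily minimum, only one of them is kept. If all tied rows should be kept, one possible sketch is to compare each row against its day's minimum via transform. The sample data below is made up for illustration.

```python
import pandas as pd

# Hypothetical sample: the first day has two rows tied for the minimum Distance.
df = pd.DataFrame(
    {
        "Value": [1.0, 2.0, 3.0, 4.0],
        "Distance": [5.0, 5.0, 7.0, 6.0],
    },
    index=pd.to_datetime(
        [
            "1979-01-02 00:00", "1979-01-02 01:00",
            "1979-01-02 02:00", "1979-01-03 00:00",
        ]
    ),
)
df.index.name = "DATE"

# transform("min") broadcasts each day's minimum back onto every row of that day.
daily_min = df.groupby(df.index.floor("D"))["Distance"].transform("min")

# Keep every row whose Distance equals its day's minimum, ties included.
df_ties = df[df["Distance"] == daily_min]
print(df_ties)
```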