Collapse pandas DataFrame based on daily column value

Time:05-15

I have a pandas DataFrame with multiple measurements per day (hourly, for example, though not necessarily), and for each day I want to keep only the row in which a certain column reaches its daily minimum.

One day of my data frame looks something like this:

    DATE                Value               Distance
17  1979-1-2T00:00:00.0 15.5669870447436    34.87
18  1979-1-2T01:00:00.0 81.6306803714536    31.342
19  1979-1-2T02:00:00.0 83.1854759740486    33.264
20  1979-1-2T03:00:00.0 23.8659679630303    32.34
21  1979-1-2T04:00:00.0 63.2755504429306    31.973
22  1979-1-2T05:00:00.0 91.2129044773733    34.091
23  1979-1-2T06:00:00.0 76.493130052689     36.837
24  1979-1-2T07:00:00.0 63.5443183375785    34.383
25  1979-1-2T08:00:00.0 40.9255407683688    35.275
26  1979-1-2T09:00:00.0 54.5583051827551    32.152
27  1979-1-2T10:00:00.0 26.2690011881422    35.104
28  1979-1-2T11:00:00.0 71.3059740399097    37.28
29  1979-1-2T12:00:00.0 54.0111262724049    38.963
30  1979-1-2T13:00:00.0 91.3518048568241    36.696
31  1979-1-2T14:00:00.0 81.7651763485069    34.832
32  1979-1-2T15:00:00.0 90.5695814525067    35.473
33  1979-1-2T16:00:00.0 88.4550315358515    30.998
34  1979-1-2T17:00:00.0 41.6276969038137    32.353
35  1979-1-2T18:00:00.0 79.3818377264749    30.15
36  1979-1-2T19:00:00.0 79.1672568582629    37.07
37  1979-1-2T20:00:00.0 1.48337999844262    28.525
38  1979-1-2T21:00:00.0 87.9110385474789    38.323
39  1979-1-2T22:00:00.0 38.6646421460678    23.251
40  1979-1-2T23:00:00.0 88.4920153764757    31.236

I would like to keep all rows that have the minimum Distance per day, so for the one day shown above only one row would be left (the one with index value 39). I know how to collapse the data frame so that only the Distance column remains. If I first set DATE as the index, I can do that with

df_short = df.groupby(df.index.floor('D'))["Distance"].min()

But I also want the Value column in my final result. How do I keep all columns? It doesn't seem to work if I do

df_short = df.groupby(df.index.floor('D')).min(["Distance"])

This does keep all the columns in the final result, but it seems like the outcome is wrong, so I'm not sure what this does. Maybe this is already posted somewhere, but I have trouble finding it.

CodePudding user response:

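You can use agg (aggregate), which reduces each column separately:

# String names avoid the warning newer pandas versions emit for the builtins min/max.
df_short = df.groupby(df.index.floor('D')).agg({'Distance': 'min', 'Value': 'max'})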
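Note that agg reduces every column on its own, so the numbers in one output row need not come from the same input row. A minimal sketch of that effect, using a small made-up sample (the three rows and their values are assumptions for illustration, not the asker's data):

```python
import pandas as pd

# Hypothetical three-row sample for a single day, indexed by timestamp.
df = pd.DataFrame(
    {"Value": [15.0, 90.0, 38.0], "Distance": [34.87, 30.15, 23.251]},
    index=pd.to_datetime(
        ["1979-01-02 00:00", "1979-01-02 18:00", "1979-01-02 22:00"]
    ),
)

by_day = df.groupby(df.index.floor("D"))

# Each column is reduced independently: Distance comes from the 22:00 row,
# Value from the 18:00 row.
per_column = by_day.agg({"Distance": "min", "Value": "max"})

# A plain .min() behaves the same way: one minimum per column, not one row.
col_mins = by_day.min()

print(per_column)
print(col_mins)
```

This per-column behavior is probably why the .min(...) attempt in the question returned all columns but looked wrong: the values in each output row are mixed from different hours.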

If you want the Value that belongs to the same row as the minimum Distance:

df_short = df.loc[df.groupby(df.index.floor('D'))['Distance'].idxmin(), :]
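The idxmin approach can be tried end to end with a small sketch. The sample below is made up for illustration (two days, two readings each), and it assumes the timestamps in the index are unique, since df.loc would otherwise return every row sharing a selected label:

```python
import pandas as pd

# Hypothetical sample: two days with two readings each.
df = pd.DataFrame(
    {"Value": [15.5, 38.6, 70.1, 12.3], "Distance": [34.87, 23.251, 29.0, 31.5]},
    index=pd.to_datetime(
        [
            "1979-01-02 00:00",
            "1979-01-02 22:00",
            "1979-01-03 05:00",
            "1979-01-03 09:00",
        ]
    ),
)

# For each day, idxmin gives the index label (timestamp) of the smallest Distance.
labels = df.groupby(df.index.floor("D"))["Distance"].idxmin()

# Selecting those labels keeps every column of the matching rows.
df_short = df.loc[labels]
print(df_short)
```

Because whole rows are selected by label, Value and Distance in the result always come from the same measurement.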

CodePudding user response:

Make a datetime Index:

df.DATE = pd.to_datetime(df.DATE) # If not already datetime.
df.set_index('DATE', inplace=True)

Resample by day and look up the index label of each day's minimum Distance:

df.loc[df.resample('D')['Distance'].idxmin()]

Output:

                         Value  Distance
DATE
1979-01-02 22:00:00  38.664642    23.251
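Both idxmin-based answers keep a single row per day, even if several rows share the daily minimum. If tied rows should all be kept, one possible variation (not from the answers above) is to compare each row against its day's minimum via transform. A minimal sketch on made-up data with a deliberate tie:

```python
import pandas as pd

# Hypothetical sample: two readings on 1979-01-02 tie for the minimum Distance.
df = pd.DataFrame(
    {"Value": [10.0, 20.0, 30.0, 40.0], "Distance": [23.251, 23.251, 30.0, 28.0]},
    index=pd.to_datetime(
        [
            "1979-01-02 03:00",
            "1979-01-02 22:00",
            "1979-01-02 23:00",
            "1979-01-03 01:00",
        ]
    ),
)

# Broadcast each day's minimum Distance back onto every row of that day.
daily_min = df.groupby(df.index.floor("D"))["Distance"].transform("min")

# Keep every row whose Distance equals its day's minimum, ties included.
df_ties = df[df["Distance"] == daily_min]
print(df_ties)
```

Here both tied rows of the first day survive, along with the only row of the second day.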