I have a pandas dataframe which looks like this:
Concentr 1 Concentr 2 Time
0 25.4 0.48 00:01:00
1 26.5 0.49 00:02:00
2 25.2 0.52 00:03:00
3 23.7 0.49 00:04:00
4 23.8 0.55 00:05:00
5 24.6 0.53 00:06:00
6 26.3 0.57 00:07:00
7 27.1 0.59 00:08:00
8 28.8 0.56 00:09:00
9 23.9 0.54 00:10:00
10 25.6 0.49 00:11:00
11 27.5 0.56 00:12:00
12 26.3 0.55 00:13:00
13 25.3 0.54 00:14:00
and I want to keep the maximum value of Concentr 1 in every 5-minute interval, along with the time it occurred and the value of Concentr 2 at that time. So, for the previous example I would like to have:
Concentr 1 Concentr 2 Time
0 26.5 0.49 00:02:00
1 28.8 0.56 00:09:00
2 27.5 0.56 00:12:00
My current approach would be i) to create an auxiliary variable with an ID for each 5-minute interval, e.g. 00:00 to 00:05 would be interval 1, 00:05 to 00:10 would be interval 2, etc., ii) to use the interval variable in a groupby to get the max Concentr 1 per interval, and iii) to merge back to the initial df using both the interval variable and Concentr 1, thus identifying the corresponding time.
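In code, a rough sketch of that three-step approach would look something like this. The `interval` column name and the `peaks` helper frame are just placeholders I made up, and it assumes 'Time' holds strings in HH:MM:SS form:

```python
import pandas as pd

df = pd.DataFrame({
    "Concentr 1": [25.4, 26.5, 25.2, 23.7, 23.8, 24.6, 26.3,
                   27.1, 28.8, 23.9, 25.6, 27.5, 26.3, 25.3],
    "Concentr 2": [0.48, 0.49, 0.52, 0.49, 0.55, 0.53, 0.57,
                   0.59, 0.56, 0.54, 0.49, 0.56, 0.55, 0.54],
    "Time": [f"00:{m:02d}:00" for m in range(1, 15)],
})

# i) auxiliary interval ID: minutes 0-4 -> 1, 5-9 -> 2, 10-14 -> 3, ...
times = pd.to_datetime(df["Time"], format="%H:%M:%S")
df["interval"] = (times.dt.hour * 60 + times.dt.minute) // 5 + 1

# ii) maximum of Concentr 1 per interval
peaks = df.groupby("interval", as_index=False)["Concentr 1"].max()

# iii) merge back on (interval, Concentr 1) to recover the matching rows
result = df.merge(peaks, on=["interval", "Concentr 1"])
print(result)
```

One thing I noticed with the merge step: if the maximum occurs more than once within an interval, every tied row is returned.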
I would like to ask if there is a better / more efficient / more elegant way to do it.
Thank you very much for any help.
CodePudding user response:
You can do a regular resample / groupby and use the idxmax method to get the index label of the desired row in each group. Then use that to index your original data:
>>> df.loc[df.resample('5min', on='Time')['Concentr 1'].idxmax()]
Concentr 1 Concentr 2 Time
1 26.5 0.49 2021-10-09 00:02:00
8 28.8 0.56 2021-10-09 00:09:00
11 27.5 0.56 2021-10-09 00:12:00
This assumes your 'Time' column is datetime-like, which I did with pd.to_datetime. You can convert the time column back to strings with strftime. So in full:
import pandas as pd
df['Time'] = pd.to_datetime(df['Time'])
# idxmax gives the row label of the maximum in each 5-minute bin
result = df.loc[df.resample('5min', on='Time')['Concentr 1'].idxmax()]
result['Time'] = result['Time'].dt.strftime('%H:%M:%S')
Giving:
Concentr 1 Concentr 2 Time
1 26.5 0.49 00:02:00
8 28.8 0.56 00:09:00
11 27.5 0.56 00:12:00
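For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the question's sample data first. It uses the groupby spelling of the same idea, pd.Grouper(key='Time', freq='5min'). The fixed format='%H:%M:%S' for parsing and the .copy() before editing the result are my own assumptions, not something the approach requires:

```python
import pandas as pd

df = pd.DataFrame({
    "Concentr 1": [25.4, 26.5, 25.2, 23.7, 23.8, 24.6, 26.3,
                   27.1, 28.8, 23.9, 25.6, 27.5, 26.3, 25.3],
    "Concentr 2": [0.48, 0.49, 0.52, 0.49, 0.55, 0.53, 0.57,
                   0.59, 0.56, 0.54, 0.49, 0.56, 0.55, 0.54],
    "Time": [f"00:{m:02d}:00" for m in range(1, 15)],
})

# Parse the strings so the column can be binned by time
df["Time"] = pd.to_datetime(df["Time"], format="%H:%M:%S")

# Row label of the largest Concentr 1 in each 5-minute bin
# (bins are [00:00, 00:05), [00:05, 00:10), ...)
idx = df.groupby(pd.Grouper(key="Time", freq="5min"))["Concentr 1"].idxmax()

# Pull the full rows, then turn Time back into HH:MM:SS strings
result = df.loc[idx].copy()
result["Time"] = result["Time"].dt.strftime("%H:%M:%S")
print(result)
```

The original row labels (1, 8 and 11) are kept, so you can still trace each maximum back to the source frame.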
CodePudding user response:
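# 'Time' must already be datetime-like (e.g. via pd.to_datetime)
df = df.set_index('Time')
# idxmax returns the Time label of the maximum in each 5-minute bin
idx = df.resample('5min')['Concentr 1'].idxmax()
df = df.loc[idx]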
Then you would probably need to call reset_index() if you do not want Time to be your index.
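A self-contained sketch of this index-based variant, including the reset_index() step, could look like the following. It rebuilds the question's sample data, groups on the index with pd.Grouper(freq='5min'), and the names `indexed`, `peak_times` and `peaks` are placeholders of mine:

```python
import pandas as pd

df = pd.DataFrame({
    "Concentr 1": [25.4, 26.5, 25.2, 23.7, 23.8, 24.6, 26.3,
                   27.1, 28.8, 23.9, 25.6, 27.5, 26.3, 25.3],
    "Concentr 2": [0.48, 0.49, 0.52, 0.49, 0.55, 0.53, 0.57,
                   0.59, 0.56, 0.54, 0.49, 0.56, 0.55, 0.54],
    "Time": [f"00:{m:02d}:00" for m in range(1, 15)],
})
df["Time"] = pd.to_datetime(df["Time"], format="%H:%M:%S")

# Move Time into the index so the grouper can bin on it directly
indexed = df.set_index("Time")

# For each 5-minute bin, the timestamp at which Concentr 1 peaks
peak_times = indexed.groupby(pd.Grouper(freq="5min"))["Concentr 1"].idxmax()

# Select those rows by label and turn Time back into a regular column
peaks = indexed.loc[peak_times].reset_index()
print(peaks)
```

Because the lookup is by timestamp, this relies on the timestamps being unique; duplicated times would return every matching row.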
CodePudding user response:
You can also use this:
Group every n=5 rows and filter the original df by the index of the maximum "Concentr 1" in each group. This assumes a default integer index and exactly one row per minute:
df = df[df.index.isin(df.groupby(df.index // 5)["Concentr 1"].idxmax())]
print(df)
Output:
Concentr 1 Concentr 2 Time
1 26.5 0.49 00:02:00
8 28.8 0.56 00:09:00
11 27.5 0.56 00:12:00
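Keep in mind that grouping by row count and grouping by clock time do not draw the same bin edges. The sketch below compares the two on the sample data; the time-based key built with dt.floor('5min') is my own addition for the comparison, and it needs 'Time' parsed as datetimes:

```python
import pandas as pd

df = pd.DataFrame({
    "Concentr 1": [25.4, 26.5, 25.2, 23.7, 23.8, 24.6, 26.3,
                   27.1, 28.8, 23.9, 25.6, 27.5, 26.3, 25.3],
    "Concentr 2": [0.48, 0.49, 0.52, 0.49, 0.55, 0.53, 0.57,
                   0.59, 0.56, 0.54, 0.49, 0.56, 0.55, 0.54],
    "Time": [f"00:{m:02d}:00" for m in range(1, 15)],
})
df["Time"] = pd.to_datetime(df["Time"], format="%H:%M:%S")

# Key 1: every five consecutive rows (rows 0-4, 5-9, 10-13)
row_bins = df.index.to_numpy() // 5
# Key 2: clock-aligned 5-minute bins (00:00-00:04, 00:05-00:09, ...)
time_bins = df["Time"].dt.floor("5min")

row_sizes = df.groupby(row_bins).size().tolist()
time_sizes = df.groupby(time_bins).size().tolist()
print(row_sizes, time_sizes)   # [5, 5, 4] [4, 5, 5]

# Labels of the per-group maxima under each key
rows_pick = df.groupby(row_bins)["Concentr 1"].idxmax().tolist()
time_pick = df.groupby(time_bins)["Concentr 1"].idxmax().tolist()
print(rows_pick, time_pick)    # [1, 8, 11] [1, 8, 11]
```

On this data both keys happen to select the same rows, but with missing or irregularly spaced readings the row-count groups would drift away from the 5-minute intervals.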