I have a dataframe with one column `timestamp` (of type `datetime`) and some other columns, but their contents don't matter. I'm trying to group by 5-minute intervals and count, ignoring the date and only caring about the time of day.
One can generate an example dataframe using this code:
import numpy as np
import pandas as pd

def get_random_dates_df(
    n=10000,
    start=pd.to_datetime('2015-01-01'),
    period_duration_days=5,
    seed=None
):
    if not seed:  # from piR's answer
        np.random.seed(0)
    end = start + pd.Timedelta(period_duration_days, 'd')
    n_seconds = int(period_duration_days * 3600 * 24)
    random_dates = pd.to_timedelta(n_seconds * np.random.rand(n), unit='s') + start
    return pd.DataFrame(data={"timestamp": random_dates}).reset_index()
df = get_random_dates_df()
It would look like this:
| | index | timestamp |
|---|---|---|
| 0 | 0 | 2015-01-03 17:51:27.433696604 |
| 1 | 1 | 2015-01-04 13:49:21.806272885 |
| 2 | 2 | 2015-01-04 00:19:53.778462950 |
| 3 | 3 | 2015-01-03 17:23:09.535054659 |
| 4 | 4 | 2015-01-03 02:50:18.873314407 |
I think I have a working solution but it seems overly complicated:
gpd_df = df.groupby(pd.Grouper(key="timestamp", freq="5min")).agg(
    count=("index", "count")
).reset_index()
gpd_df["time_of_day"] = gpd_df["timestamp"].dt.time
res_df = gpd_df.groupby("time_of_day").sum()
Output:
count
time_of_day
00:00:00 38
00:05:00 39
00:10:00 48
00:15:00 33
00:20:00 27
... ...
23:35:00 34
23:40:00 38
23:45:00 37
23:50:00 41
23:55:00 41
[288 rows x 1 columns]
Is there a better way to solve this?
CodePudding user response:
I'd suggest something like this, to avoid trying to merge the results of two groupbys together:
gpd_df = df.copy()
gpd_df["time_of_day"] = gpd_df["timestamp"].apply(lambda x: x.replace(year=2000, month=1, day=1))
gpd_df = gpd_df.set_index("time_of_day")
res_df = gpd_df.resample("5min").size()
It works by setting the year/month/day to fixed values and applying the built-in resampling function.
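If you want the index to show only the time of day, as in the question's output, a minimal follow-up sketch (converting the dummy-date index back to plain time objects is just one way to finish) would be:

# assumes res_df from the snippet above; strips the dummy date,
# leaving datetime.time values as the index
res_df.index = res_df.index.time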
CodePudding user response:
What about flooring the datetimes to 5min, extracting the time only and using `value_counts`:
out = (df['timestamp']
       .dt.floor('5min')
       .dt.time.value_counts(sort=False)
       .sort_index()
       )
Output:
00:00:00 38
00:05:00 39
00:10:00 48
00:15:00 33
00:20:00 27
..
23:35:00 34
23:40:00 38
23:45:00 37
23:50:00 41
23:55:00 41
Name: timestamp, Length: 288, dtype: int64
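If you'd rather have a DataFrame with a named count column, matching the shape of res_df in the question, a small illustrative follow-up (the count and time_of_day names are just a choice):

# assumes out from the snippet above
out = out.rename("count").rename_axis("time_of_day").to_frame()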
CodePudding user response:
You could group by the time portion of the timestamps floored to 5 minutes:
df2 = df.groupby(df['timestamp'].dt.floor('5Min').dt.time)['index'].count()
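The result is a Series keyed by datetime.time. If you want it to mirror the question's output, a purely illustrative cleanup is to rename the values and the index:

# illustrative only: the "count" / "time_of_day" names are arbitrary
df2 = df2.rename("count").rename_axis("time_of_day")
print(df2.head())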