group datetime column by 5 minutes increment only for time of day (ignoring date) and count

Time:03-27

I have a dataframe with a timestamp column (of type datetime) and some other columns whose content doesn't matter. I'm trying to group the rows into 5-minute intervals and count them, ignoring the date and considering only the time of day.

One can generate an example dataframe using this code:

import numpy as np
import pandas as pd

def get_random_dates_df(
        n=10000,
        start=pd.to_datetime('2015-01-01'),
        period_duration_days=5,
        seed=None
    ):
    # Default to seed 0 so the example is reproducible (from piR's answer)
    np.random.seed(0 if seed is None else seed)
    end = start + pd.Timedelta(period_duration_days, 'd')  # not used below
    n_seconds = int(period_duration_days * 3600 * 24)
    # Uniformly random offsets (in seconds) added to the start date
    random_dates = pd.to_timedelta(n_seconds * np.random.rand(n), unit='s') + start
    return pd.DataFrame(data={"timestamp": random_dates}).reset_index()

df = get_random_dates_df()

It would look like this:

   index                     timestamp
0      0 2015-01-03 17:51:27.433696604
1      1 2015-01-04 13:49:21.806272885
2      2 2015-01-04 00:19:53.778462950
3      3 2015-01-03 17:23:09.535054659
4      4 2015-01-03 02:50:18.873314407

I think I have a working solution but it seems overly complicated:

gpd_df = df.groupby(pd.Grouper(key="timestamp", freq="5min")).agg(
    count=("index", "count")
).reset_index()
gpd_df["time_of_day"] = gpd_df["timestamp"].dt.time
# Sum only the count column; summing the datetime column raises in recent pandas
res_df = gpd_df.groupby("time_of_day")[["count"]].sum()

Output:

             count
time_of_day       
00:00:00        38
00:05:00        39
00:10:00        48
00:15:00        33
00:20:00        27
...            ...
23:35:00        34
23:40:00        38
23:45:00        37
23:50:00        41
23:55:00        41

[288 rows x 1 columns]

Is there a better way to solve this?

CodePudding user response:

I'd suggest something like this, to avoid trying to merge the results of two groupbys together:

gpd_df = df.copy()
gpd_df["time_of_day"] = gpd_df["timestamp"].apply(lambda x: x.replace(year=2000, month=1, day=1))
gpd_df = gpd_df.set_index("time_of_day")
res_df = gpd_df.resample("5min").size()

It works by setting the year/month/day to fixed values and applying the built-in resampling function.
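
For larger frames, the same re-anchoring idea can probably be done without a row-wise apply. This is only a sketch under my own assumptions: the sample data, the variable names and the 2000-01-01 anchor date are all made up for illustration. It subtracts each row's midnight to get the time since midnight, adds that to one fixed day, and resamples.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 1000 random timestamps spread over 5 days
rng = np.random.default_rng(0)
ts = pd.Series(
    pd.Timestamp("2015-01-01")
    + pd.to_timedelta(rng.integers(0, 5 * 24 * 3600, size=1000), unit="s"),
    name="timestamp",
)

# Time elapsed since midnight of each row's own day
since_midnight = ts - ts.dt.normalize()

# Re-anchor every row on the same arbitrary day (the date itself is irrelevant)
anchored = pd.Timestamp("2000-01-01") + since_midnight

# Count rows per 5-minute bin of that single day
counts = (
    pd.Series(1, index=pd.DatetimeIndex(anchored))
    .sort_index()
    .resample("5min")
    .size()
)
```

The result is indexed by timestamps on the anchor day; counts.index.time gives plain time-of-day labels if those are preferred.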

CodePudding user response:

What about flooring the datetimes to 5min, extracting only the time, and using value_counts?

out = (df['timestamp']
       .dt.floor('5min')
       .dt.time.value_counts(sort=False)
       .sort_index()
      )

Output:

00:00:00    38
00:05:00    39
00:10:00    48
00:15:00    33
00:20:00    27
            ..
23:35:00    34
23:40:00    38
23:45:00    37
23:50:00    41
23:55:00    41
Name: timestamp, Length: 288, dtype: int64
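
One thing to keep in mind: value_counts only lists time slots that actually occur. With 10000 random rows every slot is hit, but with sparse data some of the 288 slots would be missing from the result. If explicit zeros are wanted, reindexing against all slots of a day should work. A small sketch, using a made-up three-row sample and hypothetical variable names:

```python
import datetime as dt
import pandas as pd

# Hypothetical sparse sample: three timestamps on three different days
ts = pd.Series(pd.to_datetime([
    "2015-01-01 00:01:10",
    "2015-01-02 00:03:59",
    "2015-01-03 12:34:56",
]), name="timestamp")

# Same approach as above: floor to 5 minutes, keep the time, count
counts = ts.dt.floor("5min").dt.time.value_counts(sort=False).sort_index()

# All 288 five-minute slots of a day as datetime.time objects
all_slots = pd.date_range("2000-01-01", periods=288, freq="5min").time

# Slots with no rows get an explicit 0
full = counts.reindex(all_slots, fill_value=0)
```

Here counts has only two entries (00:00:00 and 12:30:00), while full has all 288.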

CodePudding user response:

You could group by the time portion of the timestamps floored to 5 minutes:

df2 = df.groupby(df['timestamp'].dt.floor('5min').dt.time)['index'].count()
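
As a quick sanity check, the groupby and value_counts variants can be compared on the same data. The frame below is a stand-in I generated for illustration (it is not the exact frame from the question), and the variable names are hypothetical:

```python
import numpy as np
import pandas as pd

# Stand-in data: 10000 random timestamps over 5 days, plus an "index" column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "timestamp": pd.Timestamp("2015-01-01")
    + pd.to_timedelta(rng.integers(0, 5 * 24 * 3600, size=10_000), unit="s")
}).reset_index()

# Shared grouping key: time of day floored to 5 minutes
key = df["timestamp"].dt.floor("5min").dt.time

via_value_counts = key.value_counts(sort=False).sort_index()
via_groupby = df.groupby(key)["index"].count()

# Both should carry the same labels and the same counts
same_labels = list(via_value_counts.index) == list(via_groupby.index)
same_counts = bool((via_value_counts.to_numpy() == via_groupby.to_numpy()).all())
```

Both flags should come out True, since the two variants count rows under the same key.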