Home > Back-end >  How to group pandas timestamp column according to whether they are within a threshold time duration
How to group pandas timestamp column according to whether they are within a threshold time duration

Time:05-04

Say I have the following dataframe:

import pandas as pd

df = pd.DataFrame(['4/20/2022 8:23', '4/20/2022 8:20', '4/20/2022 8:20', '4/20/2022 8:20', '4/20/2022 8:20', '4/20/2022 8:20', '4/20/2022 8:17', '4/20/2022 8:17', '4/20/2022 8:17', '4/20/2022 8:17', '4/20/2022 9:56', '4/20/2022 9:50', '4/20/2022 9:50', '4/20/2022 9:48', '4/20/2022 9:48', '4/20/2022 9:48', '4/20/2022 9:48', '4/20/2022 9:47'],
                  columns=['timestamp'])

enter image description here

I want to add another column called Cluster which will put all the rows in the same cluster ID if they occur within, say, a half hour time duration. So, rows 0 to 9 should be cluster 0, and rows 10 to 17 should be cluster 1.

How do I do this in the best way?

I'm doing it in quite a patchy way (but at least correctly, I hope):

import numpy as np

df.timestamp = pd.to_datetime(df.timestamp)
df['time_diff'] = np.zeros(len(df))

for i in range(len(df)-1):
    df.time_diff[i 1]=np.abs(df.timestamp[i 1]-df.timestamp[i])/ np.timedelta64(1, 'h')
    
cluster = 0
df['cluster'] = np.zeros(len(df))
for i in range(len(df)):
    if df.time_diff[i]>0.5:
        cluster =1
        df.cluster[i:]=cluster
df

enter image description here

I wonder if there's a Pythonic way of doing this, a one-liner perhaps.

CodePudding user response:

there is a feature precisely for the temporal sequence conversion.
her name is resample.

Below is an example of usage:

import numpy as np
import pandas as pd

start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'

rng = pd.date_range(start, end, freq='7min')
df = pd.Series(np.arange(len(rng)), index=rng)
print("before conversion:")
print(df)
print("after conversion:")
print(df.resample('H').sum())

documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html

CodePudding user response:

You may extract the hour column as follows:

    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['hour'] = df['time_stamp'].apply(lambda row: row.hour)
  • Related