I have a dataframe that looks like this:
Name | Site | Time |
---|---|---|
Manual | BCN | 3/10/2022 11:23:13 PM |
Manual | BCN | 3/10/2022 11:38:47 PM |
Automatic | Madrid | 3/10/2022 11:40:32 PM |
Manual | BCN | 3/10/2022 11:39:47 PM |
Manual | BCN | 3/11/2022 12:44:47 AM |
It consists of a Name column, a Site and a Time. What I'm looking for is to count rows where Name and Site are equal and the gap between consecutive Time values is less than 20 minutes. In this case the output would be Manual, BCN -> 3 times, as the last row is about an hour away from the others. The data is sorted by Time.
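For reference, the sample above can be rebuilt as a DataFrame like this (Time still held as strings at this point):

```python
import pandas as pd

# sample data taken from the table above
df = pd.DataFrame({
    "Name": ["Manual", "Manual", "Automatic", "Manual", "Manual"],
    "Site": ["BCN", "BCN", "Madrid", "BCN", "BCN"],
    "Time": ["3/10/2022 11:23:13 PM", "3/10/2022 11:38:47 PM",
             "3/10/2022 11:40:32 PM", "3/10/2022 11:39:47 PM",
             "3/11/2022 12:44:47 AM"],
})
```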
What I have tried is to group by Name and Site and then apply a diff to Time, to no avail.
import pandas as pd

df['Time'] = pd.to_datetime(df['Time'])
# flag rows whose gap to the previous row (per Site/Name) is not exactly 20 minutes,
# then cumsum the flags to build consecutive group ids
g = (df.groupby(['Site', 'Name'])['Time'].diff().ne(pd.Timedelta(minutes=20))
       .groupby([df['Site'], df['Name']]).cumsum())
groups = df.groupby(['Site', g])['Time']
new_df = df.assign(count=groups.transform('size'))
This is returning the count of all values, not just the ones that fulfil the timedelta condition. The file itself is quite big, with many different Name and Site combinations.
Many thanks
Edit 1. To clarify, I'm comparing value pairs: the first row with the second one, then the second with the third, and so on. I'm exploring a solution with a for loop filtering by Name and Site; a rough sketch of that idea is below.
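A minimal sketch of that loop idea, assuming `df['Time']` has already been converted with `pd.to_datetime`. It counts consecutive pairs per Name/Site that are less than 20 minutes apart (the three qualifying Manual/BCN rows above form two such pairs):

```python
import pandas as pd
from collections import defaultdict

counts = defaultdict(int)
for (name, site), grp in df.groupby(['Name', 'Site']):
    times = grp['Time'].sort_values().tolist()
    # compare each timestamp with the previous one of the same Name/Site pair
    for prev, curr in zip(times, times[1:]):
        if curr - prev < pd.Timedelta(minutes=20):
            counts[(name, site)] += 1
```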
Thanks
CodePudding user response:
IIUC, try:
df["Time"] = pd.to_datetime(df["Time"])
df = df.sort_values("Time", ignore_index=True)
output = (df.groupby(["Name", "Site"])["Time"].apply(lambda x: x.diff()
.dt
.total_seconds()
.div(60)
.fillna(0)
.le(20)
.sum()
)
)
>>> output
Name Site
Automatic Madrid 1
Manual BCN 3
Name: Time, dtype: int64
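If the per-group counts are needed back on every row (like the `new_df` attempted in the question), one possible follow-up, assuming the `output` Series from above, would be:

```python
# attach the per-group counts to every row of the original dataframe
new_df = df.merge(output.rename("count"),
                  left_on=["Name", "Site"], right_index=True)
```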