I have a dataframe that looks like this:
Name | Site | Time |
---|---|---|
Manual | BCN | 3/10/2022 11:23:13 PM |
Manual | BCN | 3/10/2022 11:38:47 PM |
Automatic | Madrid | 3/10/2022 11:40:32 PM |
Manual | BCN | 3/10/2022 11:39:47 PM |
Manual | BCN | 3/11/2022 12:44:47 AM |
It consists of a Name column, a Site and a Time. What I'm looking for is to count rows where Name and Site are equal and the gap between consecutive Time values is less than 20 minutes. In this case the output would be Manual, BCN -> 3 times, as the last row is about an hour away from the others. The data is sorted by Time.
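For reference, the sample above can be rebuilt as a DataFrame like this (Time still held as strings at this point):

```python
import pandas as pd

# sample data taken from the table above
df = pd.DataFrame({
    "Name": ["Manual", "Manual", "Automatic", "Manual", "Manual"],
    "Site": ["BCN", "BCN", "Madrid", "BCN", "BCN"],
    "Time": ["3/10/2022 11:23:13 PM", "3/10/2022 11:38:47 PM",
             "3/10/2022 11:40:32 PM", "3/10/2022 11:39:47 PM",
             "3/11/2022 12:44:47 AM"],
})
```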
What I have tried is to group by Name and Site and then apply a diff to Time, to no avail.
import pandas as pd

df['Time'] = pd.to_datetime(df['Time'])
# flag rows whose gap to the previous row (per Site/Name) is not exactly 20 minutes,
# then cumsum the flags to build consecutive group ids
g = (df.groupby(['Site', 'Name'])['Time'].diff().ne(pd.Timedelta(minutes=20))
       .groupby([df['Site'], df['Name']]).cumsum())
groups = df.groupby(['Site', g])['Time']
new_df = df.assign(count=groups.transform('size'))
This is returning the count of all values, not just the ones that fulfil the timedelta condition. The file itself is quite big, with many different Name and Site combinations.
Many thanks
Edit 1. To clarify, I'm comparing value pairs: the first row with the second one, then the second with the third, and so on. I'm exploring a solution with a for loop filtering by Name and Site; a rough sketch of that idea is below.
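A minimal sketch of that loop idea, assuming `df['Time']` has already been converted with `pd.to_datetime`. It counts consecutive pairs per Name/Site that are less than 20 minutes apart (the three qualifying Manual/BCN rows above form two such pairs):

```python
import pandas as pd
from collections import defaultdict

counts = defaultdict(int)
for (name, site), grp in df.groupby(['Name', 'Site']):
    times = grp['Time'].sort_values().tolist()
    # compare each timestamp with the previous one of the same Name/Site pair
    for prev, curr in zip(times, times[1:]):
        if curr - prev < pd.Timedelta(minutes=20):
            counts[(name, site)] += 1
```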
Thanks
CodePudding user response:
IIUC, try:
df["Time"] = pd.to_datetime(df["Time"])
df = df.sort_values("Time", ignore_index=True)
output = (df.groupby(["Name", "Site"])["Time"].apply(lambda x: x.diff()
.dt
.total_seconds()
.div(60)
.fillna(0)
.le(20)
.sum()
)
)
>>> output
Name Site
Automatic Madrid 1
Manual BCN 3
Name: Time, dtype: int64
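If the per-group counts are needed back on every row (like the `new_df` attempted in the question), one possible follow-up, assuming the `output` Series from above, would be:

```python
# attach the per-group counts to every row of the original dataframe
new_df = df.merge(output.rename("count"),
                  left_on=["Name", "Site"], right_index=True)
```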