I want to filter rows on the basis of each deviceIds and do some operation on those rows. (Pandas)-CodePudding

I want to check for time gaps in test_date_time column:

device_id           eventDateTime      firstPacketDateTime  lastPacketDateTime  test_date_time
MC2C5HRC0KF445779   27/12/2021 0:45    27/12/2021 0:45      27/12/2021 0:59     0:45
MC2C5HRC0KF445779   27/12/2021 1:00    27/12/2021 1:00      27/12/2021 1:13     1:00
MC2C5HRC0KF445779   27/12/2021 1:15    27/12/2021 1:16      27/12/2021 1:29     1:15
MC2C5HRC0KF445779   27/12/2021 1:30    27/12/2021 1:30      27/12/2021 1:42     1:30
MC2H3JRC0LC178168   27/12/2021 10:45   27/12/2021 10:46     27/12/2021 10:56    10:45
MC2H3JRC0LC178168   27/12/2021 11:15   27/12/2021 11:15     27/12/2021 11:25    11:15
MC2H3JRC0LC178168   27/12/2021 11:30   27/12/2021 11:35     27/12/2021 11:35    11:30

Time should be increasing from 0:00 to 23:45 with each index of the column have a value with gap of 15 mins. Eg:

test_date_time
0:00
0:15
0:30
0:45
1:00
1:15
1:30

I want to separate out those rows having time gap of more than 15 mins and store them inside another dataframe.

How can I implement this without using a for loop?

CodePudding user response：

df = pd.read_csv(r"C:\...\input.csv")

# Creating new column with the type casted times
df["test_dt"] = pd.to_timedelta(df['test_date_time'] ":00")

# Another new column with just the difference of the consecutive values
df["delta_test_dt"] = df["test_dt"].diff()

# Now creating a timedelta object for the 15 minutes gap
delta = pd.to_timedelta("15m")

# Filtering the dataframe
new_df = df[df["delta_test_dt"] > delta]

Ref: https://pandas.pydata.org/docs/reference/api/pandas.to_timedelta.html

This will store rows 4 and 5 in the new data frame (indexing from 0).

CodePudding user response：

Convert eventDateTime (or test_date_time) to datetime types. Then use df.shift() to create a new column prev which is the evenDataTime's previous value. Then filter on those which are more than 15 mins apart. (Note that you should probably do this by device_id and not all rows together.)

import datetime

df['eventDateTime'] = pd.to_datetime(df['eventDateTime'])
df['prev'] = df['eventDateTime'].shift()
df[df['eventDateTime'] - df['prev'] > datetime.timedelta(minutes=15)]

Result, which you can assign to a new dataframe:

           device_id       eventDateTime firstPacketDateTime lastPacketDateTime test_date_time                prev
4  MC2H3JRC0LC178168 2021-12-27 10:45:00    27/12/2021 10:46   27/12/2021 10:56          10:45 2021-12-27 01:30:00
5  MC2H3JRC0LC178168 2021-12-27 11:15:00    27/12/2021 11:15   27/12/2021 11:25          11:15 2021-12-27 10:45:00

If you don't want a new column prev, then drop it after or:

df['eventDateTime'] = pd.to_datetime(df['eventDateTime'])
df[df['eventDateTime'] - df['eventDateTime'].shift() > datetime.timedelta(minutes=15)]

(that can also be reduced to one very unreadable line:

df[pd.to_datetime(df['eventDateTime']) - pd.to_datetime(df['eventDateTime']).shift() > datetime.timedelta(minutes=15)]

And with test_date_time column:

df[pd.to_datetime(df['test_date_time']) - pd.to_datetime(df['test_date_time']).shift() > \
   datetime.timedelta(minutes=15)]