filtering duplicates within specific timeframe-CodePudding

I have a dataframe that have about 300K unique phone numbers that span over a period of about 6 months. What I want to do is to find the numbers that had made 3 or more calls within 1 hour.

I have tried the solution in the below link, but it compares all occurrences with only the first record.

pandas dataframe: duplicates based on column and time range

My dataframe consists of 2 columns, 'Date Time' and 'Phone No'. I'm really new to this, any help will be greatly appreciated!

CodePudding user response：

Let's start with creating some random data we can experiment with (using 3 digit phone numbers in order to get some results at the end):

import pandas as pd
import random
df = pd.DataFrame()
df['Date Time'] = [random.randint(1644307257, 1659859257) for x in range(100000)]
df['Phone No'] = [random.randint(111, 999) for x in range(100000)]
df['Date Time'] = pd.to_datetime(df['Date Time'], unit='s')

now let's focus on a single phone number and create a function that counts how many phone calls where in every rolling 1 hour time window:

def count_in_hour(df0):
    df0.sort_values('Date Time', inplace=True)
    df0.set_index(df0['Date Time'], inplace=True)
    df0['count_in_hour'] = df0.rolling('1h').count()['Phone No']
    df0.reset_index(drop=True, inplace=True)
    return df0

now groupby the phone numbers and apply the function to each group (this is a heavy loop, could take a couple of minutes):

df_with_count = df.groupby('Phone No').apply(count_in_hour)

now we can find the ones with 3 phone calls in one hour:

df_with_count[df_with_count['count_in_hour']>=3]