Home > Back-end >  filtering duplicates within specific timeframe
filtering duplicates within specific timeframe

Time:08-12

I have a dataframe that have about 300K unique phone numbers that span over a period of about 6 months. What I want to do is to find the numbers that had made 3 or more calls within 1 hour.

I have tried the solution in the below link, but it compares all occurrences with only the first record.

pandas dataframe: duplicates based on column and time range

My dataframe consists of 2 columns, 'Date Time' and 'Phone No'. I'm really new to this, any help will be greatly appreciated!

CodePudding user response:

Let's start with creating some random data we can experiment with (using 3 digit phone numbers in order to get some results at the end):

import pandas as pd
import random
df = pd.DataFrame()
df['Date Time'] = [random.randint(1644307257, 1659859257) for x in range(100000)]
df['Phone No'] = [random.randint(111, 999) for x in range(100000)]
df['Date Time'] = pd.to_datetime(df['Date Time'], unit='s')

now let's focus on a single phone number and create a function that counts how many phone calls where in every rolling 1 hour time window:

def count_in_hour(df0):
    df0.sort_values('Date Time', inplace=True)
    df0.set_index(df0['Date Time'], inplace=True)
    df0['count_in_hour'] = df0.rolling('1h').count()['Phone No']
    df0.reset_index(drop=True, inplace=True)
    return df0

now groupby the phone numbers and apply the function to each group (this is a heavy loop, could take a couple of minutes):

df_with_count = df.groupby('Phone No').apply(count_in_hour)

now we can find the ones with 3 phone calls in one hour:

df_with_count[df_with_count['count_in_hour']>=3]
  • Related