I have a dataframe that have about 300K unique phone numbers that span over a period of about 6 months. What I want to do is to find the numbers that had made 3 or more calls within 1 hour.
I have tried the solution in the below link, but it compares all occurrences with only the first record.
pandas dataframe: duplicates based on column and time range
My dataframe consists of 2 columns, 'Date Time' and 'Phone No'. I'm really new to this, any help will be greatly appreciated!
CodePudding user response:
Let's start with creating some random data we can experiment with (using 3 digit phone numbers in order to get some results at the end):
import pandas as pd
import random
df = pd.DataFrame()
df['Date Time'] = [random.randint(1644307257, 1659859257) for x in range(100000)]
df['Phone No'] = [random.randint(111, 999) for x in range(100000)]
df['Date Time'] = pd.to_datetime(df['Date Time'], unit='s')
now let's focus on a single phone number and create a function that counts how many phone calls where in every rolling 1 hour time window:
def count_in_hour(df0):
df0.sort_values('Date Time', inplace=True)
df0.set_index(df0['Date Time'], inplace=True)
df0['count_in_hour'] = df0.rolling('1h').count()['Phone No']
df0.reset_index(drop=True, inplace=True)
return df0
now groupby the phone numbers and apply the function to each group (this is a heavy loop, could take a couple of minutes):
df_with_count = df.groupby('Phone No').apply(count_in_hour)
now we can find the ones with 3 phone calls in one hour:
df_with_count[df_with_count['count_in_hour']>=3]