Home > Net >  How can i get the top 10 most frequent values between 2 dates from a csv with pandas?
How can i get the top 10 most frequent values between 2 dates from a csv with pandas?

Time:10-01

Essentially I have a csv file which has an OFFENCE_CODE column and a column with some dates called OFFENCE_MONTH. The code I have provided retrieves the 10 most frequently occuring offence codes within the OFFENCE_CODE column, however I need to be able to do this between 2 dates from the OFFENCE_MONTH column.

import numpy as np
import pandas as pd

input_date1 = 2012/11/1
input_date2 = 2013/11/1


df = pd.read_csv("penalty_data_set.csv", dtype='unicode', usecols=['OFFENCE_CODE', 'OFFENCE_MONTH'])
print(df['OFFENCE_CODE'].value_counts().nlargest(10))

CodePudding user response:

You can use pandas.Series.between :

df['OFFENCE_MONTH'] = pd.to_datetime(df['OFFENCE_MONTH'])

input_date1 = pd.to_datetime('2012/11/1')
input_date2 = pd.to_datetime('2013/11/1')

m = df['OFFENCE_MONTH'].between(input_date1, input_date2)

df.loc[m, 'OFFENCE_CODE'].value_counts().nlargest(10)

CodePudding user response:

You can do this if it is per month:

import pandas as pd

input_date1 = 2012/11/1
input_date2 = 2013/11/1

# example dataframe
# df = pd.read_csv("penalty_data_set.csv", dtype='unicode', usecols=['OFFENCE_CODE', 'OFFENCE_MONTH'])
d = {'OFFENCE_MONTH':[1,1,1,2,3,4,4,5,6,12],
'OFFENCE_CODE':['a','a','b','d','r','e','f','g','h','a']}
df = pd.DataFrame(d)
print(df)

# make a filter (example here)
df_filter = df.loc[(df['OFFENCE_MONTH']>=1) & (df['OFFENCE_MONTH']<5)]
print(df_filter)

# arrange the filter
print(df_filter['OFFENCE_CODE'].value_counts().nlargest(10))


example result:

a    2
b    1
d    1
r    1
e    1
f    1

CodePudding user response:

First you need to convert the date in OFFENCE_MONTH column to datetime :

from datetime import datetime 

datetime.strptime(input_date1, "%Y-%m-%d") 
datetime.strptime(input_date2, "%Y-%m-%d")
datetime.strptime(df['OFFENCE_MONTH'], "%Y-%m-%d")

Then Selecting rows based on your conditions:

rslt_df = df[df['OFFENCE_MONTH'] >= input_date1 and df['OFFENCE_MONTH'] <= input_date2]
print(rslt_df['OFFENCE_CODE'].value_counts().nlargest(10))
  • Related