I am encountering some issues when using the .between method in Python.
I have a simple dataset consisting of ~59000 records
The date format is in DD/MM/YYYY and I would like to filter the days in the month of April in the year 2014.
psi_df = pd.read_csv('thecsvfile.csv')
psi_west_df = psi_df[['24-hr_psi','west']]
april_records = psi_west_df[psi_west_df['24-hr_psi'].between('1/4/2014','31/4/2014')]
april_records.head(100)
I received the output whereby the date suddenly jumps from 3/4/2014 (3rd April) - 10/4/2014 (10th April). This pattern recurs for every month and for every year up till the year 2020 (the final year of this dataset), which was not my original intention of obtaining the data for the month of April in the year 2014.
As I am still rather new to python, I decided to perform some fixes in Excel instead. I separated the date and the time columns and reran the code with the necessary syntax updated.
psi_df = pd.read_csv('psi_new.csv')
psi_west_df = psi_df[['date','west']]
april_records = psi_west_df[psi_west_df['date'].between('1/4/2014','31/4/2014')]
april_records.head(100)
I still faced the same issue and now, I am totally stumped as to why this is occurring. Am I using the .between method wrongly? Seeking everyone's kind guidance and directions as to why this is occurring. Much appreciated and many thanks everyone.
The csv file that I am using can be obtained from this website:
CodePudding user response:
The first problem is your date column isn't a date but an object column.
Ensure you column is really a date by using the pandas to_datetime
function.
psi_west_df['date'] = pd.to_datetime(psi_west_df['date'], format='%d/%m/%Y')
After the column is really a date column in order for the between
function to run with no problems you should give it two date object and not string object like this:
start_day = pd.to_datetime('1/4/2014', format='%d/%m/%Y')
end_day = pd.to_datetime('30/4/2014', format='%d/%m/%Y')
april_records = psi_west_df[psi_west_df['date'].between(start_day, end_day)]
So all together:
psi_df = pd.read_csv('psi_new.csv')
psi_west_df = psi_df[['date','west']]
psi_west_df['date'] = pd.to_datetime(psi_west_df['date'], format='%d/%m/%Y')
start_day = pd.to_datetime('1/4/2014', format='%d/%m/%Y')
end_day = pd.to_datetime('30/4/2014', format='%d/%m/%Y')
april_records = psi_west_df[psi_west_df['date'].between(start_day, end_day)]
april_records.head(100)
Note - this code should work on the data after you change it with excel, meaning you have a separate column for data and time.