Python Data Series .Between function not working as expected

Time:12-26

I am encountering some issues when using the .between method in Python.

I have a simple dataset consisting of ~59000 records


The date format is in DD/MM/YYYY and I would like to filter the days in the month of April in the year 2014.

psi_df = pd.read_csv('thecsvfile.csv')
psi_west_df = psi_df[['24-hr_psi','west']]
april_records = psi_west_df[psi_west_df['24-hr_psi'].between('1/4/2014','31/4/2014')]
april_records.head(100)


I received the output whereby the date suddenly jumps from 3/4/2014 (3rd April) - 10/4/2014 (10th April). This pattern recurs for every month and for every year up till the year 2020 (the final year of this dataset), which was not my original intention of obtaining the data for the month of April in the year 2014.

As I am still rather new to Python, I decided to perform some fixes in Excel instead. I separated the date and the time into their own columns and reran the code with the column names updated.

psi_df = pd.read_csv('psi_new.csv')
psi_west_df = psi_df[['date','west']]
april_records = psi_west_df[psi_west_df['date'].between('1/4/2014','31/4/2014')]
april_records.head(100)

I still faced the same issue and now, I am totally stumped as to why this is occurring. Am I using the .between method wrongly? Seeking everyone's kind guidance and directions as to why this is occurring. Much appreciated and many thanks everyone.

The csv file that I am using can be obtained from this website: enter image description here

CodePudding user response:

The first problem is that your date column isn't a datetime column but an object (string) column, so between compares the values as text rather than as dates.
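To see why string comparison produces the jump from 3/4/2014 to 10/4/2014, here is a small sketch in plain Python. It assumes that between on an object column applies the ordinary `<=` comparison to each string, which is character by character; the list of dates is made up for illustration and is not taken from the CSV file.

```python
# Hypothetical sample of date strings in D/M/YYYY form (not from the real CSV)
dates = ['1/3/2014', '1/4/2014', '3/4/2014', '4/4/2014', '9/4/2014',
         '10/4/2014', '30/4/2014', '1/5/2014', '3/4/2015']

lower, upper = '1/4/2014', '31/4/2014'

# Same test that between() performs, but on plain strings:
# strings are compared character by character, not as calendar dates
kept = [d for d in dates if lower <= d <= upper]
print(kept)
# ['1/4/2014', '3/4/2014', '10/4/2014', '30/4/2014', '1/5/2014', '3/4/2015']
```

'4/4/2014' and '9/4/2014' are dropped because '4' and '9' sort after the '3' of '31/4/2014', while '1/5/2014' and '3/4/2015' are kept because they sort before it. That matches the gaps and the extra months and years in the question's output.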

Make sure your column really is a datetime column by converting it with the pandas to_datetime function.

psi_west_df['date'] = pd.to_datetime(psi_west_df['date'], format='%d/%m/%Y')  

Once the column really is a datetime column, pass between two datetime objects rather than two strings so that it runs without problems. April has only 30 days, so the end date has to be 30/4/2014, not 31/4/2014:

start_day = pd.to_datetime('1/4/2014', format='%d/%m/%Y')
end_day = pd.to_datetime('30/4/2014', format='%d/%m/%Y')
april_records = psi_west_df[psi_west_df['date'].between(start_day, end_day)]

So all together:

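import pandas as pd

psi_df = pd.read_csv('psi_new.csv')
# Copy the slice so that assigning to 'date' does not trigger SettingWithCopyWarning
psi_west_df = psi_df[['date','west']].copy()
psi_west_df['date'] = pd.to_datetime(psi_west_df['date'], format='%d/%m/%Y')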

start_day = pd.to_datetime('1/4/2014', format='%d/%m/%Y')
end_day = pd.to_datetime('30/4/2014', format='%d/%m/%Y')
april_records = psi_west_df[psi_west_df['date'].between(start_day, end_day)]
april_records.head(100)

Note: this code works on the data after you have changed it in Excel, that is, once you have separate columns for the date and the time.
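As an alternative sketch, once the column is a datetime column you could also select the month through the .dt accessor, which avoids having to know the last day of the month. The small DataFrame below is a made-up stand-in for the 'date' and 'west' columns of psi_new.csv; the values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for psi_new.csv (values invented for illustration)
sample = pd.DataFrame({
    'date': ['31/3/2014', '1/4/2014', '9/4/2014', '30/4/2014', '1/5/2014', '9/4/2015'],
    'west': [50, 55, 60, 58, 62, 70],
})
sample['date'] = pd.to_datetime(sample['date'], format='%d/%m/%Y')

# Keep only rows whose date falls in April 2014
is_april_2014 = (sample['date'].dt.year == 2014) & (sample['date'].dt.month == 4)
april_records = sample[is_april_2014]
print(april_records)
```

Only the three April 2014 rows remain; March and May 2014 and April 2015 are filtered out.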
