I am parsing the dates in my dataset, but I keep running into ParserError because the hours are often in the wrong format. I've decided to skip the hours and focus only on year, month, and day.
These are the variants I have for the date:
| Startdate |
| --- |
| March 23, 2022 6:00 |
| March 23, 2022 7:0 |
| March 23, 2022 7: |
| March 23, 2022 7 |
For now, only the first row parses. I currently skip the other rows, but I would like to include them too by simply ignoring the hours.
from dateutil.parser import parse

for date in df_en['Startdate']:
    try:
        parse(date).date()
    except Exception:
        pass
What is the right way to still parse the other dates without having to bother with hours?
I've tried to convert the time into a valid hour format. Using pd.to_datetime did not work because the month is given as a string (March) rather than a number (3). When I manually changed it to 3, it still raised ValueError: unconverted data remains: :00. Since the hours are not relevant, I just want to skip them.
dates = ['December 1, 2021 6:00', 'March 23, 2022 6']
for date in dates:
    date.replace(' (\d{1})', ' 0\\1')
    pd.to_datetime(date, format='%m %d, %Y %H')
    print(date)
End goal:
| Year | Month | Day |
| --- | --- | --- |
| 2022 | March | 23 |
| 2022 | March | 23 |
CodePudding user response:
I guess you can just drop the hour part:
dates = ['March 23, 2022 6:00', 'March 23, 2022 7:0', 'March 23, 2022 7:', 'March 23, 2022 7']
pd.to_datetime([' '.join(x.split(' ')[:-1]) for x in dates])
DatetimeIndex(['2022-03-23', '2022-03-23', '2022-03-23', '2022-03-23'], dtype='datetime64[ns]', freq=None)
After that you can use df['date'].dt.year (and likewise .dt.month and .dt.day) to extract the year, month, and day.
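A minimal sketch of that approach end to end. It assumes every value ends in a single space-separated hour token (as in the four variants from the question) and that the month names are in English. The column names `date`, `Year`, `Month`, and `Day` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Startdate": [
        "March 23, 2022 6:00",
        "March 23, 2022 7:0",
        "March 23, 2022 7:",
        "March 23, 2022 7",
    ]
})

# Drop the last space-separated token (the malformed hour part).
date_only = df["Startdate"].str.rsplit(" ", n=1).str[0]

# Parse the remainder with an explicit format; %B is the full month name.
df["date"] = pd.to_datetime(date_only, format="%B %d, %Y")

# Split the parsed date into the three target columns.
df["Year"] = df["date"].dt.year
df["Month"] = df["date"].dt.month_name()
df["Day"] = df["date"].dt.day

print(df[["Year", "Month", "Day"]])
```

If some rows had no hour part at all, the `rsplit` step would cut off the year instead, so this only fits data shaped like the examples above.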
CodePudding user response:
If you just need year/month/day columns, there's actually no need to parse to datetime. Just work with the strings by splitting and rearranging, for example:
import pandas as pd
df = pd.DataFrame({'Startdate': ['December 1, 2021 6:00', 'March 23, 2022 6']})
parts = df['Startdate'].str.split(', | ')  # regex: split on ", " or on a single space
df['year'], df['month'], df['day'] = parts.str[2], parts.str[0], parts.str[1]
print(df)
#                Startdate  year     month day
# 0  December 1, 2021 6:00  2021  December   1
# 1       March 23, 2022 6  2022     March  23
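A possible variation on the same string-only idea, in case some rows have no hour part at all: pull the three pieces out with a regular expression anchored at the start of the string, so whatever follows the year is ignored. The pattern, the third sample row, and the name `result` are assumptions for illustration, not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({
    "Startdate": [
        "December 1, 2021 6:00",
        "March 23, 2022 6",
        "March 23, 2022",  # hypothetical row without any hour part
    ]
})

# Named groups become column names in the DataFrame returned by str.extract.
pattern = r"^(?P<month>[A-Za-z]+) (?P<day>\d{1,2}), (?P<year>\d{4})"
parts = df["Startdate"].str.extract(pattern)

# Attach the extracted pieces (still strings) in year/month/day order.
result = df.join(parts[["year", "month", "day"]])
print(result)
```

The extracted values stay strings here. Rows that do not match the pattern come back as NaN, which makes unexpected formats easy to spot.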