I have a string column (object type):
Date
2020-06-15
2019-07-23
Data non available
How can I convert the string 'Data non available' and/or any missing values to the format 0000-00-00 before extracting the month and the year?
I can convert the Date column to datetime and extract the year and month from the other rows, but then I get this error:
ParserError: Unknown string format: Data non available
My code:
df['Date'] = pd.to_datetime(df['Date'])
df['Date_Year'], df['Date_month'] = df['Date'].dt.year, df['Date'].dt.month
Expected output:
Date Date_Year Date_month
2020-06-15 2020 06
2019-07-23 2019 07
Data non available 0000 00
CodePudding user response:
I'd replace 'Data non available' with NaT (Not a Time, the NaN equivalent for dates/times) before you call to_datetime:
df['Date'] = df['Date'].replace({'Data non available': pd.NaT})
Then
df['Date'] = pd.to_datetime(df['Date'])
df['Date_Year'], df['Date_month'] = df['Date'].dt.year, df['Date'].dt.month
Output:
>>> df
Date Date_Year Date_month
0 2020-06-15 2020.0 6.0
1 2019-07-23 2019.0 7.0
2 NaT NaN NaN
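The year and month show up as floats (2020.0, 6.0) because the NaN in the last row forces the columns to float64. As a minimal sketch, assuming a pandas version with the nullable "Int64" dtype, the columns can be cast so that they stay integers and the missing row shows as <NA>. The sample frame below just mirrors the data from the question.

```python
import pandas as pd

# Object-dtype column, as described in the question.
df = pd.DataFrame(
    {"Date": ["2020-06-15", "2019-07-23", "Data non available"]},
    dtype=object,
)

# Swap the placeholder for NaT, then parse the remaining strings.
df["Date"] = pd.to_datetime(df["Date"].replace({"Data non available": pd.NaT}))

# NaN forces float64; the nullable "Int64" dtype keeps whole numbers
# and represents the missing row as <NA>.
df["Date_Year"] = df["Date"].dt.year.astype("Int64")
df["Date_month"] = df["Date"].dt.month.astype("Int64")

print(df)
```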
CodePudding user response:
Do you really want to convert 'Data non available' to the format '0000-00-00'? pandas.to_datetime will not like that, because year 0 and month 0 are not valid datetime values. I'm assuming that you just want to mark them as missing.
In that case you can use pandas.to_datetime with errors='coerce'. From the docs:
errors {‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’
- If ‘raise’, then invalid parsing will raise an exception.
- If ‘coerce’, then invalid parsing will be set as NaT.
- If ‘ignore’, then invalid parsing will return the input.
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['Date_Year'], df['Date_month'] = df['Date'].dt.year, df['Date'].dt.month
>>> df
Date Date_Year Date_month
0 2020-06-15 2020.0 6.0
1 2019-07-23 2019.0 7.0
2 NaT NaN NaN
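If the zero placeholders from the question's expected output are still needed for display, one option is to keep the datetime column as it is and build the year and month as strings. This is only a sketch of that idea, not something the question's code requires: it relies on dt.strftime returning a missing value for NaT rows, which fillna then replaces with '0000' and '00'.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2020-06-15", "2019-07-23", "Data non available"]})

# Unparseable strings become NaT instead of raising an error.
parsed = pd.to_datetime(df["Date"], errors="coerce")

# strftime leaves NaT rows missing, so fillna can supply the zero placeholders.
df["Date_Year"] = parsed.dt.strftime("%Y").fillna("0000")
df["Date_month"] = parsed.dt.strftime("%m").fillna("00")

print(df)
```

Note that the resulting columns hold strings, so they are suitable for display or export rather than arithmetic.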
However, you may have dates that are not 'Data non available' but couldn't be parsed correctly for some reason. With errors='coerce' you wouldn't notice, because they would silently be converted to NaT (Not a Time) as well. To make sure you only convert 'Data non available' to NaT, mask just those values before converting to datetime, so that any other unparseable value still raises an error:
mask = df['Date'] == 'Data non available'
df['Date'] = pd.to_datetime(df['Date'].mask(mask))
df['Date_Year'], df['Date_month'] = df['Date'].dt.year, df['Date'].dt.month
>>> df
Date Date_Year Date_month
0 2020-06-15 2020.0 6.0
1 2019-07-23 2019.0 7.0
2 NaT NaN NaN
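If you would rather inspect the problematic values than get an exception, a possible variation is to coerce first and then list every row that became NaT without being the known placeholder. This is a sketch under the assumption that 'Data non available' is the only expected placeholder; the invalid date '2021-13-45' is made up for illustration.

```python
import pandas as pd

# Hypothetical sample: one valid date, one invalid date, one placeholder.
raw = pd.Series(["2020-06-15", "2021-13-45", "Data non available"], name="Date")

parsed = pd.to_datetime(raw, errors="coerce")

# Rows that failed to parse although they were neither missing
# nor the known placeholder string.
unexpected = raw[parsed.isna() & raw.notna() & (raw != "Data non available")]

print(unexpected.tolist())  # ['2021-13-45']
```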