Converting string to date format 0000-00-00 before extracting year and month information

Time:11-18

I have a string column (object type):

Date
2020-06-15
2019-07-23
Data non available

How can I convert the string Data non available and/or any missing values to the format 0000-00-00, before extracting the month and the year?

I can convert the Date column to datetime, and extract information on year and month from the other rows, but then I get this error:

ParserError: Unknown string format: Data non available

My code:

df['Date'] = pd.to_datetime(df['Date'])
df['Date_Year'], df['Date_month'] = df['Date'].dt.year, df['Date'].dt.month

Expected output:

Date             Date_Year        Date_month
2020-06-15          2020              06
2019-07-23          2019              07
Data non available  0000              00

CodePudding user response:

I'd replace 'Data non available' with NaT (Not a Time, the NaN equivalent for dates and times) before calling to_datetime:

df['Date'] = df['Date'].replace({'Data non available': pd.NaT})

Then

df['Date'] = pd.to_datetime(df['Date'])
df['Date_Year'], df['Date_month'] = df['Date'].dt.year, df['Date'].dt.month

Output:

>>> df
        Date  Date_Year  Date_month
0 2020-06-15     2020.0         6.0
1 2019-07-23     2019.0         7.0
2        NaT        NaN         NaN
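
The year and month show up as floats (2020.0, 6.0) because a plain integer column cannot hold the NaN produced by the NaT row. A minimal sketch of one way around that, assuming a pandas version with the nullable 'Int64' dtype; the frame below is rebuilt by hand from the question's sample data:

```python
import pandas as pd

# The Date column as it looks after the conversion above (third row is NaT)
df = pd.DataFrame({'Date': pd.to_datetime(['2020-06-15', '2019-07-23', None])})

# .dt.year / .dt.month come back as floats when a NaT is present;
# casting to the nullable 'Int64' dtype keeps whole numbers and marks the gap as <NA>
df['Date_Year'] = df['Date'].dt.year.astype('Int64')
df['Date_month'] = df['Date'].dt.month.astype('Int64')

print(df)
```

The missing row stays missing; only the dtype of the two derived columns changes.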

CodePudding user response:

Do you really want to convert the 'Data non available' to the format '0000-00-00'? pandas.to_datetime will not like that. I'm assuming that you just want to mark them as missing.

In that case you can use pandas.to_datetime with errors='coerce'. From the docs:

errors {‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’

  • If ‘raise’, then invalid parsing will raise an exception.
  • If ‘coerce’, then invalid parsing will be set as NaT.
  • If ‘ignore’, then invalid parsing will return the input.

df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['Date_Year'], df['Date_month'] = df['Date'].dt.year, df['Date'].dt.month

>>> df

        Date  Date_Year  Date_month
0 2020-06-15     2020.0         6.0
1 2019-07-23     2019.0         7.0
2        NaT        NaN         NaN 
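
If the zero-padded layout from the question is still wanted for display, one option is to leave the original strings in Date and build the two extra columns as text. This is only a sketch: the intermediate name parsed is made up for illustration, and the resulting columns hold strings, not numbers or dates.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({'Date': ['2020-06-15', '2019-07-23', 'Data non available']})

# Parse into a separate Series so the original strings in 'Date' are untouched
parsed = pd.to_datetime(df['Date'], errors='coerce')

# strftime leaves NaN where the date is NaT, so fillna can supply the placeholder
df['Date_Year'] = parsed.dt.strftime('%Y').fillna('0000')
df['Date_month'] = parsed.dt.strftime('%m').fillna('00')

print(df)
```

Because '0000' and '00' are plain strings, these columns are suitable for reporting but not for date arithmetic.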

One caveat of errors='coerce': some values may be something other than 'Data non available' and still fail to parse. Because every failure is silently converted to NaT (Not a Time), you would not be able to tell these cases apart or spot the problem. To convert only 'Data non available' to NaT, mask just those values before converting to datetime, so that any other unparseable value still raises an error:

mask = df['Date'] == 'Data non available'
df['Date'] = pd.to_datetime(df['Date'].mask(mask))
df['Date_Year'], df['Date_month'] = df['Date'].dt.year, df['Date'].dt.month

>>> df

        Date  Date_Year  Date_month
0 2020-06-15     2020.0         6.0
1 2019-07-23     2019.0         7.0
2        NaT        NaN         NaN 
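
A complementary check, sketched under the assumption that you would rather list the problem values than hit an exception: coerce once, then compare against the raw strings to see which entries failed. The sample Series and the malformed value '2019-13-45' are hypothetical.

```python
import pandas as pd

# Hypothetical sample: the last entry is a malformed date, not the sentinel
raw = pd.Series(['2020-06-15', 'Data non available', '2019-13-45'], name='Date')

parsed = pd.to_datetime(raw, errors='coerce')

# Entries that were present in the input but became NaT failed to parse
failed = raw[parsed.isna() & raw.notna()]

# Drop the known sentinel to leave only the unexpected failures
unexpected = failed[failed != 'Data non available']

print(unexpected)
```

An empty result means the sentinel was the only thing that could not be parsed.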