Home > Blockchain >  Pandas Date Formatting (With Optional Milliseconds)
Pandas Date Formatting (With Optional Milliseconds)

Time:06-23

I'm getting data from an API and putting it into a Pandas DataFrame. The date column needs formatting into date/time, which I am doing. However the API sometimes returns dates without milliseconds which doesn't match the format pattern. This results in an error:

time data '2020-07-30T15:57:37Z' does not match format '%Y-%m-%dT%H:%M:%S.%fZ' (match)

In this example, how can I format the date column to date/time, so all dates are formatted with milliseconds?

import pandas as pd

dates = {
    'date': ['2020-07-30T15:57:37Z', '2020-07-30T15:57:37.1Z']
}

df = pd.DataFrame(dates)

df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%dT%H:%M:%S.%fZ')

print(df)

CodePudding user response:

do it one time with milliseconds included and another time without milliseconds included. use errors='coerce' to return NaT when ValueError occurs.

with_miliseconds = pd.to_datetime(df['date'], format='%Y-%m-%dT%H:%M:%S.%fZ',errors='coerce')
without_miliseconds = pd.to_datetime(df['date'], format='%Y-%m-%dT%H:%M:%SZ',errors='coerce')

the results would be something like this:

with milliseconds:
0                       NaT
1   2020-07-30 15:57:37.100
Name: date, dtype: datetime64[ns]

without milliseconds:
0   2020-07-30 15:57:37
1                   NaT
Name: date, dtype: datetime64[ns]

then you can fill NaTs of one dataframe with values of the other because they complement each other.

with_miliseconds.fillna(without_miliseconds)
0   2020-07-30 15:57:37.000
1   2020-07-30 15:57:37.100
Name: date, dtype: datetime64[ns]

CodePudding user response:

As your date string seems like the standard ISO 8601 you can just avoid the use of the format param. The parser will take into account that miliseconds are optional.

import pandas as pd

dates = {
    'date': ['2020-07-30T15:57:37Z', '2020-07-30T15:57:37.1Z']
}

df = pd.DataFrame(dates)

df['date'] = pd.to_datetime(df['date'])

print(df)

CodePudding user response:

To have a consistent format in your output DataFrame, you could run a Regex replacement before converting to a df for all values without mills.

dates = {'date': [re.sub(r'Z', '.0Z', date) if '.' not in date else date for date in dates['date']]}

Since only those dates containing a . have mills, we can run the replacements on the others.
After that, everything else is the same as in your code.
Output:

                     date
0 2020-07-30 15:57:37.000
1 2020-07-30 15:57:37.100
  • Related