Pandas giving me different outputs when compared to Python datetime


This question is about the format parameter of pd.to_datetime in pandas.

The documentation describes it as the input format, using the same directives as Python's strftime.

Using Python's strptime:

import datetime
date = '19981128'
date_python = datetime.datetime.strptime(date, '%Y/%m/%d')

output:

ValueError: time data '19981128' does not match format '%Y/%m/%d'
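For comparison, here is a minimal sketch of how strptime behaves when the format does match the string. It only uses the standard library; the variable names parsed and matched are illustrative and not part of the original question.

```python
from datetime import datetime

date = '19981128'

# '%Y%m%d' has no separators, so it matches the string exactly.
parsed = datetime.strptime(date, '%Y%m%d')
print(parsed)

# The slash-separated format is rejected for the same input.
try:
    datetime.strptime(date, '%Y/%m/%d')
    matched = True
except ValueError:
    matched = False
print(matched)
```

So strptime treats the separators in the format as literal characters that must be present in the input.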

Using the same format with pandas:

import pandas as pd
import datetime

date = '19981128'
date_pandas = pd.to_datetime(date, errors='coerce', format='%Y/%m/%d')
print(date_pandas)

output:

1998-11-28 00:00:00

The pandas documentation clearly says that datetime parsing in pandas replicates the behaviour of Python's strptime, but that is not what happens here. Why?

CodePudding user response:

pandas has a "special fast-path for iso8601 formatted datetime strings". ISO 8601 specifies YMD order for a date.

From pandas\core\tools\datetimes.py:

    if format is not None:
        # There is a special fast-path for iso8601 formatted
        # datetime strings, so in those cases don't use the inferred
        # format because this path makes process slower in this
        # special case
        format_is_iso8601 = format_is_iso(format)
        if format_is_iso8601:
            require_iso8601 = not infer_datetime_format
            format = None

When the provided format passes the format_is_iso() test, the fast-path code is used. Through experimentation, that path accepts a slash, a hyphen, or nothing between the year, month, and day. The last call below shows that a format not in YMD order skips the fast path and is matched strictly, so it fails:

import pandas as pd
print(repr(pd.to_datetime('19981106',format='%Y/%m/%d')))
print(repr(pd.to_datetime('1998/11/06',format='%Y/%m/%d')))
print(repr(pd.to_datetime('1998-11-06',format='%Y/%m/%d')))
print(repr(pd.to_datetime('11061998',format='%m/%d/%Y')))   # not using fast-path, more strict

Output:

Timestamp('1998-11-06 00:00:00')
Timestamp('1998-11-06 00:00:00')
Timestamp('1998-11-06 00:00:00')
Traceback (most recent call last):
   ...
ValueError: time data '11061998' does not match format '%m/%d/%Y' (match)

I don't see a way to force an exact match of the format, unfortunately. There is an exact=True argument but that is already the default.

You could always force strptime:

>>> import pandas as pd
>>> df = pd.DataFrame(['19981128','20220222'])
>>> df
          0
0  19981128
1  20220222
>>> from datetime import datetime
>>> df[0].apply(lambda x: datetime.strptime(x, '%Y/%m/%d')) # now it fails
Traceback (most recent call last):
    ...
ValueError: time data '19981128' does not match format '%Y/%m/%d'
>>> df[0].apply(lambda x: datetime.strptime(x, '%Y%m%d'))
0   1998-11-28
1   2022-02-22
Name: 0, dtype: datetime64[ns]
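If the goal is simply to convert the column rather than to reject mismatched input, a vectorized alternative is to give pd.to_datetime a format that matches the data. This is a small sketch under that assumption; the name converted is illustrative.

```python
import pandas as pd

df = pd.DataFrame(['19981128', '20220222'])

# The format has no separators, just like the strings in the column.
converted = pd.to_datetime(df[0], format='%Y%m%d')
print(converted)
```

This avoids the per-row Python call made by apply, but it does not give the strict mismatch error that strptime provides for a format such as '%Y/%m/%d'.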

CodePudding user response:

The documentation says:

Returns

datetime

If parsing succeeded. Return type depends on input (types in parenthesis correspond to fallback in case of unsuccessful timezone or out-of-range timestamp parsing):

scalar: Timestamp (or datetime.datetime)

(emphasis mine)

and

Notes

Many input types are supported, and lead to different output types:

scalars can be int, float, str, datetime object (from stdlib datetime module or numpy). They are converted to Timestamp when possible, otherwise they are converted to datetime.datetime.

(emphasis mine)

and

The following causes are responsible for datetime.datetime objects being returned [...]

None of the reasons listed there apply to your case.

Your date can be converted into a timestamp, so datetime is not used.

print(pd.Timestamp(date))

Output:

1998-11-28 00:00:00

You can parse with strptime first and pass the result to Pandas, if you like:

import pandas as pd
import datetime
 
date = '19981128'
try:
    date_python = datetime.datetime.strptime(date, '%Y/%m/%d')
    date_pandas = pd.to_datetime(date_python, errors='coerce')
except ValueError:
    date_pandas = pd.NaT
print(date_pandas)
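The same idea can be wrapped for several values at once. The helper below, strict_parse, is a hypothetical name and not a pandas or standard-library function; it is only a sketch of strict validation with strptime, where mismatches become None.

```python
from datetime import datetime

def strict_parse(values, fmt):
    """Parse each string with strptime; strings that do not match fmt become None."""
    parsed = []
    for value in values:
        try:
            parsed.append(datetime.strptime(value, fmt))
        except ValueError:
            parsed.append(None)
    return parsed

# Only the first string contains the slashes required by the format.
results = strict_parse(['1998/11/28', '19981128'], '%Y/%m/%d')
print(results)
```

The resulting list can then be passed to pd.to_datetime, which turns the None entries into NaT.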