A common source of errors in my Python codebase is dates: specifically, the different implementations of dates and datetimes, and how comparisons between them are handled.
These are the date types in my codebase:
import datetime
import pandas as pd
import pendulum
import polars as pl
x1 = pd.to_datetime('2020-10-01')
x2 = datetime.datetime(2020, 10, 1)
x3 = pl.DataFrame({'i':[x2]}).select(pl.col('i').cast(pl.Date)).to_numpy()[0,0]
x4 = pl.DataFrame({'i':[x2]}).select(pl.col('i').cast(pl.Datetime)).to_numpy()[0,0]
x5 = pendulum.parse('2020-10-01')
x6 = x5.date()
x7 = x1.date()
You can print them to see:
x1=2020-10-01 00:00:00 , type(x1)=<class 'pandas._libs.tslibs.timestamps.Timestamp'>
x2=2020-10-01 00:00:00 , type(x2)=<class 'datetime.datetime'>
x3=2020-10-01 , type(x3)=<class 'numpy.datetime64'>
x4=2020-10-01T00:00:00.000000 , type(x4)=<class 'numpy.datetime64'>
x5=2020-10-01T00:00:00+00:00 , type(x5)=<class 'pendulum.datetime.DateTime'>
x6=2020-10-01 , type(x6)=<class 'pendulum.date.Date'>
x7=2020-10-01 , type(x7)=<class 'datetime.date'>
Is there a canonical date representation in Python? I suppose x7 (datetime.date) is probably the closest...
Comparisons are also a nightmare. Here is a table of the results of xi == xj, with xi as the row and xj as the column:
xi \ xj | x1 | x2 | x3 | x4 | x5 | x6 | x7 |
---|---|---|---|---|---|---|---|
x1: <class 'pandas._libs.tslibs.timestamps.Timestamp'> | True | True | ERROR: Only resolutions 's', 'ms', 'us', 'ns' are supported. | True | False | True | True |
x2: <class 'datetime.datetime'> | True | True | False | True | False | False | False |
x3: <class 'numpy.datetime64'> | True | False | True | True | False | True | True |
x4: <class 'numpy.datetime64'> | True | True | True | True | False | False | False |
x5: <class 'pendulum.datetime.DateTime'> | False | False | False | False | True | False | False |
x6: <class 'pendulum.date.Date'> | True | True | True | False | False | True | True |
x7: <class 'datetime.date'> | True | False | True | False | False | True | True |
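Note that the table is not even symmetric: for example, the (x1, x3) cell is an ERROR while the (x3, x1) cell is True.
Ordering comparisons are even stranger.
[Image: table of xi >= xj results, with red cells representing an ERROR]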
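The standard-library pair from the table (datetime.datetime and datetime.date, i.e. x2 and x7) is enough to reproduce both effects without any third-party package. A minimal sketch; the variable names are illustrative only:

```python
import datetime

d = datetime.date(2020, 10, 1)
dt = datetime.datetime(2020, 10, 1)

# Equality between a datetime and a date does not raise; it is simply False.
eq_result = (dt == d)

# Ordering comparisons between the two raise TypeError instead.
try:
    dt >= d
    ordering_error = None
except TypeError as exc:
    ordering_error = type(exc).__name__

# Converting both sides to the same type first makes the comparison well defined.
same_type_eq = (dt.date() == d)
same_type_ge = (dt.date() >= d)

print(eq_result, ordering_error, same_type_eq, same_type_ge)
```

So even with the standard library alone, the same calendar day compares as unequal unless both operands are brought to one type first.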
As you can imagine, there is an ever-growing amount of glue code to keep this under control. Is there any advice on how to handle date and datetime types in Python?
For simplicity:
- I never need timezone data, everything should always be UTC
- Sometimes dates are passed around as strings for convenience (e.g. parsed from JSON)
- I at most need seconds resolution, but 99% of my work uses only dates.
CodePudding user response:
All of the listed types can be converted to numpy's datetime64. If you don't need more than seconds resolution, you can optionally set the unit to 's'. For example:
import numpy as np
# Python datetime.datetime
x2_np = np.datetime64(x2.replace(tzinfo=None), 's')
print(x2_np, repr(x2_np))
# 2020-10-01T00:00:00 numpy.datetime64('2020-10-01T00:00:00')
# datetime.date (x6 is a pendulum.Date, a datetime.date subclass)
x6_np = np.datetime64(x6, 's')
print(x6_np, repr(x6_np))
# 2020-10-01T00:00:00 numpy.datetime64('2020-10-01T00:00:00')
# pendulum datetime
x5_np = np.datetime64(x5.replace(tzinfo=None), 's')
print(x5_np, repr(x5_np))
# 2020-10-01T00:00:00 numpy.datetime64('2020-10-01T00:00:00')
# pd.Timestamp
x1_np = x1.to_numpy().astype('datetime64[s]')
print(x1_np, repr(x1_np))
# 2020-10-01T00:00:00 numpy.datetime64('2020-10-01T00:00:00')
Since numpy's datetime64 is timezone-naive (everything is assumed to be UTC here), make sure to remove the tzinfo from datetime.datetime and pendulum.DateTime objects, should it be set there.
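One caveat worth spelling out: replace(tzinfo=None) only discards the offset, it does not shift the clock time. That is harmless when the value is already in UTC, but for any other zone the value should be converted to UTC before the tzinfo is dropped. A small standard-library sketch; to_naive_utc is a hypothetical helper name, and treating naive input as UTC is an assumption taken from the question:

```python
import datetime

def to_naive_utc(dt):
    """Return a timezone-naive datetime expressed in UTC (hypothetical helper)."""
    if dt.tzinfo is None:
        # Assumption: naive values are already meant as UTC.
        return dt
    # Shift to UTC first, then drop the tzinfo.
    return dt.astimezone(datetime.timezone.utc).replace(tzinfo=None)

plus_two = datetime.timezone(datetime.timedelta(hours=2))
aware = datetime.datetime(2020, 10, 1, 1, 30, tzinfo=plus_two)

dropped = aware.replace(tzinfo=None)  # 2020-10-01 01:30:00, offset silently discarded
converted = to_naive_utc(aware)       # 2020-09-30 23:30:00, the same instant in UTC

print(dropped, converted)
```

Here the two results even fall on different calendar days, which matters if most of the work is date-only.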
You could put all of this into one converter function that is essentially a big switch-case. Use it with caution on big datasets, however: it handles one value per call, and that convenience rarely comes for free. For example:
def convert_dt_to_numpy(dt, unit='s'):
    # pd.Timestamp subclasses datetime.datetime, so it has to be checked first
    if isinstance(dt, pd.Timestamp):
        return dt.to_numpy().astype(f'datetime64[{unit}]')
    if isinstance(dt, (datetime.datetime, pendulum.DateTime)):
        return np.datetime64(dt.replace(tzinfo=None), unit)
    if isinstance(dt, (datetime.date, pendulum.Date)):
        return np.datetime64(dt, unit)
    raise ValueError(f"conversion for '{dt}' of {type(dt)} unknown")

# the trailing 7 is not a date type, so the last iteration raises ValueError
for dt in (x1, x2, x6, x5, 7):
    print(convert_dt_to_numpy(dt))
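The question also mentions dates arriving as strings (e.g. from JSON), which the converter above does not cover. One possible way to handle that, sketched with the standard library only, is a pre-processing step that turns strings, dates and datetimes into a naive datetime at seconds resolution, which could then be handed to convert_dt_to_numpy. The name coerce_to_datetime is hypothetical, and the sketch assumes the strings are ISO 8601 formatted:

```python
import datetime

def coerce_to_datetime(value):
    """Hypothetical pre-processing step: str / date / datetime -> naive datetime (seconds)."""
    if isinstance(value, str):
        # Assumption: strings are ISO 8601, e.g. '2020-10-01' or '2020-10-01T12:30:45'.
        value = datetime.datetime.fromisoformat(value)
    if isinstance(value, datetime.datetime):
        if value.tzinfo is not None:
            # Convert to UTC before dropping the tzinfo.
            value = value.astimezone(datetime.timezone.utc).replace(tzinfo=None)
        # Truncate to seconds resolution.
        return value.replace(microsecond=0)
    if isinstance(value, datetime.date):
        # A plain date becomes midnight of that day.
        return datetime.datetime(value.year, value.month, value.day)
    raise TypeError(f"cannot interpret {value!r} of {type(value)} as a date")

print(coerce_to_datetime("2020-10-01"))
print(coerce_to_datetime(datetime.date(2020, 10, 1)))
```

The datetime check comes before the date check because datetime.datetime is itself a subclass of datetime.date.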