Does the unit passed to the datetime64 data type in pandas do anything?
Consider this code:
import pandas as pd
v1 = pd.DataFrame({'Date':['2020-01-01']*1000}).astype({'Date':'datetime64'})
v2 = pd.DataFrame({'Date':['2020-01-01']*1000}).astype({'Date':'datetime64[ns]'})
v3 = pd.DataFrame({'Date':['2020-01-01']*1000}).astype({'Date':'datetime64[ms]'})
v4 = pd.DataFrame({'Date':['2020-01-01']*1000}).astype({'Date':'datetime64[s]'})
v5 = pd.DataFrame({'Date':['2020-01-01']*1000}).astype({'Date':'datetime64[h]'})
v6 = pd.DataFrame({'Date':['2020-01-01']*1000}).astype({'Date':'datetime64[D]'})
v7 = pd.DataFrame({'Date':['2020-01-01']*1000}).astype({'Date':'datetime64[M]'})
v8 = pd.DataFrame({'Date':['2020-01-01']*1000}).astype({'Date':'datetime64[Y]'})
for v in [v1,v2,v3,v4,v5,v6,v7,v8]:
    x = v.iloc[0,0]
    print(x, type(x), x.to_datetime64(), v.memory_usage()['Date'])
It returns:
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'> 2020-01-01T00:00:00.000000000 8000
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'> 2020-01-01T00:00:00.000000000 8000
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'> 2020-01-01T00:00:00.000000000 8000
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'> 2020-01-01T00:00:00.000000000 8000
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'> 2020-01-01T00:00:00.000000000 8000
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'> 2020-01-01T00:00:00.000000000 8000
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'> 2020-01-01T00:00:00.000000000 8000
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'> 2020-01-01T00:00:00.000000000 8000
CodePudding user response:
First of all: the Pandas version of the datetime64 type differs only in that it adds timezone support. Specifically, when you try to use a plain datetime64 variant in a Pandas series, it'll only support the as (attosecond), fs (femtosecond), ps (picosecond) and ns (nanosecond) resolutions; anything less precise is replaced by datetime64[ns]. The timezone-aware datetime64[<res>, <tz>] variant only accepts the s (seconds), ms (milliseconds), us (microseconds) and ns resolutions. Don't confuse these with the numpy datetime64 type.
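To make that coercion visible, you can check the resulting dtype directly. A minimal sketch, using the same pattern as the question (this reflects the behaviour of the pandas version used there; newer pandas releases may honour or reject non-ns units instead of silently coercing them):

import pandas as pd

# The requested 'D' unit is not kept; the column comes back as datetime64[ns]
# on the pandas version the question was run against.
df = pd.DataFrame({'Date': ['2020-01-01'] * 3}).astype({'Date': 'datetime64[D]'})
print(df['Date'].dtype)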
For both Pandas and Numpy, the unit abbreviation determines the resolution used to record the timestamps, and because the type is always stored as 64 bits, it determines the range of values you can store in it. It does not alter how much memory the type takes!
From the numpy datetime64 Datetime Units documentation:
Datetimes are always stored with an epoch of 1970-01-01T00:00. This means the supported dates are always a symmetric interval around the epoch, called “time span” in the table below.
The length of the span is the range of a 64-bit integer times the length of the date or time unit. For example, the time span for ‘W’ (week) is exactly 7 times longer than the time span for ‘D’ (day), and the time span for ‘D’ (day) is exactly 24 times longer than the time span for ‘h’ (hour).
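A rough way to see that scaling is to interpret the same 64-bit tick count at different units; the coarser the unit, the further the resulting date lies from the epoch. A small sketch (the tick value is arbitrary, chosen only for illustration):

import numpy as np

# One fixed tick count, read at different resolutions: each step to a
# coarser unit multiplies the distance from the 1970 epoch accordingly.
ticks = 10**18
for unit in ('ns', 'us', 's', 'D'):
    print(unit, np.datetime64(ticks, unit))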
Your experiment won't show any difference in memory use, because the amount of memory doesn't change, only the resolution.
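You can confirm the fixed storage size on the numpy side with a quick sketch:

import numpy as np

# Every datetime64 variant is backed by a 64-bit integer, so the itemsize
# is 8 bytes no matter which unit you pick.
for unit in ('Y', 'D', 's', 'ns'):
    arr = np.array(['2020-01-01'], dtype=f'datetime64[{unit}]')
    print(unit, arr.dtype, arr.itemsize)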
Because Pandas wraps the numpy datetime64 type, you can't actually create a series with anything other than datetime64[ns]. For example, the DatetimeIndex dtype parameter is documented as accepting a numpy.dtype or DatetimeTZDtype or str, default None, but for numpy.dtype there is an additional restriction:
Note that the only NumPy dtype allowed is ‘datetime64[ns]’.
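One practical consequence of that ns-only restriction is the representable date range, which is fixed by the 64-bit nanosecond span around the epoch:

import pandas as pd

# Everything is stored as datetime64[ns], so pandas timestamps cover
# roughly 584 years centred on 1970-01-01.
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807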
So to demonstrate the effect of different units, you'd have to use the numpy type directly:
>>> import numpy as np
>>> for unit in ('Y', 'M', 'W', 'D', 'h', 'm', 's', 'ms', 'us', 'ns'): # ps, fs and as have too small a span
... print(unit, np.array(["2021-02-27T12:24:17.524627869"], dtype=f"datetime64[{unit}]"))
...
Y ['2021']
M ['2021-02']
W ['2021-02-25']
D ['2021-02-27']
h ['2021-02-27T12']
m ['2021-02-27T12:24']
s ['2021-02-27T12:24:17']
ms ['2021-02-27T12:24:17.524']
us ['2021-02-27T12:24:17.524627']
ns ['2021-02-27T12:24:17.524627869']
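And if you hand such a coarser-resolution array back to Pandas, it is converted to nanosecond resolution again (at least on the pandas versions described above; more recent releases handle non-ns numpy units differently):

>>> import pandas as pd
>>> pd.Series(np.array(['2021-02-27'], dtype='datetime64[D]')).dtype
dtype('<M8[ns]')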