dataframe with the datetime64[ns] column as parquet using pandas.to_parquet and then read the parquet file back, the datetime64[ns] column is converted to a Unix timestamp.
e.g. 2022-10-05 19:31:57.894835 -> 1664998317894835000
Is it not possible to save the datetime64[ns] column as it is?
CodePudding user response:
The datetime64[ns] dtype of a pd.DataFrame column is specific to pandas or, more precisely, to NumPy.
It does not map directly onto the types supported by the Parquet format (source: Apache's official Parquet docs).
You should also check which engine you are using to generate the parquet file. According to the pandas API reference for to_parquet, if the engine is not specified explicitly, the default ('auto') tries pyarrow first and falls back to fastparquet.
If pyarrow is your engine, then these type differences apply:
https://arrow.apache.org/docs/python/pandas.html#type-differences
The Arrow documentation also suggests the proper handling:
If you want to use NumPy's datetime64 dtype instead, pass date_as_object=False:
In [26]: s2 = pd.Series(arr.to_pandas(date_as_object=False))
In [27]: s2.dtype
Out[27]: dtype('<M8[ns]')
Bonus: if the file is read or reloaded in Spark, you can afterwards use Spark SQL's datetime functions to convert the Unix timestamp.
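If you stay in pandas and already have the integer values, the conversion back can be sketched like this. The variable names are illustrative; the integer is the one from the question, interpreted as nanoseconds since the Unix epoch.

```python
import pandas as pd

# Nanoseconds since the Unix epoch, taken from the question.
raw_ns = 1664998317894835000

# A single value becomes a pandas Timestamp.
ts = pd.to_datetime(raw_ns, unit="ns")
print(ts)  # 2022-10-05 19:31:57.894835

# A whole integer column is converted the same way.
s = pd.Series([raw_ns], dtype="int64")
restored = pd.to_datetime(s, unit="ns")
print(restored.dtype)
```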