python3 - to_parquet data format

Time:10-05

When I save a DataFrame with a datetime64[ns] column as parquet using pandas.to_parquet and then read the parquet file back, the datetime64[ns] column is converted to a Unix timestamp.

eg. 2022-10-05 19:31:57.894835 -> 1664998317894835000

Is it not possible to save the datetime64[ns] column as it is?

CodePudding user response:

The datetime64[ns] dtype of a pd.DataFrame column is specific to pandas or, to be more precise, to NumPy.

It does not correspond one-to-one to the types defined by the Parquet format (source: Apache's official Parquet docs).
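To see where the large integer in the question comes from, here is a minimal sketch using only NumPy. It shows that a datetime64[ns] value is stored as a 64-bit count of nanoseconds since the Unix epoch; the variable names are made up for illustration.

```python
import numpy as np

# One-element array holding the timestamp from the question, at nanosecond resolution.
stamps = np.array(["2022-10-05T19:31:57.894835"], dtype="datetime64[ns]")

# Reinterpreting the values as int64 exposes the underlying nanosecond count.
raw = stamps.astype("int64")

print(stamps.dtype)  # datetime64[ns]
print(raw[0])        # 1664998317894835000
```

So the number you get back is the same instant, just without the datetime dtype attached.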

You should also check which engine you are using to generate the parquet file. According to the pandas API reference for to_parquet, the default is engine='auto', which tries pyarrow first and falls back to fastparquet if pyarrow is unavailable.

If pyarrow is your engine, then these type differences apply:

https://arrow.apache.org/docs/python/pandas.html#type-differences

The same Arrow documentation also suggests the proper handling:

If you want to use NumPy’s datetime64 dtype instead, pass date_as_object=False:

In [26]: s2 = pd.Series(arr.to_pandas(date_as_object=False))

In [27]: s2.dtype
Out[27]: dtype('<M8[ns]')

Bonus: if the file is read or reloaded in Spark, you can convert the Unix timestamp afterwards with Spark SQL's datetime functions (see the spark-sql datetime docs).
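If the data ends up back in pandas rather than Spark, the same after-the-fact conversion can be sketched like this. The frame below is a hypothetical stand-in for a column that arrived as nanosecond integers.

```python
import pandas as pd

# Hypothetical frame whose "ts" column was loaded as integer nanoseconds.
df = pd.DataFrame({"ts": [1664998317894835000]})

# unit="ns" tells pandas the integers count nanoseconds since the Unix epoch.
df["ts"] = pd.to_datetime(df["ts"], unit="ns")

print(df["ts"].iloc[0])  # 2022-10-05 19:31:57.894835
```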
