Pyarrow timestamp keeps converting to 1970

I'm trying to store a timestamp alongside the other data in my dataframe, recording the time the data was written to disk, in a Parquet file. Normally I'd just store the timestamp in the pandas dataframe itself, but when I run pa.Table.from_pandas(), pyarrow complains that it will lose precision converting pandas' nanosecond timestamps to microseconds, no matter what I do. A workaround is to append the timestamp directly as a column of the table; however, for some reason pyarrow keeps converting the timestamp to 1970. I have tried several workarounds, but none of them work.

Below is a working code example that reproduces the issue. It doesn't actually append the column to the table, but it shows the problem: the timestamp returned by datetime.now().timestamp() is correct, but once it's converted to a pyarrow array it resets to 1970.

from datetime import datetime
import pyarrow as pa
import numpy as np
import pandas as pd

# 20 rows x 10 columns of random floats, with string column names
df = pd.DataFrame(np.random.uniform(size=(20, 10)))
df.columns = [str(i) for i in range(df.shape[1])]
schema = pa.schema([(str(i), pa.float32()) for i in range(df.shape[1])])

ts = datetime.now().timestamp()
print('DateTime timestamp:', ts)
table = pa.Table.from_pandas(df, schema)
pa_ts = pa.array([ts] * len(table), pa.timestamp('us'))
print('PyArrow timestamp:', pa_ts)

And here's the output I get:

DateTime timestamp: 1650817852.093818
PyArrow timestamp: [
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852
]
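
The 1970 dates in this output are consistent with the epoch value being read in microseconds rather than seconds. The arithmetic can be checked with the standard library alone; this is only a sketch of the unit mismatch, not of what pyarrow does internally, and the names EPOCH and raw are illustrative:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Whole-number part of the value printed above (fractional part dropped).
raw = 1650817852

# Read as microseconds since the epoch: about 27.5 minutes into 1970.
as_microseconds = EPOCH + timedelta(microseconds=raw)

# Read as seconds since the epoch: a date in April 2022.
as_seconds = EPOCH + timedelta(seconds=raw)

print('As microseconds:', as_microseconds)
print('As seconds:     ', as_seconds)
```

The first value matches the 00:27:30.817852 shown in the array, which suggests the number itself is fine and only the unit attached to it is wrong.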

Answer:

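As FObersteiner mentioned, the problem was that I was telling pyarrow to interpret the value as a microsecond-level timestamp. datetime.now().timestamp() returns seconds since the epoch, so with pa.timestamp('us') a value like 1650817852 is read as roughly 1650 seconds after midnight on 1970-01-01. In case anyone runs into this in the future, the fix is as simple as changing the 'us' above to 's'. And if you want millisecond-level timestamps, you can do it like so: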

from datetime import datetime
import pyarrow as pa
import numpy as np
import pandas as pd

# 20 rows x 10 columns of random floats, with string column names
df = pd.DataFrame(np.random.uniform(size=(20, 10)))
df.columns = [str(i) for i in range(df.shape[1])]
schema = pa.schema([(str(i), pa.float32()) for i in range(df.shape[1])])

# Scale seconds to milliseconds and make the value an explicit integer
ts = int(datetime.now().timestamp() * 1000)
print('DateTime timestamp:', ts)
table = pa.Table.from_pandas(df, schema)
pa_ts = pa.array([ts] * len(table), pa.timestamp('ms'))
print('PyArrow timestamp:', pa_ts)