Pandas read_parquet() Error: pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp

I am trying to read the 02-2019 FHV trip data in Parquet format, found here:

https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2019-02.parquet

However, when I try to read the data with pandas:

import pandas as pd
df = pd.read_parquet('fhv_tripdata_2019-02.parquet')

It throws the error:

  File "pyarrow/table.pxi", line 1156, in pyarrow.lib.table_to_blocks
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 33106123800000000

Does anyone know how to print out the offending rows, coerce these values, or make pandas ignore these rows?

CodePudding user response:

One of the rows in that data set has its dropOff_datetime set to 3019-02-03 17:30:00.000000. This is out of bounds for pandas.Timestamp, which stores nanoseconds in a signed 64-bit integer and so cannot go past 2262-04-11. I think it was meant to be 2019-02-03 17:30:00.000000.
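To check that reading of the error, the raw number in the message can be decoded by hand. This is a minimal sketch using only the standard library; it assumes the value is microseconds since the Unix epoch (which is what timestamp[us] means), and the names bad_us and max_ns_as_us are made up for the illustration.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

# Value quoted in the ArrowInvalid message, in microseconds since the epoch.
bad_us = 33106123800000000

# Largest signed 64-bit nanosecond count, truncated to whole microseconds.
# This is the upper limit of a nanosecond-resolution timestamp.
max_ns_as_us = (2**63 - 1) // 1000

bad_dt = EPOCH + timedelta(microseconds=bad_us)
limit_dt = EPOCH + timedelta(microseconds=max_ns_as_us)

print(bad_dt)                 # the timestamp stored in the file
print(limit_dt)               # last microsecond a nanosecond timestamp can hold
print(bad_us > max_ns_as_us)  # True means the cast to ns cannot succeed
```

The decoded value lands in the year 3019, well past the nanosecond limit in 2262, which is why the cast is rejected.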

One option is to ignore that error by turning off the safe-cast check:

import pyarrow.parquet as pq

df = pq.read_table('fhv_tripdata_2019-02.parquet').to_pandas(safe=False)

But then the bad timestamp overflows and wraps around to a meaningless value:

>>> df['dropOff_datetime'].min()
Timestamp('1849-12-25 18:20:52.580896768')
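That 1849 value is consistent with plain signed 64-bit wraparound. The sketch below reproduces it with the standard library only; wrap_int64 is a hypothetical helper written for this illustration, not a pyarrow function, and the assumption is that the unsafe cast simply multiplies by 1000 and lets the result overflow.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

# Microsecond value from the error message.
bad_us = 33106123800000000


def wrap_int64(n):
    """Reduce an arbitrary Python int to the signed 64-bit range."""
    return (n + 2**63) % 2**64 - 2**63


# us -> ns is a multiplication by 1000, which no longer fits in 64 bits.
wrapped_ns = wrap_int64(bad_us * 1000)

# datetime only has microsecond precision, so the last three digits are dropped.
wrapped_dt = EPOCH + timedelta(microseconds=wrapped_ns // 1000)

print(wrapped_ns)
print(wrapped_dt)
```

The wrapped value falls on 1849-12-25, matching the minimum shown above, so the result of safe=False is garbage rather than a usable date.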

Alternatively, you can filter out the out-of-bounds values in pyarrow before converting to pandas:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc

table = pq.read_table("fhv_tripdata_2019-02.parquet")
df = table.filter(
    pc.less_equal(table["dropOff_datetime"], pa.scalar(pd.Timestamp.max))
).to_pandas()