Pandas read_parquet() Error: pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] w

Time:11-18

I am trying to read the 02-2019 FHV trip data in Parquet format, found here:

https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2019-02.parquet

However, when I try to read the data with pandas:

import pandas as pd

df = pd.read_parquet('fhv_tripdata_2019-02.parquet')

It throws the error:

  File "pyarrow/table.pxi", line 1156, in pyarrow.lib.table_to_blocks
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 33106123800000000

Does anyone know how to print out the offending rows or coerce these values? Or how to make pandas ignore these rows?

Answer:

One of the rows in that data set has its dropOff_datetime set to 3019-02-03 17:30:00.000000. This is out of bounds for pandas.Timestamp, whose nanosecond resolution cannot represent dates after 2262-04-11. I think it was meant to be 2019-02-03 17:30:00.000000.

One option is to ignore that error by disabling the safety check on the conversion:

import pyarrow.parquet as pq

df = pq.read_table('fhv_tripdata_2019-02.parquet').to_pandas(safe=False)

But then the bad timestamp overflows during the cast and wraps around to a meaningless value:

>>> df['dropOff_datetime'].min()
Timestamp('1849-12-25 18:20:52.580896768')
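To see where that 1849 date comes from, here is a rough sketch of the arithmetic using only the standard library. It assumes the unchecked cast behaves like a plain signed 64-bit (two's complement) wraparound, which does reproduce the value above. The helper name wrap_int64 is made up for illustration and is not part of pyarrow or pandas.

```python
from datetime import datetime, timedelta

# Microseconds since the epoch, copied from the error message.
US_VALUE = 33106123800000000
EPOCH = datetime(1970, 1, 1)


def wrap_int64(value: int) -> int:
    """Reduce an integer to the signed 64-bit range (two's complement)."""
    return (value + 2**63) % 2**64 - 2**63


# The value stored in the file is perfectly representable in microseconds.
original = EPOCH + timedelta(microseconds=US_VALUE)

# Scaling to nanoseconds exceeds the int64 range, so an unchecked cast wraps.
wrapped_ns = wrap_int64(US_VALUE * 1000)

# datetime only has microsecond precision, so the last three digits are dropped.
wrapped = EPOCH + timedelta(microseconds=wrapped_ns // 1000)

print(original)    # 3019-02-03 17:30:00
print(wrapped_ns)  # -3787364347419103232
print(wrapped)     # 1849-12-25 18:20:52.580896
```

So with safe=False the row is not dropped. It silently turns into a plausible-looking but wrong date, which is easy to miss later.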

Alternatively, you can filter out the out-of-bounds values in pyarrow before converting to pandas:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc

table = pq.read_table("fhv_tripdata_2019-02.parquet")
df = table.filter(
    pc.less_equal(table["dropOff_datetime"], pa.scalar(pd.Timestamp.max))
).to_pandas()