I am trying to read the February 2019 FHV trip data in Parquet format, found here:
https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2019-02.parquet
However, when I try to read the data with pandas:
df = pd.read_parquet('fhv_tripdata_2019-02.parquet')
It throws the error:
File "pyarrow/table.pxi", line 1156, in pyarrow.lib.table_to_blocks
File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 33106123800000000
Does anyone know how to print the offending rows, coerce these values, or make pandas ignore these rows?
CodePudding user response:
One of the rows in that data set has its dropOff_datetime set to 3019-02-03 17:30:00.000000. This is out of bounds for pandas.Timestamp. I think it was meant to be 2019-02-03 17:30:00.000000.
One option is to read the file with pyarrow and turn off the safety check, so the error is ignored:
import pyarrow.parquet as pq
df = pq.read_table('fhv_tripdata_2019-02.parquet').to_pandas(safe=False)
But then the wrong timestamp overflows during the conversion to nanoseconds and ends up as a meaningless value:
>>> df['dropOff_datetime'].min()
Timestamp('1849-12-25 18:20:52.580896768')
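To see where that odd 1849 value comes from, here is a minimal sketch of the arithmetic using only the standard library. It assumes the unsafe cast simply multiplies the microsecond value by 1000 and lets it wrap around as a signed 64-bit integer; the helper name wrap_int64 is made up for illustration and is not part of pandas or pyarrow.

```python
from datetime import datetime, timedelta

# Microseconds since the epoch, taken from the ArrowInvalid message.
US_VALUE = 33106123800000000

def wrap_int64(n: int) -> int:
    """Reinterpret n modulo 2**64 as a signed 64-bit integer (hypothetical helper)."""
    return (n + 2**63) % 2**64 - 2**63

EPOCH = datetime(1970, 1, 1)

# The value as stored in the file (microsecond resolution).
original = EPOCH + timedelta(microseconds=US_VALUE)

# Converting to nanoseconds exceeds the int64 range, so the value wraps around.
ns_value = US_VALUE * 1000
overflows = ns_value > 2**63 - 1
wrapped_ns = wrap_int64(ns_value)

# datetime only has microsecond resolution, so drop the last three digits.
wrapped = EPOCH + timedelta(microseconds=wrapped_ns // 1000)

print(original)    # 3019-02-03 17:30:00
print(overflows)   # True
print(wrapped)     # 1849-12-25 18:20:52.580896
```

The wrapped result matches the minimum shown above up to microsecond precision, which suggests the 1849 timestamp is that single 3019 row after overflow.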
Alternatively, you can filter out the values that are out of bounds in pyarrow before converting to pandas:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc
table = pq.read_table("fhv_tripdata_2019-02.parquet")
# Keep only rows whose drop-off time fits into a pandas Timestamp
df = table.filter(
    pc.less_equal(table["dropOff_datetime"], pa.scalar(pd.Timestamp.max))
).to_pandas()