How do I specify a dtype for all columns when reading a CSV file with pyarrow?


I want to read a big CSV file with pyarrow. All of my columns are float64, but pyarrow seems to be inferring int64.

How do I specify a dtype for all columns?

import gcsfs
import pyarrow.dataset as ds

fs = gcsfs.GCSFileSystem(project='my-google-cloud-project')

my_dataset = ds.dataset("bucket/foo/bar.csv", format="csv", filesystem=fs)

my_dataset.to_table()

which produces:

ArrowInvalid                              Traceback (most recent call last)
........py in <module>
----> 65 my_dataset.to_table()

File /opt/conda/envs/py39/lib/python3.9/site-packages/pyarrow/_dataset.pyx:491, in pyarrow._dataset.Dataset.to_table()

File /opt/conda/envs/py39/lib/python3.9/site-packages/pyarrow/_dataset.pyx:3235, in pyarrow._dataset.Scanner.to_table()

File /opt/conda/envs/py39/lib/python3.9/site-packages/pyarrow/error.pxi:143, in pyarrow.lib.pyarrow_internal_check_status()

File /opt/conda/envs/py39/lib/python3.9/site-packages/pyarrow/error.pxi:99, in pyarrow.lib.check_status()

ArrowInvalid: In CSV column #172: Row #28: CSV conversion error to int64: invalid value '6.58841482364418'

CodePudding user response:

Pyarrow's dataset module reads CSV files in chunks (the default is 1MB, I think) and processes those chunks in parallel. That makes column inference a bit tricky, so it infers the data types from the first chunk only. The error you are getting is very common when a column looks integral in the first chunk of the file but has decimal values in later chunks.
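One thing you could try (not something from the question, just a sketch under the assumption that the decimal values show up reasonably early in the file) is to hand the CSV format a larger read block via csv.ReadOptions(block_size=...), so that the chunk used for inference covers more rows:

import pyarrow.csv as csv
import pyarrow.dataset as ds

# Larger blocks mean the first chunk used for type inference covers more rows.
# block_size is in bytes; 16 MB here is an arbitrary example value.
read_options = csv.ReadOptions(block_size=16 * 1024 * 1024)
csv_format = ds.CsvFileFormat(read_options=read_options)
dataset = ds.dataset('/tmp/foo.csv', format=csv_format)

This only helps if the column actually shows a decimal value within that first block, so specifying the types explicitly (below) is the more reliable fix.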

If you know the column names in advance then you can specify the data types of the columns:

import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds

column_types = {'a': pa.float64(), 'b': pa.float64(), 'c': pa.float64()}
convert_options = csv.ConvertOptions(column_types=column_types)
custom_csv_format = ds.CsvFileFormat(convert_options=convert_options)
dataset = ds.dataset('/tmp/foo.csv', format=custom_csv_format)
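Adapted to the file from the question, the same format object can be passed to ds.dataset together with the gcsfs filesystem (a sketch; 'a', 'b', 'c' are placeholder column names, and the bucket path and project are the ones from the question):

import gcsfs
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds

fs = gcsfs.GCSFileSystem(project='my-google-cloud-project')

# Force every known column to float64 so nothing gets inferred as int64
# from the first chunk; 'a', 'b', 'c' stand in for the real column names.
column_types = {'a': pa.float64(), 'b': pa.float64(), 'c': pa.float64()}
convert_options = csv.ConvertOptions(column_types=column_types)
custom_csv_format = ds.CsvFileFormat(convert_options=convert_options)

my_dataset = ds.dataset("bucket/foo/bar.csv", format=custom_csv_format, filesystem=fs)
table = my_dataset.to_table()  # the listed columns come back as float64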

If you don't know the column names then things are a bit trickier. However, it sounds like ALL columns are float64. In that case, since you only have one file, you can probably do something like this as a workaround:

dataset = ds.dataset('/tmp/foo.csv', format='csv')
column_types = {}
for field in dataset.schema:
  column_types[field.name] = pa.float64()
# Now use column_types exactly as above
convert_options = csv.ConvertOptions(column_types=column_types)
custom_csv_format = ds.CsvFileFormat(convert_options=convert_options)
dataset = ds.dataset('/tmp/foo.csv', format=custom_csv_format)

This works, but it calls ds.dataset(...) twice, so there is a small bit of overhead: each time we call ds.dataset(...), pyarrow opens the first chunk of the first file in the dataset to determine the schema (which is why we can use dataset.schema above).
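If that extra call bothers you, one cheaper way to get the column names (a sketch, assuming the file has a plain comma-delimited header row with no quoted commas) is to read just the header line through the filesystem and build column_types from it:

import gcsfs

fs = gcsfs.GCSFileSystem(project='my-google-cloud-project')

# Read only the header line instead of letting pyarrow open a whole chunk.
with fs.open("bucket/foo/bar.csv", "r") as f:
    column_names = f.readline().rstrip("\r\n").split(",")

# Then build column_types and the CsvFileFormat exactly as above.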

If you have multiple files with different columns then this approach won't work. In that case I'd recommend emailing the Arrow user@ mailing list so we can have a more general discussion about different ways to solve the problem.
