I am writing a pandas dataframe to parquet files as usual, when suddenly an exception pyarrow.lib.ArrowInvalid is raised, like this:
List child type string overflowed the capacity of a single chunk,
Conversion failed for column image_url with type object
I am using pyarrow 0.17.0 and pandas 1.2.0. I know these are old versions, but I cannot figure out what happened.
What does this "overflowed the capacity of a single chunk" mean?
Might the indicated column image_url contain data that breaks the logic?
CodePudding user response:
When writing parquet, the dataframe must first be converted to an Arrow table. Columns in an Arrow table are chunked arrays, and each chunk is one array.
In Arrow, a single string array must contain less than 2GB of data. What should happen (and it may be fixed in newer versions) is that the string data should be converted to a chunked array with many chunks (each chunk containing less than 2GB), so this error shouldn't happen.
If you can't upgrade then you can slice the dataframe yourself and write in pieces.
If you're still getting this error on the latest version you should file a JIRA ticket.
CodePudding user response:
Thanks for @Pace 's answer. This problem was solved after we upgraded to the latest version, 5.0.0.