Querying Glue Partitions through Athena while being overwritten?

Time:02-19

I have a Glue table on S3 whose partitions are populated by a Glue job that writes with Spark save mode overwrite.

What is the expected behavior of Athena if we query such partitions while they are being overwritten?

CodePudding user response:

Athena picks up new partitions as long as you set enableUpdateCatalog = True when writing. If you only overwrite the contents of existing partitions, Athena can still query the data, provided there is no schema mismatch.

CodePudding user response:

If you rewrite files while queries are running, you may run into errors like "HIVE_FILESYSTEM_ERROR: Incorrect fileSize 1234567 for file".

The reason is that during query planning all the files are listed on S3, and among other things the file sizes are used to divide up the work between the worker nodes. If a file is splittable, which includes file formats like ORC and Parquet, as well as uncompressed text formats (e.g. JSON, CSV), parts of it (called splits) may be processed by different nodes.
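To make the idea of splits concrete, here is a minimal sketch of dividing a file into byte ranges based on its listed size. The function name `plan_splits` and the 500,000-byte split size are assumptions chosen for illustration; this is not Athena's actual planner, only the general technique.

```python
from typing import List, Tuple


def plan_splits(file_size: int, split_size: int) -> List[Tuple[int, int]]:
    """Return (offset, length) byte ranges that together cover the file.

    Hypothetical helper: it only illustrates that the ranges are derived
    from the file size seen at planning time.
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    offset = 0
    while offset < file_size:
        # The last split may be shorter than split_size.
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


# The size here stands for what the S3 listing reported during planning.
planned = plan_splits(file_size=1_234_567, split_size=500_000)
print(planned)
# [(0, 500000), (500000, 500000), (1000000, 234567)]
```

Each range could be handed to a different worker, which is why every worker relies on the size recorded at planning time.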

If the file changes between query planning and query execution, the plan is no longer valid and the query fails.
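The failure mode can be simulated in a few lines. This is a sketch under stated assumptions: `PlannedFile`, `plan_query`, `execute_plan`, and `StalePlanError` are hypothetical names, the dictionaries stand in for S3 listings, and the error text is illustrative rather than Athena's real message.

```python
from dataclasses import dataclass
from typing import Dict, List


class StalePlanError(Exception):
    """Raised when a file no longer matches what the plan recorded."""


@dataclass(frozen=True)
class PlannedFile:
    key: str
    size: int  # size observed when the files were listed for planning


def plan_query(listing: Dict[str, int]) -> List[PlannedFile]:
    # Planning step: record every file together with its current size.
    return [PlannedFile(key, size) for key, size in sorted(listing.items())]


def execute_plan(plan: List[PlannedFile], current: Dict[str, int]) -> int:
    # Execution step: a file whose size differs from the plan (or that is
    # missing) invalidates the plan, so the whole query fails.
    total = 0
    for f in plan:
        actual = current.get(f.key)
        if actual != f.size:
            raise StalePlanError(
                f"planned size {f.size} for {f.key}, found {actual}"
            )
        total += actual
    return total


listing = {"dt=2024-01-01/part-0.parquet": 1_234_567}
plan = plan_query(listing)

# Simulate an overwrite that lands after planning but before execution.
listing["dt=2024-01-01/part-0.parquet"] = 987_654

try:
    execute_plan(plan, listing)
except StalePlanError as err:
    print("query failed:", err)
```

If the overwrite had finished before planning started, the plan and the listing would agree and the query would read the new data without error.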
