How to ignore empty parquet files when reading using Hive

Time:04-18

I am using Hive 3.1.0, and my query reads a batch of Parquet files from a certain path every hour. I have no control over how these files are generated, since they are created by an external process. In rare cases, a Parquet file in that path has zero size. I would like Hive to ignore such files, but my Hive queries fail with the following error:

<filename>.parquet is not a Parquet file (too small length: 0)

How do I avoid this? Too many files may land in an hour, so building automation to detect and delete empty files would be overkill. I believe there should be a simpler option in Hive to make it ignore such files.

CodePudding user response:

Try using the property $file_size. If it is greater than 0, proceed with the data load. It would also help if you could share the query you are using to access the data.

CodePudding user response:

I don't know of a Hive property that does this. As an alternative, you could remove empty files in a separate staging directory before pushing them to final storage, using:

find ./your-directory -type f -empty -print -delete

or, if that is not possible, delete the empty files in your final storage.

List the files to be deleted first as a sanity check.
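The list-then-delete idea can be sketched as two small shell functions. This is only an illustration under some assumptions: the files sit on a local or mounted filesystem (not directly on HDFS), `find` supports `-empty` and `-delete` (GNU and BSD versions do), and the function names `list_empty_parquet` and `delete_empty_parquet` are made up for this example. Unlike the one-liner above, the sketch only matches `*.parquet` files.

```shell
#!/usr/bin/env bash
# Sketch: two-step cleanup of zero-byte Parquet files (dry run, then delete).
# Assumption: the directory is on a local or mounted filesystem, not raw HDFS.
# The function names are hypothetical helpers, not part of Hive or find.

# Dry run: print zero-byte *.parquet files under directory $1 without touching them.
list_empty_parquet() {
    find "$1" -type f -name '*.parquet' -empty -print
}

# Delete the same set of files, printing each path as it is removed.
delete_empty_parquet() {
    find "$1" -type f -name '*.parquet' -empty -print -delete
}
```

A possible way to use it is to run `list_empty_parquet ./your-directory` first, check the output, and only then run `delete_empty_parquet ./your-directory` before the hourly Hive query starts.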
