I am using Hive 3.1.0, and my query reads a set of Parquet files from a certain path every hour. I have no control over how these files are generated, because they are created by an external process. In rare cases, a Parquet file in that path has zero size. I would like Hive to ignore such files, but my Hive queries fail with the following error:
<filename>.parquet is not a Parquet file (too small length: 0)
How do I avoid this? Many files can land within an hour, so building automation to detect and delete empty files seems like overkill. I assume Hive has some simpler option to make it ignore such files.
CodePudding user response:
Try using the property $file_size: if it is greater than 0, go ahead with the data load. It would also help if you could share the query you are using to access the data.
CodePudding user response:
I don't know of a Hive property that does this. As a workaround, you could land the files in a separate directory and remove the empty ones before pushing them to final storage, using:
find ./your-directory -type f -empty -print -delete
If that is not possible, delete the empty files directly in your final storage.
Either way, list the files that would be deleted first (run the command without -delete) as a sanity check.
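The list-then-delete check can be sketched as two small shell functions. This is only a minimal sketch under a few assumptions: the function names list_empty_parquet and delete_empty_parquet are made up for illustration, the directory is on a local or mounted filesystem (find does not talk to HDFS directly), and your find supports -empty and -delete (GNU and BSD find do).

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of Hive or Hadoop).
# Assumes a local or mounted filesystem and a find that supports -empty/-delete.

# Print every zero-byte .parquet file under the directory given as $1.
list_empty_parquet() {
    find "$1" -type f -name '*.parquet' -empty -print
}

# Delete every zero-byte .parquet file under the directory given as $1.
delete_empty_parquet() {
    find "$1" -type f -name '*.parquet' -empty -delete
}
```

Run list_empty_parquet ./your-directory first and review the output, then run delete_empty_parquet ./your-directory just before the hourly Hive query. Restricting the match to *.parquet keeps the cleanup from touching unrelated empty marker files.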