How can I run the refresh table command on a Delta table in S3? When I do
deltatable = DeltaTable.forPath(spark, "s3a://test-bucket/delta_table/")
spark.catalog.refreshTable(deltatable)
I am getting the error:
AttributeError: 'DeltaTable' object has no attribute '_get_object_id'
Does the refresh command only work for Hive tables?
Thanks!
CodePudding user response:
Ok, that's simply the wrong function: spark.catalog.refreshTable (doc) refreshes a catalog table's cached metadata inside Spark, and it takes a table name string, not a DeltaTable object. It has nothing to do with recovering a Delta table whose data files were removed.
To fix this on Delta you need a different approach. Unfortunately I'm not 100% sure about the right way for the open-source Delta implementation - on Databricks there is the FSCK REPAIR TABLE SQL command for that. I would try the following (be careful, and make a backup first!):
- If the removed files were in the most recent version, you may try the RESTORE command with spark.sql.files.ignoreMissingFiles set to true.
- If the removed files belonged to a specific partition, you can read the table (again with spark.sql.files.ignoreMissingFiles set to true), keep only the data for those partitions, and write it back in overwrite mode with the replaceWhere option (doc) that contains the matching condition.
- Or you can read the whole Delta table (again with spark.sql.files.ignoreMissingFiles set to true) and write it back in overwrite mode - this will of course duplicate your data, but the old files will be removed by VACUUM.