How can I run the refresh table command on a Delta table in S3? When I do
deltatable = DeltaTable.forPath(spark, "s3a://test-bucket/delta_table/")
spark.catalog.refreshTable(deltatable)
I am getting the error:
AttributeError: 'DeltaTable' object has no attribute '_get_object_id'
Does the refresh command only work for Hive tables?
Thanks!
CodePudding user response:
Ok, that's simply the wrong function: spark.catalog.refreshTable (doc) refreshes a catalog table's cached metadata inside Spark, and it takes a table name string, not a DeltaTable object. It has nothing to do with recovering a Delta table whose data files were removed.
To fix this on Delta you need a different approach. Unfortunately I'm not 100% sure about the right way for the open-source Delta implementation - on Databricks there is the FSCK REPAIR TABLE SQL command for that. I would try the following (be careful, and make a backup first!):
- If the removed files were in the most recent version, you may try the RESTORE command with spark.sql.files.ignoreMissingFiles set to true.
- If the removed files belonged to a specific partition, you can read the table (again with spark.sql.files.ignoreMissingFiles set to true), keep only the data for those partitions, and write it back in overwrite mode with the replaceWhere option (doc) that contains the matching condition.
- Or you can read the whole Delta table (again with spark.sql.files.ignoreMissingFiles set to true) and write it back in overwrite mode - this will of course duplicate your data, but the old files will be removed by VACUUM.