I am writing a "delta" format table to AWS S3. Due to some corrupt data I need to delete rows. I am using enterprise Databricks, which can access the AWS S3 path and has delete permission.
I am trying to delete using the script below:
val p = "s3a://bucket/path1/table_name"
import io.delta.tables._
import org.apache.spark.sql.functions._
val deltaTable = DeltaTable.forPath(spark, p)
deltaTable.delete("date > '2023-01-01'")
But it is not deleting the data in the S3 path where date > '2023-01-01'. I waited for an hour and I still see the data, and I have run the above script multiple times.
So what is wrong here, and how do I fix it?
CodePudding user response:
The DELETE operation does not remove the underlying files from storage; it only removes the rows from the latest version of the Delta table, so the old data files are merely dereferenced in the transaction log. To physically delete the data from storage you have to run a VACUUM command.
Check: https://docs.databricks.com/sql/language-manual/delta-vacuum.html
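For reference, a minimal sketch of the full flow in Scala, reusing the path from the question (deltaTable.vacuum() only removes files older than the retention period, 7 days by default):
import io.delta.tables._

val p = "s3a://bucket/path1/table_name"
val deltaTable = DeltaTable.forPath(spark, p)

// Logical delete: rewrites the affected files and marks the old ones
// as removed in the Delta transaction log, but leaves them on S3.
deltaTable.delete("date > '2023-01-01'")

// Physical delete: removes data files no longer referenced by any
// table version within the retention period (default 7 days).
deltaTable.vacuum()        // default retention
// deltaTable.vacuum(168)  // or pass a retention period in hours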
CodePudding user response:
If you want to delete the data physically from S3, you can use dbutils.fs.rm("path", recurse = true) (note this removes the files outside of the Delta transaction log).
If you just want to delete rows from the table, run spark.sql("DELETE FROM table_name WHERE <condition>"), or use the %sql magic command and run the DELETE statement there; see the sketch below.
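A minimal sketch of the SQL-based delete, assuming table_name is registered in the metastore (for a path-only table the delta.`<path>` syntax also works):
// Logical delete by table name:
spark.sql("DELETE FROM table_name WHERE date > '2023-01-01'")

// Equivalent for a path-based table:
spark.sql("DELETE FROM delta.`s3a://bucket/path1/table_name` WHERE date > '2023-01-01'")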
You can also use the VACUUM command, but the default retention period is 7 days. If you want to remove data files that are newer than 7 days, first disable the retention check with SET spark.databricks.delta.retentionDurationCheck.enabled = false; and then execute the VACUUM command, as in the sketch below.
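A minimal sketch of that, using the path from the question. Be aware that disabling the retention check and vacuuming with RETAIN 0 HOURS can break time travel and any concurrent readers of the table:
// Allow vacuuming below the default 7-day retention (use with care).
spark.sql("SET spark.databricks.delta.retentionDurationCheck.enabled = false")

// Remove all data files not referenced by the current table version.
spark.sql("VACUUM delta.`s3a://bucket/path1/table_name` RETAIN 0 HOURS")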