Manually Deleted data file from delta lake

Time:01-24

I manually deleted a data file from a Delta Lake table, and now the command below fails with an error:

mydf = spark.read.format('delta').load('/mnt/path/data')
display(mydf)

Error

A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement. For more information, see https://docs.microsoft.com/azure/databricks/delta/delta-intro#frequently-asked-questions

I have tried restarting the cluster, with no luck. I also tried the settings below:

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.conf.set("spark.databricks.io.cache.enabled", "false")

Any help with repairing the transaction log or fixing the error would be appreciated.

CodePudding user response:

As explained before, you must use vacuum to remove files. Manually deleting files does not update the Delta transaction log, which is what Spark uses to identify which files to read.

In your case you can also use the FSCK REPAIR TABLE command. As per the docs: "Removes the file entries from the transaction log of a Delta table that can no longer be found in the underlying file system. This can happen when these files have been manually deleted."
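A minimal sketch of how that command could be run from a Python notebook cell, assuming a path-based table addressed as delta.`<path>` and the optional DRY RUN clause, which only reports the entries that would be removed. The helper name repair_delta_table is hypothetical, and spark is the session the notebook already provides.

```python
def repair_delta_table(spark, table_path, dry_run=True):
    """Run FSCK REPAIR TABLE on a path-based Delta table.

    `repair_delta_table` is a hypothetical helper, not part of any library.
    With dry_run=True the statement only lists the missing files; with
    dry_run=False it removes their entries from the transaction log.
    """
    stmt = f"FSCK REPAIR TABLE delta.`{table_path}`"
    if dry_run:
        stmt += " DRY RUN"
    # spark.sql returns a DataFrame describing the affected file entries.
    return spark.sql(stmt)
```

In a notebook you would typically call repair_delta_table(spark, '/mnt/path/data') first, inspect the listed files with display(...), and only then rerun it with dry_run=False.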

CodePudding user response:

vacuum won't help here, because it lists the file system and deletes only the files that are not committed in the Delta log. Committed files (files that are defined in the Delta log) won't be deleted.

To fix the issue, you can upload an empty file with exactly the same name as the one you deleted. If you don't know the exact name of the file, you will have to compare the current snapshot of the table, obtained through the Delta API:

val committedFiles = DeltaLog.forTable(spark, dataPath).snapshot.allFiles

with the existing files in the file-system (Object store) and find the one that exists in committedFiles but not in the file-system.
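If you would rather do that comparison without the Scala API, here is a rough Python sketch that replays the JSON commits in _delta_log and reports committed files that are missing on disk. It rests on several assumptions: the table is reachable through a local path (on Databricks that would be something like the /dbfs/... mount), the log stores relative file paths, and no old JSON commits have been cleaned up, since Parquet checkpoints are ignored here. The function names are hypothetical.

```python
import json
import re
from pathlib import Path
from urllib.parse import unquote

# Delta commit files are named with a 20-digit, zero-padded version number.
_COMMIT_NAME = re.compile(r"^\d{20}\.json$")


def committed_files(table_path):
    """Replay the JSON commits and return the set of live data file paths.

    Assumption: Parquet checkpoints are ignored, so the result is only
    complete while every JSON commit is still present in _delta_log.
    """
    log_dir = Path(table_path) / "_delta_log"
    commits = sorted(p for p in log_dir.iterdir() if _COMMIT_NAME.match(p.name))
    live = set()
    for commit in commits:
        with commit.open() as fh:
            for line in fh:
                if not line.strip():
                    continue
                action = json.loads(line)
                # Paths in the log are URI-encoded, so decode them first.
                if "add" in action:
                    live.add(unquote(action["add"]["path"]))
                elif "remove" in action:
                    live.discard(unquote(action["remove"]["path"]))
    return live


def missing_files(table_path):
    """Return committed data files that no longer exist under table_path."""
    root = Path(table_path)
    return sorted(p for p in committed_files(root) if not (root / p).exists())
```

Calling missing_files on the table directory should give you the name (or names) of the deleted file that the transaction log still references.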

CodePudding user response:

The above error indicates that you manually deleted a data file instead of using a proper DELETE statement.

As per the MS docs, you can try the vacuum command. Using the vacuum command should fix the error.

%sql
vacuum 'Your_path'

For more information, refer to this link.
