I am streaming from a Delta table source and my queries keep failing with "A file referenced in the transaction log cannot be found". The weird part is that when I run FSCK REPAIR TABLE table_name DRY RUN to see which files are missing, it returns no results. Why would the streaming query think a file referenced in the transaction log is missing when FSCK REPAIR says none are?
I have also tried running: spark._jvm.com.databricks.sql.transaction.tahoe.DeltaLog.clearCache()
CodePudding user response:
Some possible causes of and solutions for this error:

1. The underlying data files were deleted before the streaming job processed them. To avoid this, set the Spark property spark.sql.files.ignoreMissingFiles to true in the cluster's Spark config, or restart the stream with a new checkpoint directory (a sketch follows below).
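A minimal sketch of both options, assuming a Spark session named spark, the table name table_name from the question, and hypothetical target table and checkpoint paths:

    # Option 1: skip files that are referenced in the log but no longer exist on storage.
    # (Setting this in the cluster's Spark config, as suggested above, is the safer place.)
    spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

    # Option 2: restart the stream from the table's current state with a fresh, empty checkpoint.
    (spark.readStream
        .table("table_name")
        .writeStream
        .format("delta")
        .option("checkpointLocation", "/tmp/checkpoints/table_name_v2")  # new, empty location
        .toTable("target_table"))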
2. The transaction log is not updated with the latest details even though you have performed updates on the Delta table. To overcome this, refresh the table after data loading finishes. You can also run FSCK REPAIR TABLE, which removes from the transaction log of a Delta table any file entries that can no longer be found in the underlying file system (a sketch follows below).
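A minimal sketch, again assuming a Spark session named spark and the table name table_name from the question:

    # Refresh the table's cached metadata after loads complete.
    spark.sql("REFRESH TABLE table_name")

    # Preview which missing-file entries would be removed from the transaction log, then repair.
    spark.sql("FSCK REPAIR TABLE table_name DRY RUN").show(truncate=False)
    spark.sql("FSCK REPAIR TABLE table_name")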
Reference: A file referenced in the transaction log cannot be found
CodePudding user response:
The issue ended up being that the tables were being OPTIMIZE'd and then VACUUM'ed; the streaming job then tried to read data files referenced by the transaction log, but those files had already been removed by VACUUM. The fix was to increase the VACUUM retention period so that it is longer than the lag between the streaming job's latest read and the current state of the table. The reason FSCK reported nothing is that, from the perspective of the latest table version, no files were actually missing; only the older versions the stream was still reading referenced the vacuumed files. A sketch of the fix is below.
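A minimal sketch of that fix, assuming Databricks SQL run through a Spark session named spark and the table name table_name from the question; the 7-day / 168-hour figure is just an example and should be set longer than your stream's maximum lag:

    # Keep vacuumed data files around longer so a lagging stream can still read them.
    spark.sql("""
        ALTER TABLE table_name
        SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 7 days')
    """)

    # If VACUUM is run explicitly (e.g. after OPTIMIZE), retain at least the same window.
    spark.sql("VACUUM table_name RETAIN 168 HOURS")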