Is it ok to use a delta table tracker based on parquet file name in Azure databricks?


Today at work I saw a Delta Lake tracker based on file names. By delta tracker, I mean a function that determines whether a Parquet file has already been ingested or not.

The code would check which files (from the Delta table) have not already been ingested, and the Parquet files in the Delta table would then be read using this: spark.createDataFrame(path, StringType())
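For context, a rough sketch of what such a file-name-based tracker might look like, assuming a Databricks notebook where spark and dbutils are available. The table path, tracker table and column names are hypothetical, and a non-partitioned table is assumed; this is a guess at the pattern, not the actual code I saw:

    # Rough sketch of a file-name-based tracker (hypothetical names/paths).
    delta_path = "/mnt/lake/my_delta_table"

    # File names recorded as ingested on previous runs (hypothetical tracker table)
    already_ingested = {
        row.file_name for row in spark.table("ingestion_tracker").collect()
    }

    # List the Parquet data files currently sitting in the table directory
    current_files = [
        f.path for f in dbutils.fs.ls(delta_path) if f.path.endswith(".parquet")
    ]

    # Anything not seen before is treated as "new" and read directly,
    # bypassing the Delta transaction log entirely
    new_files = [p for p in current_files if p not in already_ingested]
    if new_files:
        df = spark.read.parquet(*new_files)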

Having worked with Delta tables, it does not seem ok to me to use a delta tracker that way.

  • In case a record is deleted, what are the chances that the delta log would point to a new file, and that this deleted record would be read as a new one?

  • In case a record is updated, what are the chances that the delta log would not point to a new file, and that this updated record would not be considered?

  • In case some maintenance is happening on the Delta table, what are the chances that some new files are written out of nowhere, which may cause a record to be re-ingested?

Any observations or suggestions on whether it is OK to work this way would be great. Thank you.

CodePudding user response:

In Delta Lake everything works at the file level, so there are no in-place updates or deletes. Say a single record gets deleted (or updated); then roughly the following happens:

  1. Read in the Parquet file containing the relevant record (plus the other records that happen to be in the same file).
  2. Write out all records except the deleted record into a new Parquet file.
  3. Update the transaction log with a new version, marking the old Parquet file as removed and the new Parquet file as added. Note that the old Parquet file is not physically deleted until you run the VACUUM command.

The process for an update is basically the same.
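To make the file-level behaviour concrete, here is a small sketch (table name hypothetical) showing how a DELETE surfaces in the table history rather than modifying any file in place:

    # Delete a single record (table and predicate are hypothetical)
    spark.sql("DELETE FROM my_delta_table WHERE id = 42")

    # The table history shows a new version for the DELETE operation...
    spark.sql("DESCRIBE HISTORY my_delta_table").select("version", "operation").show()

    # ...and that version's operationMetrics report how many files were
    # removed and how many new files were added (numRemovedFiles / numAddedFiles),
    # even though only one record was deleted.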

To answer your questions more specifically:

In case a record is deleted, what are the chances that the delta log would point to a new file, and that this deleted record would be read as a new one?

The delta log will point to a new file, but the deleted record will not be in it. The new file will contain all the other records that happened to be in the original file.

In case a record is updated, what are the chances that the delta log would not point to a new file, and that this updated record would not be considered?

Files are not updated in place, so this doesn't happen. A new file is written containing the updated record (and any other records from the original file). The transaction log is updated to 'point' to this new file.

In case some maintenance is happening on the Delta table, what are the chances that some new files are written out of nowhere, which may cause a record to be re-ingested?

This is possible, although not 'out of nowhere'. For example, if you run OPTIMIZE, existing Parquet files get reshuffled/combined to improve performance. This means a number of new Parquet files will be written and a new version in the transaction log will point to them. If you pick up all new files after this, you will re-ingest data.
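A sketch of that scenario (table name hypothetical): OPTIMIZE changes no rows, but the set of data files does change, which is exactly what a file-name-based tracker keys on:

    # Compact small files into larger ones; the data itself is unchanged
    spark.sql("OPTIMIZE my_delta_table")

    # The OPTIMIZE version's operationMetrics (numRemovedFiles / numAddedFiles)
    # show old files being replaced by new ones -- a file-name-based tracker
    # would wrongly treat these rewritten files as new data.
    spark.sql("DESCRIBE HISTORY my_delta_table") \
        .filter("operation = 'OPTIMIZE'") \
        .show(truncate=False)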

Some considerations: if your Delta table is append-only, you could use Structured Streaming to read from it instead. If not, Databricks offers Change Data Feed, giving you record-level details of inserts, updates and deletes.
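A rough sketch of both alternatives (paths, table name and starting version are hypothetical):

    # 1. Append-only table: let Structured Streaming track progress via its checkpoint
    stream = spark.readStream.format("delta").load("/mnt/lake/my_delta_table")
    (
        stream.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/lake/_checkpoints/my_ingestion")
        .start("/mnt/lake/my_target_table")
    )

    # 2. Tables with updates/deletes: enable Change Data Feed and read record-level changes
    spark.sql(
        "ALTER TABLE my_delta_table SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
    )
    changes = (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", 5)   # hypothetical starting version
        .table("my_delta_table")
    )
    # Each row carries _change_type: insert, update_preimage, update_postimage or delete

Either way, the transaction log (or the streaming checkpoint) tracks what has been processed for you, so you never need to reason about individual Parquet file names.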
