Databricks - How to get the current version of delta table parquet files

Time:12-02

Say I have a table called data and it's some time-series. It's stored like this:

/data
   /date=2022-11-30
      /region=usa
         part-000001.parquet
         part-000002.parquet

Here I have two partition keys (date and region) and two parquet files in that partition. I can easily list the files for a given partition with:

dbutils.fs.ls('/data/date=2022-11-30/region=usa')

But, if I now make an update to the table, it regenerates the parquet files and now I have 4 files in that directory.

How can I retrieve only the parquet files that belong to the latest version? Do I really have to loop through all the _delta_log state files and rebuild the state? Or do I have to run VACUUM to clean up the old versions so that only the most recent files remain?

There has to be a magic function.

CodePudding user response:

Delta Lake itself tracks all of this information in its transaction log. When you query a Delta table with an engine or API that supports Delta Lake, under the covers it reads this transaction log to determine which files make up the requested version of the table.

For your example, say the four files are:

/data
   /date=2022-11-30
      /region=usa
         part-000001.parquet
         part-000002.parquet
         part-000003.parquet
         part-000004.parquet

The Delta transaction log itself contains the path of the files for each table version, e.g.:

# V0 | first version of the table
/data
   /date=2022-11-30
      /region=usa
         part-000001.parquet
         part-000002.parquet

# V1 | second version of the table
/data
   /date=2022-11-30
      /region=usa
         part-000003.parquet
         part-000004.parquet
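
To make the idea concrete, here is a minimal sketch of what a Delta-aware reader does when it replays the log. It relies only on the documented layout of the log (one zero-padded NNN.json commit file per version, one JSON action per line, with add and remove actions carrying a path). The names active_files and demo are hypothetical, the demo log is hand-written and keeps only the path field of each action, and Parquet checkpoints are deliberately ignored, so treat it as an illustration rather than a replacement for a real Delta reader:

```python
import json
import re
import tempfile
from pathlib import Path

# Commit files are named with a zero-padded 20-digit version number,
# e.g. 00000000000000000001.json for version 1.
COMMIT_FILE = re.compile(r"^(\d{20})\.json$")


def active_files(table_path, version=None):
    """Replay the JSON commits under <table_path>/_delta_log and return the
    sorted data file paths (as recorded in the log) for the given version,
    or for the latest version when version is None.

    Simplification (assumption): checkpoint files are ignored, so this only
    works while every JSON commit since version 0 is still in the log.
    """
    log_dir = Path(table_path) / "_delta_log"
    commits = []
    for entry in log_dir.iterdir():
        match = COMMIT_FILE.match(entry.name)
        if match:
            commits.append((int(match.group(1)), entry))
    commits.sort()

    files = set()
    for commit_version, commit_path in commits:
        if version is not None and commit_version > version:
            break
        with commit_path.open() as handle:
            for line in handle:
                if not line.strip():
                    continue
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return sorted(files)


def demo():
    """Build a throwaway two-commit log that mirrors the V0/V1 example
    and return (files at version 0, files at the latest version)."""
    prefix = "date=2022-11-30/region=usa/"
    v0 = [
        {"add": {"path": prefix + "part-000001.parquet"}},
        {"add": {"path": prefix + "part-000002.parquet"}},
    ]
    v1 = [
        {"remove": {"path": prefix + "part-000001.parquet"}},
        {"remove": {"path": prefix + "part-000002.parquet"}},
        {"add": {"path": prefix + "part-000003.parquet"}},
        {"add": {"path": prefix + "part-000004.parquet"}},
    ]
    with tempfile.TemporaryDirectory() as root:
        log_dir = Path(root) / "_delta_log"
        log_dir.mkdir()
        for number, actions in enumerate([v0, v1]):
            body = "\n".join(json.dumps(action) for action in actions)
            (log_dir / f"{number:020d}.json").write_text(body)
        return active_files(root, version=0), active_files(root)
```

Calling demo() returns the two files of V0 followed by the two files of V1, even though all four parquet files would still be sitting in the directory. On a real table the log is periodically compacted into checkpoints, which is one reason to prefer a Delta-aware library over hand-rolled replay.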

To get the list of files programmatically, you can use Delta Standalone if you are on Scala/JVM, or Delta Rust if you want to use its Rust or Python bindings.

If you would like to do it in Spark SQL and/or dive into the details, please check out Diving into Delta Lake: Unpacking the Transaction Log, which includes a video, blog post, and notebook on this topic. There is also a follow-up video called Under the sediments v2.
