I have a spark streaming query running in databricks. While loading data from a kafka topic to delta lake, the cell output while running displays "Compute snapshot for version : 3001". I saw this message many times before but it was the first time I'm seeing an abnormally huge number.
What exactly does this message mean ? How should one intrepret what's happening under the hood? Also, does having a high number have any impact on performance of the task ?
CodePudding user response:
From the question I've inferred that you are saving the data into a Delta Lake
format, which by design have a concept of Time Travel, which in principle allows you to track the changes in the underlying data by saving so-called snapshots of the table:
- more about Snaphots - https://books.japila.pl/delta-lake-internals/Snapshot/
- more about Time Travel in Delta Lake - https://docs.delta.io/latest/delta-batch.html#query-an-older-snapshot-of-a-table-time-travel