Can you fake the timestamp of DeltaTable history in PySpark?


For testing purposes I want to get the version of a table by timestamp, e.g. using option={'timestampAsOf': '2022-01-01 23:59:59'}.
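For context, here is a minimal sketch of the kind of read I want to test (the table path is hypothetical, and it assumes a Spark session already configured with the Delta Lake package):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the table as it existed at a fixed point in time.
df = (
    spark.read.format("delta")
    .option("timestampAsOf", "2022-01-01 23:59:59")
    .load("/tmp/my_delta_table")  # hypothetical path
)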

But as I understand it, the timestamp in the history is always the actual wall-clock time of the operation.

Can I fake the timestamp, i.e. force it to be some predetermined datetime?

That way I could write better tests!

I tried freezegun.freeze_time, but that didn't get me far (presumably because the commit timestamp is written on the JVM side, which freezegun can't patch)!

Any ideas? Maybe @Denny Lee

CodePudding user response:

You would have to edit the commit log. Each Delta table has a folder called _delta_log that contains the commit log for that table. Within that folder, there's a JSON file for each version of the table. That file contains a "commitInfo" action, which in turn has a "timestamp" property, e.g.

{
  "commitInfo": {
    "timestamp": 12345678 <- set this to something else?
  }
}

You could change that file. There is also a similarly-named crc file that has a "createdTime" property - you might need to change that, too.
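Here is a minimal sketch of that edit in Python. The table path and commit version are hypothetical, the epoch-milliseconds value is an assumption based on how Delta stores commit timestamps, and the os.utime call is a further assumption (some Delta versions resolve timestampAsOf from the commit file's modification time rather than from commitInfo):

import json
import os
from pathlib import Path

# Hypothetical table path and commit version; adjust for your setup.
log_file = Path("/tmp/my_delta_table/_delta_log/00000000000000000000.json")

# Epoch milliseconds for 2022-01-01 00:00:00 UTC.
fake_ts = 1640995200000

# A commit file is newline-delimited JSON, one action per line;
# rewrite only the line carrying the commitInfo action.
lines = log_file.read_text().splitlines()
for i, line in enumerate(lines):
    action = json.loads(line)
    if "commitInfo" in action:
        action["commitInfo"]["timestamp"] = fake_ts
        lines[i] = json.dumps(action)
log_file.write_text("\n".join(lines) + "\n")

# Some Delta versions resolve timestampAsOf from the file's
# modification time, so set that to match as well (assumption).
os.utime(log_file, (fake_ts / 1000, fake_ts / 1000))

After rewriting, a read with .option('timestampAsOf', '2022-01-01 00:00:00') should resolve to that version, though this pokes at Delta's internals and may break across releases.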

AFAIK there is no utility for editing delta commit histories. Seems like an interesting project.

CodePudding user response:

Right now, you cannot change the timestamp of an operation. That said, perhaps open an issue in the Delta Lake GitHub repo and we can work together on something like this.

BTW, we have an incubating project called Delta Acceptance Testing (DAT) which should help with your testing.
