Home > OS >  Timestamp data value different between Hive tables and databricks delta tables
Timestamp data value different between Hive tables and databricks delta tables

Time:01-13

We have done binary copy of data from Hive to ADLS with checksum validated. While values across every datatype matches however timestamp datatype columns are showing change in value between Hive and Delta(Azure Databricks) tables.

select abcdtstmp from  xyz.abc where mn_ID = "sdsdsd-7878-0016" 
2018-01-16 00:00:00.0 (on prem)
select abcdtstmp from  xyz.abc where mn_ID = "sdsdsd-7878-0016" 
2018-01-16T05:00:00.000 0000(DBX)

While checksum and all validation does match, however some values getting added after 'T' is causing concern. Any suggestion would be helpful

CodePudding user response:

This seems to be related to timezone and hive.
Hive always thinks that timestamps in Parquet files are stored in UTC and it will convert them to a local system time (cluster host time) when it outputs. So, even if you are transferring data from EST to EST, its hive that is the culprit.

You can follow this link if you have hive version higher than 1.2 - https://issues.apache.org/jira/browse/HIVE-9482 set hive.parquet.timestamp.skip.conversion=true Else, you need to manually convert the data back to EST or whatever timezone you want using below sql.

from_utc_timestamp(to_utc_timestamp(my_dt_tm,'America/New_York'),'America/Denver') AS local_time
  • Related