We have done binary copy of data from Hive to ADLS with checksum validated. While values across every datatype matches however timestamp datatype columns are showing change in value between Hive and Delta(Azure Databricks) tables.
select abcdtstmp from xyz.abc where mn_ID = "sdsdsd-7878-0016"
2018-01-16 00:00:00.0 (on prem)
select abcdtstmp from xyz.abc where mn_ID = "sdsdsd-7878-0016"
2018-01-16T05:00:00.000 0000(DBX)
While checksum and all validation does match, however some values getting added after 'T' is causing concern. Any suggestion would be helpful
CodePudding user response:
This seems to be related to timezone
and hive.
Hive always thinks that timestamps in Parquet files are stored in UTC and it will convert them to a local system time (cluster host time) when it outputs. So, even if you are transferring data from EST to EST, its hive that is the culprit.
You can follow this link if you have hive version higher than 1.2 - https://issues.apache.org/jira/browse/HIVE-9482
set hive.parquet.timestamp.skip.conversion=true
Else, you need to manually convert the data back to EST or whatever timezone you want using below sql.
from_utc_timestamp(to_utc_timestamp(my_dt_tm,'America/New_York'),'America/Denver') AS local_time