I've got a PySpark dataframe with a column "date" that holds unix time as a float (e.g. 1.63144269E9). When I convert this time to the "yyyy-MM-dd HH:mm:ss.SSS" datetime format, PySpark gives me incorrect values.
For instance, converting the unix time 1631442679.384516 to a datetime, PySpark gives "2021-09-12 12:31:28.000". The hours, seconds and milliseconds are all wrong.
I've tried different PySpark functions, but they didn't give correct values.
For example:
import pyspark.sql.functions as f

j = df.withColumn('epoch', f.from_unixtime(f.col("date"), "yyyy-MM-dd HH:mm:ss.SSS"))
j.select("epoch").show(5, False)
+-----------------------+
|epoch                  |
+-----------------------+
|2021-09-12 12:31:28.000|
|2021-09-12 12:31:28.000|
|2021-09-12 12:31:28.000|
|2021-09-12 12:31:28.000|
|2021-09-12 12:31:28.000|
+-----------------------+
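I suspect precision is part of this: FloatType is a 32-bit float with only about 7 significant digits, so a value as large as 1631442679.384516 gets rounded to the nearest representable float (1631442688.0), which already explains the wrong seconds. Below is a minimal sketch of the difference, using the sample value from above and assuming an existing SparkSession called spark; DoubleType is shown only to illustrate that a wider type keeps the value.

from pyspark.sql.types import StructType, StructField, FloatType, DoubleType

rows = [(1631442679.384516,)]  # sample value from above

# FloatType (32 bits) rounds 1631442679.384516 to 1631442688.0,
# so the seconds are already wrong before any formatting happens.
float_df = spark.createDataFrame(rows, StructType([StructField("date", FloatType())]))

# DoubleType keeps the full value, including the fractional seconds.
double_df = spark.createDataFrame(rows, StructType([StructField("date", DoubleType())]))

float_df.show(truncate=False)
double_df.show(truncate=False)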
When I create the dataframe with a schema where "date" is StringType, the result is shown below.
+-----------------------+
|epoch                  |
+-----------------------+
|2021-09-12 12:31:19.000|
|2021-09-12 12:31:19.000|
|2021-09-12 12:31:19.000|
|2021-09-12 12:31:19.000|
|2021-09-12 12:31:20.000|
+-----------------------+
Now the seconds are right, but the hours and milliseconds are still wrong.
What am I doing wrong?
UPDATE 1.0
I switched FloatType to DecimalType in the dataframe schema and used the following code:
j = df.withColumn('epoch', f.col('date').cast("timestamp"))
j.select("epoch").show(5, False)
Result:
+-----------------------+
|epoch                  |
+-----------------------+
|2021-09-12 12:31:19.385|
|2021-09-12 12:31:19.435|
|2021-09-12 12:31:19.547|
|2021-09-12 12:31:19.571|
|2021-09-12 12:31:20.012|
+-----------------------+
The hour is still wrong. I could use from_utc_timestamp(j.epoch, "GMT-3") and the time would be correct, but that function is too slow. I guess df.withColumn('epoch', f.col('date').cast("timestamp")) uses the time zone of my PC, but I don't know a better way to fix it.
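For reference, that variant would look roughly like this (it just combines the cast and the shift in one step; "GMT-3" is the offset from my case and may need to change for another zone):

j = df.withColumn('epoch', f.from_utc_timestamp(f.col('date').cast("timestamp"), "GMT-3"))
j.select("epoch").show(5, False)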
UPDATE 2.0
I used a simple workaround of subtracting two hours, and it works fine.
j = df.withColumn('epoch', (f.col('date')- 7200).cast("timestamp"))
j.select("epoch").show(5, False)
Result:
+-----------------------+
|epoch                  |
+-----------------------+
|2021-09-12 10:31:19.385|
|2021-09-12 10:31:19.435|
|2021-09-12 10:31:19.547|
|2021-09-12 10:31:19.571|
|2021-09-12 10:31:20.012|
+-----------------------+
CodePudding user response:
According to the from_unixtime documentation
and the default configurations, the default value of spark.sql.session.timeZone
is the local (JVM) time zone, which is why the hours are off here. The correct way to handle it is to change the Spark session time zone to match your expectation.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
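For example, with the session time zone set to the zone you actually expect (UTC is used below just as an illustration), the plain cast from the question should render consistently on any machine:

spark.conf.set("spark.sql.session.timeZone", "UTC")  # or whichever zone you expect
j = df.withColumn('epoch', f.col('date').cast("timestamp"))
j.select("epoch").show(5, False)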