Home > Software engineering >  Converting timestamp to epoch milliseconds in pyspark
Converting timestamp to epoch milliseconds in pyspark

Time:11-14

I have a dataset like the below:

epoch_seconds eq_time
1636663343887 2021-11-12 02:12:23

Now, I am trying to convert the eq_time to epoch seconds which should match the value of the first column but am unable to do so. Below is my code:

df = spark.sql("select '1636663343887' as epoch_seconds")
df1 = df.withColumn("eq_time", from_unixtime(col("epoch_seconds") / 1000))

df2 = df1.withColumn("epoch_sec", unix_timestamp(df1.eq_time))
df2.show(truncate=False)   

I am getting output like below:

epoch_seconds eq_time epoch_sec
1636663343887 2021-11-12 02:12:23 1636663343

I tried this link as well but didn't help. My expected output is that the first and third columns should match each other.

P.S: I am using the Spark 3.1.1 version on local whereas it is Spark 2.4.3 in production, and my end goal would be to run it in production.

CodePudding user response:

Use to_timestamp instead of from_unixtime to preserve the milliseconds part when you convert epoch to spark timestamp type.

Then, to go back to timestamp in milliseconds, you can use unix_timestamp function or by casting to long type, and concatenate the result with the fraction of seconds part of the timestamp that you get with date_format using pattern S:

import pyspark.sql.functions as F

df = spark.sql("select '1636663343887' as epoch_ms")

df2 = df.withColumn(
    "eq_time",
    F.to_timestamp(F.col("epoch_ms") / 1000)
).withColumn(
    "epoch_milli",
    F.concat(F.unix_timestamp("eq_time"), F.date_format("eq_time", "S"))
)

df2.show(truncate=False)
# ------------- ----------------------- ------------- 
#|epoch_ms     |eq_time                |epoch_milli  |
# ------------- ----------------------- ------------- 
#|1636663343887|2021-11-11 21:42:23.887|1636663343887|
# ------------- ----------------------- -------------  

CodePudding user response:

To convert between time formats in Python, the datetime.datetime.strptime() and .strftime() are useful.

To read the string from eq_time and process into a Python datetime object:

import datetime
t = datetime.datetime.strptime('2021-11-12 02:12:23', '%Y-%m-%d %H:%M:%S')

To print t in epoch_seconds format:

print(t.strftime('%s')

Pandas has date processing functions which work along similar lines: Applying strptime function to pandas series

You could run this on the eq_time column, immediately after extracting the data, to ensure your DataFrame contains the date in the correct format

  • Related