I have a dataset like the below:
epoch_seconds | eq_time |
---|---|
1636663343887 | 2021-11-12 02:12:23 |
Now, I am trying to convert the eq_time
to epoch
seconds which should match the value of the first column but am unable to do so. Below is my code:
df = spark.sql("select '1636663343887' as epoch_seconds")
df1 = df.withColumn("eq_time", from_unixtime(col("epoch_seconds") / 1000))
df2 = df1.withColumn("epoch_sec", unix_timestamp(df1.eq_time))
df2.show(truncate=False)
I am getting output like below:
epoch_seconds | eq_time | epoch_sec |
---|---|---|
1636663343887 | 2021-11-12 02:12:23 | 1636663343 |
I tried this link as well but didn't help. My expected
output is that the first and third columns should match each other.
P.S: I am using the Spark 3.1.1
version on local whereas it is Spark 2.4.3
in production, and my end goal would be to run it in production.
CodePudding user response:
Use to_timestamp
instead of from_unixtime
to preserve the milliseconds part when you convert epoch to spark timestamp type.
Then, to go back to timestamp in milliseconds, you can use unix_timestamp
function or by casting to long type, and concatenate the result with the fraction of seconds part of the timestamp that you get with date_format
using pattern S
:
import pyspark.sql.functions as F
df = spark.sql("select '1636663343887' as epoch_ms")
df2 = df.withColumn(
"eq_time",
F.to_timestamp(F.col("epoch_ms") / 1000)
).withColumn(
"epoch_milli",
F.concat(F.unix_timestamp("eq_time"), F.date_format("eq_time", "S"))
)
df2.show(truncate=False)
# ------------- ----------------------- -------------
#|epoch_ms |eq_time |epoch_milli |
# ------------- ----------------------- -------------
#|1636663343887|2021-11-11 21:42:23.887|1636663343887|
# ------------- ----------------------- -------------
CodePudding user response:
To convert between time formats in Python, the datetime.datetime.strptime()
and .strftime()
are useful.
To read the string from eq_time and process into a Python datetime object:
import datetime
t = datetime.datetime.strptime('2021-11-12 02:12:23', '%Y-%m-%d %H:%M:%S')
To print t in epoch_seconds format:
print(t.strftime('%s')
Pandas has date processing functions which work along similar lines: Applying strptime function to pandas series
You could run this on the eq_time column, immediately after extracting the data, to ensure your DataFrame contains the date in the correct format