I have a DataFrame:
from datetime import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    (1, datetime(2018, 7, 25, 17, 15, 6, 390000)),
    (2, datetime(2018, 7, 25, 11, 12, 49, 317000))
]
df = spark.createDataFrame(data, ['ID', 'max_ts'])
# +---+-----------------------+
# |ID |max_ts                 |
# +---+-----------------------+
# |1  |2018-07-25 17:15:06.39 |
# |2  |2018-07-25 11:12:49.317|
# +---+-----------------------+
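For reference, createDataFrame infers LongType for the Python ints and TimestampType for the datetime objects:

df.printSchema()
# root
#  |-- ID: long (nullable = true)
#  |-- max_ts: timestamp (nullable = true)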
I would like to create a column ms with the microseconds part of the timestamp:
+---+-----------------------+------+
|ID |max_ts                 |ms    |
+---+-----------------------+------+
|1  |2018-07-25 17:15:06.39 |390000|
|2  |2018-07-25 11:12:49.317|317000|
+---+-----------------------+------+
In pandas I can do this with:
df_interfax['ms_created_at'] = df_interfax['max_ts'].dt.microsecond
But how can I do it in PySpark?
CodePudding user response:
One option:
from pyspark.sql import functions as F

# date_part('s', ...) returns the seconds including the fraction; `% 1` keeps only the fraction, scaled to microseconds
df = df.withColumn('ms', F.expr("date_part('s', max_ts) % 1 * pow(10, 6)"))
df.show(truncate=0)
# +---+-----------------------+--------+
# |ID |max_ts                 |ms      |
# +---+-----------------------+--------+
# |1  |2018-07-25 17:15:06.39 |390000.0|
# |2  |2018-07-25 11:12:49.317|317000.0|
# +---+-----------------------+--------+
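If the float result is unwanted, the same expression can be cast to an integer type (a small variant sketch of the option above; the cast is an addition, not part of the original expression):

# cast the scaled fraction to bigint so ms comes out as an integer column
df = df.withColumn('ms', F.expr("cast(date_part('s', max_ts) % 1 * pow(10, 6) as bigint)"))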
Another option, using unix_micros (available in Spark 3.1+):

# microseconds since the epoch, minus the same value for the timestamp truncated to the second, leaves only the sub-second part
df = df.withColumn('ms', F.expr("unix_micros(max_ts) - unix_micros(date_trunc('second', max_ts))"))
df.show(truncate=0)
# +---+-----------------------+------+
# |ID |max_ts                 |ms    |
# +---+-----------------------+------+
# |1  |2018-07-25 17:15:06.39 |390000|
# |2  |2018-07-25 11:12:49.317|317000|
# +---+-----------------------+------+
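For completeness, a plain Python UDF mirroring the pandas .dt.microsecond accessor from the question also works, though it is slower than the built-in SQL expressions above (a sketch; the UDF name is arbitrary):

from pyspark.sql import functions as F, types as T

# Spark passes TimestampType values to Python UDFs as datetime.datetime objects,
# so the same .microsecond attribute used in the pandas example applies here
microsecond_udf = F.udf(lambda ts: ts.microsecond if ts is not None else None, T.IntegerType())
df = df.withColumn('ms', microsecond_udf('max_ts'))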