Get microsecond in PySpark dataframe

I have a dataframe:

from datetime import datetime
data = [
    (1, datetime(2018, 7, 25, 17, 15, 6, 390000)),
    (2, datetime(2018, 7, 25, 11, 12, 49, 317000))
]
df = spark.createDataFrame(data, ['ID', 'max_ts'])
# +---+-----------------------+
# |ID |max_ts                 |
# +---+-----------------------+
# |1  |2018-07-25 17:15:06.39 |
# |2  |2018-07-25 11:12:49.317|
# +---+-----------------------+

I would like to create a column with the microseconds:

+---+-----------------------+------+
|ID |max_ts                 |ms    |
+---+-----------------------+------+
|1  |2018-07-25 17:15:06.39 |390000|
|2  |2018-07-25 11:12:49.317|317000|
+---+-----------------------+------+

In pandas I can do this with:

df_interfax['ms_created_at'] = df_interfax['max_ts'].dt.microsecond

But how can I do it in PySpark?

CodePudding user response:

One option: date_part('s', max_ts) returns the seconds field including its fractional part, so taking the remainder modulo 1 and scaling by 10^6 gives the microseconds:

from pyspark.sql import functions as F

df = df.withColumn('ms', F.expr("date_part('s', max_ts) % 1 * pow(10, 6)"))

df.show(truncate=0)
# +---+-----------------------+--------+
# |ID |max_ts                 |ms      |
# +---+-----------------------+--------+
# |1  |2018-07-25 17:15:06.39 |390000.0|
# |2  |2018-07-25 11:12:49.317|317000.0|
# +---+-----------------------+--------+
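
Note the result is a double (pow produces one), hence the trailing .0. If an integer column is preferred, the expression can be rounded and cast, a minor variant not in the original answer:

# rounding before the cast avoids an off-by-one when the
# decimal-to-double conversion lands just below the true value
df = df.withColumn('ms', F.round(F.expr("date_part('s', max_ts) % 1 * pow(10, 6)")).cast('int'))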

Another option: take the difference between the timestamp and the same timestamp truncated to whole seconds, both expressed in microseconds since the epoch:

df = df.withColumn('ms', F.expr("unix_micros(max_ts) - unix_micros(date_trunc('second', max_ts))"))

df.show(truncate=0)
# +---+-----------------------+------+
# |ID |max_ts                 |ms    |
# +---+-----------------------+------+
# |1  |2018-07-25 17:15:06.39 |390000|
# |2  |2018-07-25 11:12:49.317|317000|
# +---+-----------------------+------+
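
unix_micros is available as a SQL function since Spark 3.1. On older versions, a plain Python UDF mirroring the pandas .dt.microsecond accessor is a workable (if slower) fallback; a minimal sketch, assuming the same df as above:

from pyspark.sql import functions as F, types as T

# Spark passes TimestampType values to Python UDFs as datetime objects,
# so the standard-library .microsecond attribute applies directly
get_us = F.udf(lambda ts: ts.microsecond if ts is not None else None, T.IntegerType())
df = df.withColumn('ms', get_us('max_ts'))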