How can I create this Spark DataFrame with a timestamp column in one step using Python? Here is how I do it in two steps, using Spark 3.1.2:
from pyspark.sql.functions import *
from pyspark.sql.types import *
schema_sdf = StructType([
StructField("ts", TimestampType(), True),
StructField("myColumn", LongType(), True),
])
sdf = spark.createDataFrame( ( [ ( to_timestamp(lit("2022-06-29 12:01:19.000")), 0 ) ] ), schema=schema_sdf )
Answer:
PySpark does not infer a timestamp type from string values; schema inference leaves them as strings. I usually create the DataFrame first and then cast the column to timestamp:
from pyspark.sql import functions as F
sdf = spark.createDataFrame([("2022-06-29 12:01:19.000", 0 )], ["ts", "myColumn"])
sdf = sdf.withColumn("ts", F.col("ts").cast("timestamp"))
sdf.printSchema()
# root
# |-- ts: timestamp (nullable = true)
# |-- myColumn: long (nullable = true)
The long type was inferred automatically, but the timestamp column needed an explicit cast.
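If you really want a single createDataFrame call, one option is to pass Python datetime objects together with an explicit schema, since a TimestampType field accepts datetime.datetime values. The following is only a minimal sketch: the helper names parse_rows and build_sdf are hypothetical (they are not part of PySpark), and it assumes the input strings follow the "%Y-%m-%d %H:%M:%S.%f" layout used in the question.

```python
from datetime import datetime


def parse_rows(raw_rows, fmt="%Y-%m-%d %H:%M:%S.%f"):
    """Convert (timestamp_string, value) pairs into (datetime, value) pairs.

    Hypothetical helper; the default format is an assumption based on the
    sample value "2022-06-29 12:01:19.000".
    """
    return [(datetime.strptime(ts, fmt), value) for ts, value in raw_rows]


def build_sdf(spark, raw_rows):
    """Build the DataFrame in one createDataFrame call (hypothetical helper)."""
    # Imported here so parse_rows stays usable without a Spark installation.
    from pyspark.sql.types import (
        LongType, StructField, StructType, TimestampType,
    )

    schema = StructType([
        StructField("ts", TimestampType(), True),
        StructField("myColumn", LongType(), True),
    ])
    # The rows already hold datetime objects, so no cast is needed afterwards.
    return spark.createDataFrame(parse_rows(raw_rows), schema=schema)


# Usage, assuming an active SparkSession named `spark`:
# sdf = build_sdf(spark, [("2022-06-29 12:01:19.000", 0)])
# sdf.printSchema()
```

If the values are literals anyway, you can skip the parsing and write datetime(2022, 6, 29, 12, 1, 19) directly in the row tuple.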
On the other hand, even without casting, you can use functions that expect a timestamp as input, because Spark SQL casts the string implicitly:
sdf = spark.createDataFrame([("2022-06-29 12:01:19.000", 0 )], ["ts", "myColumn"])
sdf.printSchema()
# root
# |-- ts: string (nullable = true)
# |-- myColumn: long (nullable = true)
sdf.selectExpr("extract(year from ts)").show()
# +---------------------+
# |extract(year FROM ts)|
# +---------------------+
# |                 2022|
# +---------------------+