I have a Date column and an Hour column in a PySpark DataFrame. How do I combine them to produce the Desired_Calculated_Result column?
df1 = sqlContext.createDataFrame(
    [
        ('2021-10-20', '1300', '2021-10-20 13:00:00.000 0000'),
        ('2021-10-20', '1400', '2021-10-20 14:00:00.000 0000'),
        ('2021-10-20', '1500', '2021-10-20 15:00:00.000 0000'),
    ],
    ['Date', 'Hour', 'Desired_Calculated_Result']
)
I also tried:

df1.withColumn("TimeStamp", unix_timestamp(concat_ws(" ", df1.Date, df1.Hour), "yyyy-MM-dd HHmm").cast("timestamp")).show()

but this returned all nulls in the TimeStamp column.
CodePudding user response:
from pyspark.sql.functions import concat, unix_timestamp

df1.withColumn(
    "TimeStamp",
    unix_timestamp(concat(df1.Date, df1.Hour), "yyyy-MM-ddHHmm").cast("timestamp")
).show()
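The point is that the pattern given to unix_timestamp must line up character-for-character with the concatenated string: concat(df1.Date, df1.Hour) yields e.g. "2021-10-201300", which "yyyy-MM-ddHHmm" parses. The same pattern logic can be sanity-checked outside Spark with Python's stdlib datetime (this is just an illustration of the format matching, not Spark code; parse_date_hour is a hypothetical helper):

```python
from datetime import datetime

def parse_date_hour(date_str: str, hour_str: str) -> datetime:
    """Combine a 'yyyy-MM-dd' date and an 'HHmm' hour into a datetime.

    Spark's "yyyy-MM-ddHHmm" pattern corresponds to strptime's
    "%Y-%m-%d%H%M": no separator between the date and the hour,
    exactly as produced by concat(df1.Date, df1.Hour).
    """
    return datetime.strptime(date_str + hour_str, "%Y-%m-%d%H%M")

print(parse_date_hour("2021-10-20", "1300"))  # 2021-10-20 13:00:00
```

Since Spark 2.2 you can also write to_timestamp(concat(df1.Date, df1.Hour), "yyyy-MM-ddHHmm"), which returns a timestamp directly and avoids the unix_timestamp/cast round-trip.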