I have a column in a dataframe with timestamp data type. The problem is that I need to concatenate it row-wise with a column from another dataframe using union/unionByName, but I know that for this to work the data types must be identical. So, I need to convert this column into an array so that it'll match the other dataframe's column.
I've tried the following approaches, but they won't work:
df_2 = df.withColumn('date', F.lit(datetime.now(), F.collect_set('date')))
df_2 = df.withColumn('date', F.lit(datetime.now())).withColumn('date', F.collect_set('date').alias('date'))
CodePudding user response:
As you said, to do a union/unionByName, the data types must be identical. So you also need to find out what data type is contained in the other dataframe's array column. If it's timestamp (i.e. array of timestamp), you can turn your timestamp column into an array of timestamps like this:
F.array('col_name')
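If you're not sure what element type the other dataframe's array column actually holds, you can inspect its schema first. A minimal sketch, assuming a hypothetical second dataframe df_other with an array column named 'date':
print(df_other.dtypes)
# e.g. [('date', 'array<timestamp>')]
print(df_other.schema['date'].dataType)
# e.g. ArrayType(TimestampType(), True)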
Test input:
from pyspark.sql import functions as F
df = (spark.createDataFrame([('2022-09-24 14:20:20',)], ['date'])
.withColumn('date', F.to_timestamp('date'))
)
df.show()
# +-------------------+
# |               date|
# +-------------------+
# |2022-09-24 14:20:20|
# +-------------------+
print(df.dtypes)
# [('date', 'timestamp')]
Results:
df = df.withColumn('date', F.array('date'))
df.show(truncate=0)
# +---------------------+
# |date                 |
# +---------------------+
# |[2022-09-24 14:20:20]|
# +---------------------+
print(df.dtypes)
# [('date', 'array<timestamp>')]
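Once both columns are array<timestamp>, the union itself should go through. A minimal sketch of that last step, assuming a hypothetical second dataframe df_other built the same way as the test input above:
df_other = (spark.createDataFrame([('2022-09-25 10:00:00',)], ['date'])
    .withColumn('date', F.array(F.to_timestamp('date')))
)

df_union = df.unionByName(df_other)
df_union.show(truncate=0)
# +---------------------+
# |date                 |
# +---------------------+
# |[2022-09-24 14:20:20]|
# |[2022-09-25 10:00:00]|
# +---------------------+
print(df_union.dtypes)
# [('date', 'array<timestamp>')]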