df['DATE'].apply(lambda x: x.strftime("%Y%m%d")).astype('float64')
raises the error:
TypeError: 'Column' object is not callable
How would I convert this syntax to the PySpark equivalent?
CodePudding user response:
A simple way to reformat 'yyyy-MM-dd' strings as 'yyyyMMdd':
from pyspark.sql.functions import col, regexp_replace

data = [
    ('2022-08-10', 1),
    ('2022-08-09', 2),
]
df = spark.createDataFrame(data, ['DATE', 'idx'])
df.printSchema()
# root
# |-- DATE: string (nullable = true)
# |-- idx: long (nullable = true)

# strip the dashes, then cast the 'yyyyMMdd' string to a number
df = df.withColumn('DATE', regexp_replace(col('DATE'), '-', '').cast('long'))
df.printSchema()
# root
# |-- DATE: long (nullable = true)
# |-- idx: long (nullable = true)
df.show(10, False)
# +--------+---+
# |DATE    |idx|
# +--------+---+
# |20220810|1  |
# |20220809|2  |
# +--------+---+