I am trying to fetch data from a table; the query returns a list of `Row` objects containing `datetime.date` values. I would like to have them as a list of varchar/string values instead.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def fetch_device_dates():
    query = "select device_date from device where device is not null"
    res = spark.sql(query).collect()
    if len(res) != 0:
        return res[:20]
The returned value is in the following format:
[Row(device_date =datetime.date(2019, 9, 25)), Row(device_date =datetime.date(2019, 9, 17)), Row(device_date =datetime.date(2020, 1, 8))]
I would like to have the following output returned instead:
['2019-09-25','2019-09-17','2020-01-08']
Please advise.
CodePudding user response:
I suggest you use the date_format
function beforehand.
Here is the documentation, but basically:
>>> from pyspark.sql.functions import date_format
>>> df.select(date_format('device_date', 'yyyy-MM-dd').alias('date')).collect()
[Row(date='2015-04-08')]
Note that the pattern is case-sensitive: lowercase `yyyy` is the year and uppercase `MM` is the month (`mm` would be minutes, and `YYYY` is the week-based year).
Also, be careful: your column name "device_date " seems to have a trailing space. That could be making your life harder.
CodePudding user response:
Are you sure you want to collect your data and then process it in Python? Depending on the answer:
YES (python solution)
out = df.collect()
[x.device_date.strftime("%Y-%m-%d") for x in out]
['2019-09-25', '2019-09-17', '2020-01-08']
NO (Spark solution)
from pyspark.sql import functions as F

df.select(F.date_format("device_date", "yyyy-MM-dd").alias("device_date")).collect()
[Row(device_date='2019-09-25'),
Row(device_date='2019-09-17'),
Row(device_date='2020-01-08')]
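Note that `collect()` still hands back `Row` objects, so if you want the plain string list from the question, add one final comprehension. Here is a minimal sketch using `collections.namedtuple` as a stand-in for `pyspark.sql.Row` (they behave alike for attribute access), so it runs without a Spark session:

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row, which also behaves like a named tuple
Row = namedtuple("Row", ["device_date"])

# What the Spark solution's collect() would return after date_format
rows = [
    Row("2019-09-25"),
    Row("2019-09-17"),
    Row("2020-01-08"),
]

# Pull the column out by attribute (a real Row also supports r["device_date"])
dates = [r.device_date for r in rows]
print(dates)  # ['2019-09-25', '2019-09-17', '2020-01-08']
```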