I want to filter my data for Datetime coloumn in the format yyy-mm-dd. However, its string value and there is a timestamp associated with the date. I dont want this timestamp in my coloumn. I am using Pyspark for it.
Format of date- 2021/09/23 09:00:00 00,
Format to be done- 2021-09-23
from pyspark.sql.functions import to_date
df = df_pyspark.withColumn("date_only",to_date(col("DateTime"))) #col name in data is DateTime
The date_only is showing null values. How should I approach here?
CodePudding user response:
When using the fonction to_date
, you need to pass a format string. The format string can be created using the official documentation for simpleDateFormat - avaialble from the spark documentation directly.
In your case, the format is yyyy/MM/dd HH:mm:ssX
:
df.withColumn("t", F.to_date("datetime", "yyyy/MM/dd HH:mm:ssX")).show(truncate=False)
---------------------- ----------
|DateTime |t |
---------------------- ----------
|2021/09/23 09:00:00 00|2021-09-23|
---------------------- ----------
you can then filter on the date :
df.where(F.to_date("datetime", "yyyy/MM/dd HH:mm:ssX") == "2021-09-23").show()
--------------------
| DateTime|
--------------------
|2021/09/23 09:00:...|
--------------------