Pyspark to filter date in format yyyy/mm/dd


I want to filter my data on the DateTime column in the format yyyy-mm-dd. However, it is a string value and there is a timestamp associated with the date. I don't want this timestamp in my column. I am using PySpark for this.

Format of the date: 2021/09/23 09:00:00 00

Desired format: 2021-09-23

from pyspark.sql.functions import to_date, col
df = df_pyspark.withColumn("date_only", to_date(col("DateTime")))  # col name in data is DateTime

The date_only column is showing null values. How should I approach this?

CodePudding user response:

When using the function to_date, you need to pass a format string. The valid pattern letters are described in the official documentation for SimpleDateFormat, which is linked directly from the Spark documentation.
In your case, the format is yyyy/MM/dd HH:mm:ssX:

from pyspark.sql import functions as F

df.withColumn("t", F.to_date("datetime", "yyyy/MM/dd HH:mm:ssX")).show(truncate=False)
+----------------------+----------+
|DateTime              |t         |
+----------------------+----------+
|2021/09/23 09:00:00 00|2021-09-23|
+----------------------+----------+
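
For reference, here is a minimal self-contained sketch of the same conversion. The SparkSession setup and the sample row are assumptions added for illustration; the format string is the one from the answer above:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample row in the same string format as the question
sample_df = spark.createDataFrame([("2021/09/23 09:00:00 00",)], ["DateTime"])

# Parse the string column into a DateType column using the explicit pattern;
# with default settings, strings that do not match the pattern yield null
sample_df.withColumn("date_only", F.to_date(F.col("DateTime"), "yyyy/MM/dd HH:mm:ssX")).show(truncate=False)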

You can then filter on the date:

df.where(F.to_date("datetime", "yyyy/MM/dd HH:mm:ssX") == "2021-09-23").show()
+--------------------+
|            DateTime|
+--------------------+
|2021/09/23 09:00:...|
+--------------------+
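
If you prefer to keep a date_only column (as in the original snippet) and filter on that, a sketch along the same lines, assuming the same df and format string:

df_with_date = df.withColumn("date_only", F.to_date("DateTime", "yyyy/MM/dd HH:mm:ssX"))
df_with_date.where(F.col("date_only") == "2021-09-23").show(truncate=False)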