Home > Mobile >  comparing date strings not working as expected
comparing date strings not working as expected

Time:11-11

I have a dataframe like below

a | b     | date |
1 |  2    | 01/01/2022
2 |  3    | 01/01/2021
3 |  4    |12/20/2021

I have tried the below but it keeps showing the 01/01/2022 date even though 30/12/2021 is not greater than 01/01/2022.

 df.filter(("30/12/2021" > col("date")) 

I have tried casting both to dates and it returns 0 records then.

df.filter("cast(StartDate as date) >= cast('2017-02-03' as date)")

Below is sample code

from pyspark.shell import spark
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data2 = [(1, 2,"01/01/2022"),
        (1, 3,"01/01/2021"),
         ( 2, 4, "12/20/2021"),

         ]

schema = StructType([ \
    StructField("a", IntegerType(), True), \
    StructField("b", IntegerType(), True), \
    StructField("date", StringType(), True), \
    ])

df = spark.createDataFrame(data=data2, schema=schema)


df.filter(("30/12/2021" > col("date"))).show()

CodePudding user response:

You are comparing dates as strings, which will compare alphabetically from the left, so 01/01/2022 is less than 30/12/2021 because 0 is less than 3.

You need to convert your string to a date, e.g.:

import datetime

s1 = "30/12/2021"
s2 = "01/01/2022"
print(s1 < s2)  # False

d1 = datetime.datetime.strptime(s1, "%d/%m/%Y")
d2 = datetime.datetime.strptime(s2, "%d/%m/%Y")
print(d1 < d2)  # True

To do this in pyspark, it looks like you'll need to set your field to DateType and use to_date().

CodePudding user response:

Your dates are formatted in this order : month, day, year.

Compare dates as text won't work (unless you change the order in the string).

A good solution is to transform dates from string to datetime.

df["date"] = df["date"].map(lambda x: datetime.strptime(x, "%m/%d/%Y"))

Dataframe types before (result of df.dtypes) :

a        int64
b        int64
date    object

Dataframe types after (result of df.dtypes) :

a                int64
b                int64
date    datetime64[ns]
  • Related