I have a dataframe like below
a | b | date
1 | 2 | 01/01/2022
2 | 3 | 01/01/2021
3 | 4 | 12/20/2021
I have tried the filter below, but it keeps showing the 01/01/2022 row even though 30/12/2021 is not greater than 01/01/2022.
df.filter(("30/12/2021" > col("date"))
I have also tried casting both sides to dates, but then it returns 0 records.
df.filter("cast(StartDate as date) >= cast('2017-02-03' as date)")
Below is the sample code:
from pyspark.shell import spark
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
data2 = [(1, 2, "01/01/2022"),
         (1, 3, "01/01/2021"),
         (2, 4, "12/20/2021")]
schema = StructType([
    StructField("a", IntegerType(), True),
    StructField("b", IntegerType(), True),
    StructField("date", StringType(), True),
])
df = spark.createDataFrame(data=data2, schema=schema)
df.filter(("30/12/2021" > col("date"))).show()
CodePudding user response:
You are comparing dates as strings, which are compared alphabetically from the left, so 01/01/2022 is less than 30/12/2021 because 0 is less than 3.
You need to convert your string to a date, e.g.:
import datetime
s1 = "30/12/2021"
s2 = "01/01/2022"
print(s1 < s2) # False
d1 = datetime.datetime.strptime(s1, "%d/%m/%Y")
d2 = datetime.datetime.strptime(s2, "%d/%m/%Y")
print(d1 < d2) # True
To do this in PySpark, it looks like you'll need to set your field to DateType and use to_date().
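For example, here is a minimal sketch (assuming the column follows the MM/dd/yyyy format of the sample data, and df is the dataframe built above):
from pyspark.sql.functions import col, to_date, lit
# Parse the string column into a proper DateType column
df2 = df.withColumn("date", to_date(col("date"), "MM/dd/yyyy"))
# Compare dates, not strings; with the sample data this keeps the 01/01/2021 and 12/20/2021 rows
df2.filter(col("date") < to_date(lit("2021-12-30"))).show()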
CodePudding user response:
Your dates are formatted in this order: month, day, year.
Comparing the dates as text won't work (unless you change the order within the string).
A good solution is to transform the dates from strings to datetimes.
df["date"] = df["date"].map(lambda x: datetime.strptime(x, "%m/%d/%Y"))
Dataframe types before (result of df.dtypes):
a int64
b int64
date object
Dataframe types after (result of df.dtypes):
a int64
b int64
date datetime64[ns]
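As a quick check (a sketch assuming the pandas dataframe has been converted as above), the comparison now behaves as expected:
from datetime import datetime
# keep only rows strictly before 30 December 2021
print(df[df["date"] < datetime(2021, 12, 30)])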