I feel like this is a stupid question, but I cannot seem to figure it out, so here goes. I have a PySpark data frame and one of the columns consists of dates. I want to compute the difference between each date in this column and the minimum date in the column, for the purpose of filtering to the past numberDays. I've tried several possibilities but nothing seems to work. Here is my most recent attempt:
df = df.filter(
    F.datediff(
        F.col("collection_date"),
        F.lit(F.min(F.col("collection_date")))
    ) >= numberDays
)
But I've also tried:
df_new = df.withColumn("days", df.select("collection_date") - df.select("collection_date").min())
and
df_new = df.withColumn("days", df.select("collection_date") - df.select(F.min("collection_date")))
There are probably a few others, but I can't seem to get this to work, although I'm sure there's an incredibly simple answer.
CodePudding user response:
I found a solution that I don't really care for, but it appears to work.
df = df.filter(
    F.datediff(
        F.col("collection_date"),
        F.lit(df.agg(F.min(df["collection_date"])).collect()[0][0])
    ) >= numberDays
)
I don't think it's particularly good practice to put a collect() operation in the middle of the code, but this works. If anyone has a more "Sparky" solution, please let me know.
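For what it's worth, here is a minimal collect()-free sketch. It assumes the same df, collection_date column, and numberDays threshold as above; the min_date column and the window w are just illustrative names. Computing the minimum over a window that spans the whole frame keeps the aggregate inside the query plan instead of pulling it back to the driver:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# A window with no partitioning covers every row, so min() becomes a
# per-row column rather than a value collected on the driver.
w = Window.partitionBy()

df = (
    df.withColumn("min_date", F.min("collection_date").over(w))
      .filter(F.datediff(F.col("collection_date"), F.col("min_date")) >= numberDays)
      .drop("min_date")
)
Note that Spark will warn about moving all rows to a single partition for an unpartitioned window; a crossJoin against the one-row result of df.agg(F.min("collection_date")) is another collect()-free variant with the same effect.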