Calculate difference between date column entries and date minimum in PySpark


I feel like this is a stupid question, but I cannot seem to figure it out, so here goes. I have a PySpark DataFrame and one of the columns consists of dates. I want to compute the difference between each date in this column and the minimum date in the column, so that I can filter to the past numberDays days. I've tried several possibilities, but nothing seems to work. Here is my most recent attempt:

# Fails with an analysis error: F.min is an aggregate function, so it cannot be
# evaluated per row inside filter() (wrapping it in F.lit does not change that).
df = df.filter(
        F.datediff(
            F.col("collection_date"),
            F.lit(F.min(F.col("collection_date")))
        ) >= numberDays
    )

But I've also tried:

df_new = df.withColumn("days", df.select("collection_date") - df.select("collection_date").min())

and

df_new = df.withColumn("days", df.select("collection_date") - df.select(F.min("collection_date")))

There's probably a few others, but I can't seem to get this to work, although I'm sure there's an incredibly simple answer.

CodePudding user response:

I found a solution that I don't really care for, but it appears to work.

# Pull the global minimum date back to the driver with collect(), then compare
# each row's collection_date against that literal value.
df = df.filter(
        F.datediff(
            F.col("collection_date"),
            F.lit(df.agg(F.min(df["collection_date"])).collect()[0][0])
        ) >= numberDays
    )

I don't think it's particularly good practice to put a collect() operation in the middle of the code, but this works. If anyone has a more "Sparky" solution, please let me know.
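One alternative that keeps everything inside a single Spark plan is to compute the minimum with a window function over the whole DataFrame, so no collect() round-trip to the driver is needed. This is only a sketch: the column name collection_date and the numberDays variable come from the question, while the temporary min_collection_date column is made up for illustration.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Global window: min() over an empty partitionBy() puts the overall minimum
# date on every row (note this pulls all rows into one partition for that step).
w = Window.partitionBy()

df_filtered = (
    df.withColumn("min_collection_date", F.min("collection_date").over(w))
      .filter(F.datediff(F.col("collection_date"), F.col("min_collection_date")) >= numberDays)
      .drop("min_collection_date")
)

A crossJoin against df.agg(F.min("collection_date")) would achieve the same thing without the single-partition window, at the cost of joining one aggregated row onto every record; either way the minimum stays inside the Spark plan instead of being collected on the driver.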
