Determine if an array of Dates includes a certain date by condition in Pyspark


I have an array column containing dates. I want to return a true/false column that indicates whether the array contains any date satisfying a certain condition (e.g., a date before a given cutoff).

Example:

Date_Array          Location  ID
[4/22/21, 3/21/21]  Dallas    5
[5/22/21, 6/18/21]  Dallas    5

If my condition is any date prior to 4/01/21, the result should be this:

Date_Array          Location  ID  bool_date
[4/22/21, 3/21/21]  Dallas    5   TRUE
[5/22/21, 6/18/21]  Dallas    5   FALSE

Can this be done with Pyspark or Pandas?

CodePudding user response:

Depending on your actual condition you can customize the code here, but the overall idea is to get the max and min dates of the array and use them for the comparison: the array contains a date before the cutoff if and only if its minimum date is before the cutoff.

from datetime import datetime

from pyspark.sql import functions as F

(df
    .withColumn('max_date', F.array_max('date_array'))
    .withColumn('min_date', F.array_min('date_array'))
    # a date before the cutoff exists iff the minimum date is before it
    .withColumn('bool_date1', F.col('min_date') < F.lit(datetime.fromisoformat('2021-04-01')))
    .show(truncate=False)
)

+------------------------+--------+---+----------+----------+----------+
|date_array              |location|id |max_date  |min_date  |bool_date1|
+------------------------+--------+---+----------+----------+----------+
|[2021-04-22, 2021-03-21]|Dallas  |5  |2021-04-22|2021-03-21|true      |
|[2021-05-22, 2021-06-18]|Dallas  |5  |2021-06-18|2021-05-22|false     |
+------------------------+--------+---+----------+----------+----------+
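Since the question also asks about Pandas: the same check can be done there by applying Python's `any()` to each list. A minimal sketch, assuming the arrays hold `datetime.date` objects (the column and DataFrame names here are illustrative, not from the original):

```python
from datetime import date

import pandas as pd

df = pd.DataFrame({
    "date_array": [
        [date(2021, 4, 22), date(2021, 3, 21)],
        [date(2021, 5, 22), date(2021, 6, 18)],
    ],
    "location": ["Dallas", "Dallas"],
    "id": [5, 5],
})

cutoff = date(2021, 4, 1)
# True if any date in the list is strictly before the cutoff
df["bool_date"] = df["date_array"].apply(lambda ds: any(d < cutoff for d in ds))

print(df[["location", "id", "bool_date"]])
```

Unlike the Spark version, this scans each list directly rather than going through min/max, which also works for conditions that aren't expressible as a min/max comparison.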