I have an array column containing dates. I want to return a true/false column that indicates whether the array contains any date prior to a given cutoff date.
Example:
Date_Array | Location | ID |
---|---|---|
[4/22/21,3/21/21] | Dallas | 5 |
[5/22/21,6/18/21] | Dallas | 5 |
If my condition is any date prior to 4/01/21, the result should be this:
Date_Array | Location | ID | bool_date |
---|---|---|---|
[4/22/21,3/21/21] | Dallas | 5 | TRUE |
[5/22/21,6/18/21] | Dallas | 5 | FALSE |
Can this be done with Pyspark or Pandas?
CodePudding user response:
This depends on your actual condition, but the overall idea is to compute the max and min dates of the array and use those for the comparison. For "any date prior to a cutoff", it is enough to test whether the array minimum falls before the cutoff:
from datetime import datetime
from pyspark.sql import functions as F

(df
    .withColumn('max_date', F.array_max('date_array'))  # latest date in the array
    .withColumn('min_date', F.array_min('date_array'))  # earliest date in the array
    # any element is prior to the cutoff iff the array minimum is prior to it
    .withColumn('bool_date', F.col('min_date') < F.lit(datetime.fromisoformat('2021-04-01')))
    .show(truncate=False)
)
+------------------------+--------+---+----------+----------+---------+
|date_array              |location|id |max_date  |min_date  |bool_date|
+------------------------+--------+---+----------+----------+---------+
|[2021-04-22, 2021-03-21]|Dallas  |5  |2021-04-22|2021-03-21|true     |
|[2021-05-22, 2021-06-18]|Dallas  |5  |2021-06-18|2021-05-22|false    |
+------------------------+--------+---+----------+----------+---------+
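On Spark 3.1+ you can also skip the helper columns and test the array directly with F.exists, which returns true if any element satisfies the predicate. A minimal sketch of the same check:

from pyspark.sql import functions as F

cutoff = F.lit('2021-04-01').cast('date')

# true if any element of the array is before the cutoff (requires Spark 3.1+)
df = df.withColumn('bool_date', F.exists('date_array', lambda d: d < cutoff))

And since the question also asks about Pandas: assuming df is a pandas DataFrame whose date_array column holds plain Python lists of datetime.date values (a hypothetical setup, adjust to your actual dtypes), any() does the same job:

from datetime import date

cutoff = date(2021, 4, 1)
# true if any date in the list is before the cutoff
df['bool_date'] = df['date_array'].apply(lambda dates: any(d < cutoff for d in dates))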