I have this Spark DataFrame:
+-------------------+-------------------+
|              first|               last|
+-------------------+-------------------+
|2022-11-03 00:00:00|2022-11-06 00:00:00|
+-------------------+-------------------+
I need to get the dates from first (exclusive) to last (inclusive), stepping one day at a time.
My desired output would be:
+-------------------+-------------------+---------------------------------------------------------------+
|first              |last               |array_dates                                                    |
+-------------------+-------------------+---------------------------------------------------------------+
|2022-11-03 00:00:00|2022-11-06 00:00:00|[2022-11-04 00:00:00, 2022-11-05 00:00:00, 2022-11-06 00:00:00]|
+-------------------+-------------------+---------------------------------------------------------------+
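For reproducibility, here is a minimal sketch of how this input could be built (the local SparkSession and the to_timestamp casts are assumptions, not part of the original post):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.to_timestamp

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Sample data with both columns parsed as timestamps
val df = Seq(("2022-11-03 00:00:00", "2022-11-06 00:00:00"))
  .toDF("first", "last")
  .withColumn("first", to_timestamp($"first"))
  .withColumn("last", to_timestamp($"last"))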
CodePudding user response:
Since Spark 2.4, you can use the built-in sequence function as follows:
import org.apache.spark.sql.functions.{col, date_add, expr, sequence}

val result = df.withColumn(
  "array_dates",
  sequence(
    date_add(col("first"), 1),  // start one day after `first` (exclusive lower bound)
    col("last"),                // end at `last` (inclusive upper bound)
    expr("INTERVAL 1 DAY")      // step of one day
  )
)
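The untruncated tables below are what df.show(false) and result.show(false) would print; passing false disables the default truncation of strings longer than 20 characters.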
With the following input df:
+-------------------+-------------------+
|first              |last               |
+-------------------+-------------------+
|2022-11-03 00:00:00|2022-11-06 00:00:00|
+-------------------+-------------------+
you get the following output result:
+-------------------+-------------------+---------------------------------------------------------------+
|first              |last               |array_dates                                                    |
+-------------------+-------------------+---------------------------------------------------------------+
|2022-11-03 00:00:00|2022-11-06 00:00:00|[2022-11-04 00:00:00, 2022-11-05 00:00:00, 2022-11-06 00:00:00]|
+-------------------+-------------------+---------------------------------------------------------------+
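As an aside, if you want the elements of array_dates to be dates rather than midnight timestamps, a sketch under the same Spark 2.4+ assumption is to cast the upper bound to a date; for date arguments, sequence defaults to a one-day step, so the interval argument can be dropped:
import org.apache.spark.sql.functions.{col, date_add, sequence}

// date_add already returns a date; casting `last` to date makes the
// whole sequence resolve to DateType, with a default step of 1 day.
val resultDates = df.withColumn(
  "array_dates",
  sequence(date_add(col("first"), 1), col("last").cast("date"))
)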