I am using Spark 3.2 and trying to explode data between two months, so that I get a column containing every month between those two dates.
dataSetFromFile = dataSetFromFile.withColumn("MONTH",
    functions.explode(functions.sequence(
        functions.lit("2016-02-01").cast(DataTypes.DateType),
        functions.lit("2016-05-31").cast(DataTypes.DateType),
        functions.expr("INTERVAL 1 month"))));
It gives me this output:
MONTH
2016-02-01
2016-03-01
2016-03-31
2016-04-30
2016-05-31
The expected output is:
MONTH
2016-02-01
2016-03-01
2016-04-01
2016-05-01
Can you please let me know where I went wrong?
CodePudding user response:
The problem is related to the session time zone. When sequence steps over DateType values, the dates pass through timestamps in the session time zone, and a non-UTC offset can push each generated value onto the previous calendar day, which then drifts to month-end dates like 2016-03-31.
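The underlying effect can be illustrated outside Spark with plain java.time: the very same instant corresponds to different calendar dates depending on the zone it is viewed in. This is a minimal sketch (the class name `TimezoneDrift` and the choice of America/Los_Angeles are just for illustration, not part of the original code):

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class TimezoneDrift {
    // Returns the calendar date of the given instant as seen in the given zone
    static LocalDate dateInZone(Instant instant, ZoneId zone) {
        return instant.atZone(zone).toLocalDate();
    }

    public static void main(String[] args) {
        // Midnight UTC on 2016-03-01 -- one of the month boundaries in the sequence
        Instant instant = LocalDate.of(2016, 3, 1)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant();

        // In UTC the date is 2016-03-01, but in a zone behind UTC the same
        // instant still falls on the previous calendar day.
        System.out.println(dateInZone(instant, ZoneOffset.UTC));                   // 2016-03-01
        System.out.println(dateInZone(instant, ZoneId.of("America/Los_Angeles"))); // 2016-02-29
    }
}
```

Once the generated timestamps are interpreted in a zone with a non-zero offset and cast back to dates, exactly this kind of one-day shift appears, which is why both solutions below (keeping TimestampType, or forcing the session zone to UTC) remove the drift.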
Solution 1: Change DateType to TimestampType:
dataSetFromFile = dataSetFromFile.withColumn("MONTH",
    functions.explode(functions.sequence(
        functions.lit("2016-02-01").cast(DataTypes.TimestampType),
        functions.lit("2016-05-31").cast(DataTypes.TimestampType),
        functions.expr("INTERVAL 1 month"))));
Solution 2: Set the session time zone to UTC and keep DateType:
spark.conf.set("spark.sql.session.timeZone", "UTC")
dataSetFromFile = dataSetFromFile.withColumn("MONTH",
    functions.explode(functions.sequence(
        functions.lit("2016-02-01").cast(DataTypes.DateType),
        functions.lit("2016-05-31").cast(DataTypes.DateType),
        functions.expr("INTERVAL 1 month"))));