I'm trying to explode a DataFrame row with a "yearMonth" column into multiple rows, one for each day of that month.
Here is an example. I want to go from this:
Key | YearMonth |
---|---|
xxx | 202101 |
to this:
Key | YearMonthDay |
---|---|
xxx | 20210101 |
xxx | 20210102 |
xxx | ... |
xxx | 20210131 |
CodePudding user response:
Here is an example using some of the built-in Spark functions, which can be imported with:
import org.apache.spark.sql.functions.{add_months, col, datediff, explode, concat, substring, lit, date_add, sequence, date_format}
import spark.implicits._ // needed for $"..." and toDF; assumes an active SparkSession named `spark`
The main idea of this solution can be described with the following steps:
- Convert the value into a date representing the first day of the month.
- Calculate the number of days in that month.
- Create an array of day offsets, explode it into a separate row per offset, and add each offset to the first day of the month.
- Drop the unnecessary helper columns.
List("202010", "202011")
.toDF("month_short")
.withColumn("month", functions.concat(substring($"month_short", 0, 4), lit("-"), substring($"month_short", 5, 2), lit("-"),lit("01")))
.withColumn("days_to_add", explode(sequence(lit(0), datediff(add_months($"month", 1), $"month") - 1)))
.withColumn("day_in_month", date_format(date_add($"month", $"days_to_add"), "yyyyMMdd"))
.drop("days_to_add","month")
.show(false)
output example:
+-----------+------------+
|month_short|day_in_month|
+-----------+------------+
|202010     |20201001    |
|202010     |20201002    |
|202010     |20201003    |
|202010     |20201004    |
|202010     |20201005    |
|202010     |20201006    |
|202010     |20201007    |
|202010     |20201008    |
|202010     |20201009    |
|202010     |20201010    |
|202010     |20201011    |
|202010     |20201012    |
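Applied to the Key / YearMonth schema from the question, the same approach looks roughly like the sketch below. It is a minimal, untested sketch: it assumes Spark 3.0+ (date_add with a Column argument), an active SparkSession named `spark`, and it swaps the string concatenation above for to_date with a yyyyMM pattern.
import org.apache.spark.sql.functions.{add_months, datediff, date_add, date_format, explode, lit, sequence, to_date}
import spark.implicits._

val result = Seq(("xxx", "202101")).toDF("Key", "YearMonth")
  // parse "202101" into the date 2021-01-01
  .withColumn("month", to_date($"YearMonth", "yyyyMM"))
  // one row per offset 0 .. (days in month - 1)
  .withColumn("days_to_add", explode(sequence(lit(0), datediff(add_months($"month", 1), $"month") - 1)))
  // shift the month start by the offset and format back to yyyyMMdd
  .withColumn("YearMonthDay", date_format(date_add($"month", $"days_to_add"), "yyyyMMdd"))
  .drop("YearMonth", "month", "days_to_add")

result.show(false) // xxx | 20210101 ... xxx | 20210131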
CodePudding user response:
from pyspark.sql import functions as F

df1 = input_df.withColumn(
    "YearMonthDay",
    F.explode(F.sequence(
        F.to_date(F.col("YearMonth").cast("string"), "yyyyMM"),               # first day of the month
        F.last_day(F.to_date(F.col("YearMonth").cast("string"), "yyyyMM"))))  # last day of the month
).drop("YearMonth")
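Note that sequence over two dates produces DateType values, so YearMonthDay comes back as a date column (2021-01-01, 2021-01-02, ...). If you need the 20210101 string layout from the question, you can wrap the exploded value in F.date_format(..., "yyyyMMdd").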