Home > Software design >  Explode single DataFrame row into multiple ones across Year-Month column?
Explode single DataFrame row into multiple ones across Year-Month column?

Time:04-08

I'm trying to explode a dataframe row made of a "yearMonth" column into multiple rows where each one is a day of that month

this is the example. I want to go from this:

Key YearMonth
xxx 202101

to this

Key YearMonthDay
xxx 20210101
xxx 20210102
xxx ...
xxx 20210131

CodePudding user response:

Here is an example using some of the built-in spark functions that can be imported using:

import org.apache.spark.sql.functions.{add_months, col, datediff, explode, concat, substring, lit, date_add, sequence, date_format}

The main idea of this solution can be described with the following steps:

  1. convert the values into a date format as the first date of the month.
  2. calculate the number of days in the given month.
  3. create an array of days to add that will be exploded to a separate row each and will be added to the start date of the month
  4. drop unnecessary columns
List("202010", "202011")
  .toDF("month_short")
  .withColumn("month", functions.concat(substring($"month_short", 0, 4), lit("-"), substring($"month_short", 5, 2), lit("-"),lit("01")))
  .withColumn("days_to_add", explode(sequence(lit(0), datediff(add_months($"month", 1), $"month") - 1)))
  .withColumn("day_in_month", date_format(date_add($"month", $"days_to_add"), "yyyyMMdd"))
  .drop("days_to_add","month")
  .show(false)

output example:

 ----------- ------------ 
|month_short|day_in_month|
 ----------- ------------ 
|202010     |20201001    |
|202010     |20201002    |
|202010     |20201003    |
|202010     |20201004    |
|202010     |20201005    |
|202010     |20201006    |
|202010     |20201007    |
|202010     |20201008    |
|202010     |20201009    |
|202010     |20201010    |
|202010     |20201011    |
|202010     |20201012    |

CodePudding user response:

df1=input_df.withColumn("YearMonthDay", F.explode(F.sequence(F to_date(F.col("YearMonth").cast("string"), "yyyyMM"), F.last_day(F.to_date(F col("YearMonth").cast("string"), "yyyyMM"))))) .drop("YearMonth")

  • Related