I'm trying to explode a DataFrame row with a "yearMonth" column into multiple rows, one for each day of that month.
Here is an example. I want to go from this:
Key | YearMonth |
---|---|
xxx | 202101 |
to this:
Key | YearMonthDay |
---|---|
xxx | 20210101 |
xxx | 20210102 |
xxx | ... |
xxx | 20210131 |
CodePudding user response:
Here is an example using some of the built-in Spark functions, which can be imported with:
import org.apache.spark.sql.functions.{add_months, col, datediff, explode, concat, substring, lit, date_add, sequence, date_format}
import spark.implicits._ // needed for $"..." and toDF; assumes an active SparkSession named `spark`
The main idea of this solution can be described with the following steps:
- Convert the value into a date representing the first day of the month.
- Calculate the number of days in that month.
- Create an array of day offsets, explode it into a separate row per offset, and add each offset to the first day of the month.
- Drop the unnecessary helper columns.
List("202010", "202011")
.toDF("month_short")
.withColumn("month", functions.concat(substring($"month_short", 0, 4), lit("-"), substring($"month_short", 5, 2), lit("-"),lit("01")))
.withColumn("days_to_add", explode(sequence(lit(0), datediff(add_months($"month", 1), $"month") - 1)))
.withColumn("day_in_month", date_format(date_add($"month", $"days_to_add"), "yyyyMMdd"))
.drop("days_to_add","month")
.show(false)
output example:
+-----------+------------+
|month_short|day_in_month|
+-----------+------------+
|202010     |20201001    |
|202010     |20201002    |
|202010     |20201003    |
|202010     |20201004    |
|202010     |20201005    |
|202010     |20201006    |
|202010     |20201007    |
|202010     |20201008    |
|202010     |20201009    |
|202010     |20201010    |
|202010     |20201011    |
|202010     |20201012    |
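Applied to the Key / YearMonth schema from the question, the same approach looks roughly like the sketch below. It is a minimal, untested sketch: it assumes Spark 3.0+ (date_add with a Column argument), an active SparkSession named `spark`, and it swaps the string concatenation above for to_date with a yyyyMM pattern.
import org.apache.spark.sql.functions.{add_months, datediff, date_add, date_format, explode, lit, sequence, to_date}
import spark.implicits._

val result = Seq(("xxx", "202101")).toDF("Key", "YearMonth")
  // parse "202101" into the date 2021-01-01
  .withColumn("month", to_date($"YearMonth", "yyyyMM"))
  // one row per offset 0 .. (days in month - 1)
  .withColumn("days_to_add", explode(sequence(lit(0), datediff(add_months($"month", 1), $"month") - 1)))
  // shift the month start by the offset and format back to yyyyMMdd
  .withColumn("YearMonthDay", date_format(date_add($"month", $"days_to_add"), "yyyyMMdd"))
  .drop("YearMonth", "month", "days_to_add")

result.show(false) // xxx | 20210101 ... xxx | 20210131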
CodePudding user response:
from pyspark.sql import functions as F

df1 = input_df.withColumn(
    "YearMonthDay",
    F.explode(F.sequence(
        F.to_date(F.col("YearMonth").cast("string"), "yyyyMM"),               # first day of the month
        F.last_day(F.to_date(F.col("YearMonth").cast("string"), "yyyyMM"))))  # last day of the month
).drop("YearMonth")
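Note that sequence over two dates produces DateType values, so YearMonthDay comes back as a date column (2021-01-01, 2021-01-02, ...). If you need the 20210101 string layout from the question, you can wrap the exploded value in F.date_format(..., "yyyyMMdd").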