I have a PySpark DataFrame like this:
+---+-----+
| id| info|
+---+-----+
|  1|--XX-|
|  2|XX--X|
+---+-----+
The info column encodes one character per month, starting from 2018-01.
I would like to split the info string into its individual characters and create a melted version of the dataframe, with each month added as a row.
The expected output looks like this:
+---+-------+----+
| id|monthid|info|
+---+-------+----+
|  1| 201801|   -|
|  1| 201802|   -|
|  1| 201803|   X|
|  1| 201804|   X|
|  1| 201805|   -|
|  2| 201801|   X|
|  2| 201802|   X|
|  2| 201803|   -|
|  2| 201804|   -|
|  2| 201805|   X|
+---+-------+----+
My data is huge and I want to do this without any loops. I could not think of a way to do this easily in PySpark.
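For reference, the sample frame above can be reproduced with something like this (a minimal sketch, assuming an active SparkSession named spark):

# Hypothetical reconstruction of the sample data shown above
df = spark.createDataFrame(
    [(1, "--XX-"), (2, "XX--X")],
    ["id", "info"],
)
df.show()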
With the wonderful solution from @wwnde, I'm able to generate the results for the example data I shared above. But things get tricky when the info string is more than 12 characters long.
See the sample data and expected result below:
+---+--------------+
| id|          info|
+---+--------------+
|  1|--XX---------X|
|  2|         XX--X|
+---+--------------+
Expected Result,
+---+-------+----+
| id|monthid|info|
+---+-------+----+
|  1| 201801|   -|
|  1| 201802|   -|
|  1| 201803|   X|
|  1| 201804|   X|
|  1| 201805|   -|
|  1| 201806|   -|
|  1| 201807|   -|
|  1| 201808|   -|
|  1| 201809|   -|
|  1| 201810|   -|
|  1| 201811|   -|
|  1| 201812|   -|
|  1| 201901|   -|
|  1| 201902|   X|
|  2| 201801|   X|
|  2| 201802|   X|
|  2| 201803|   -|
|  2| 201804|   -|
|  2| 201805|   X|
+---+-------+----+
Basically, the monthid column should follow calendar months.
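To make the tricky part concrete: simply adding pos to 201801 only works for the first twelve positions, because 201801 + 12 gives 201813 rather than 201901; the arithmetic has to roll over into the next year. A hypothetical helper (not from the original post) sketching the intended pos-to-monthid mapping:

# Hypothetical helper illustrating the intended pos -> monthid mapping
def month_id(pos, start_year=2018, start_month=1):
    total = (start_month - 1) + pos          # months elapsed since January of start_year
    return (start_year + total // 12) * 100 + (total % 12) + 1

print([month_id(p) for p in range(14)])
# [201801, 201802, ..., 201812, 201901, 201902]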
CodePudding user response:
from pyspark.sql.functions import col, lit, posexplode, regexp_replace, split

df = (
    # replace each character in info with itself followed by a comma
    # use the comma to split info into an array
    # posexplode the array
    df.select('id', posexplode(split(regexp_replace(col('info'), r'(?<=.{1})', ','), ',')))
    # compute the monthid by adding 201801 to pos, and drop the trailing empty item
    .withColumn('pos', lit(201801) + col('pos'))
    .filter(col('col') != "")
    # rename columns
    .withColumnRenamed("pos", "monthid")
    .withColumnRenamed("col", "info")
)
df.show()
+---+-------+----+
| id|monthid|info|
+---+-------+----+
|  1| 201801|   -|
|  1| 201802|   -|
|  1| 201803|   X|
|  1| 201804|   X|
|  1| 201805|   -|
|  2| 201801|   X|
|  2| 201802|   X|
|  2| 201803|   -|
|  2| 201804|   -|
|  2| 201805|   X|
+---+-------+----+
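For clarity, the lookbehind (?<=.{1}) matches after every character, so the replacement appends a comma to each one and the split leaves a trailing empty element; that is what the filter on the empty string removes. A small sketch of the intermediate array (illustration only, using the same sample data):

from pyspark.sql.functions import col, regexp_replace, split

# "--XX-" becomes "-,-,X,X,-," after regexp_replace, and splitting on ","
# yields ["-", "-", "X", "X", "-", ""] -- hence the filter on the empty string
df.select(split(regexp_replace(col('info'), r'(?<=.{1})', ','), ',').alias('chars')).show(truncate=False)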
CodePudding user response:
Working from the answer @wwnde provided, I was able to extend it to my full use case of following calendar months.
The key was to use add_months to add the pos column to a start-date month column, followed by a bit of postprocessing. The code goes like this:
from pyspark.sql import functions as F

(
    # replace each character in info with itself followed by a comma
    # use the comma to split info into an array
    # posexplode the array
    df.select('id', F.posexplode(F.split(F.regexp_replace(F.col('info'), r'(?<=.{1})', ','), ',')))
    # create a dummy column holding the start date as a timestamp and remove empty items from col
    .withColumn("month", F.to_timestamp(F.lit("2018-01-01")))
    .filter(F.col('col') != "")
    # use add_months to shift the start month by pos and get the actual calendar month
    .withColumn('month', F.expr("add_months(month, pos)"))
    # extract year and month, left-padding month with 0 to handle single-digit months (1-9)
    .withColumn('monthid', F.concat(F.year('month'), F.lpad(F.month('month'), 2, '0')))
    # rename columns and drop the helper columns
    .withColumnRenamed("col", "info")
    .drop("month", "pos")
).show()
+---+----+-------+
| id|info|monthid|
+---+----+-------+
|  1|   -| 201801|
|  1|   -| 201802|
|  1|   X| 201803|
|  1|   X| 201804|
|  1|   -| 201805|
|  1|   -| 201806|
|  1|   -| 201807|
|  1|   -| 201808|
|  1|   -| 201809|
|  1|   -| 201810|
|  1|   -| 201811|
|  1|   -| 201812|
|  1|   -| 201901|
|  1|   X| 201902|
|  2|   X| 201801|
|  2|   X| 201802|
|  2|   -| 201803|
|  2|   -| 201804|
|  2|   X| 201805|
+---+----+-------+
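A small variation, in case it is useful: the monthid string can also be produced directly with date_format instead of concatenating year and a zero-padded month. This is a sketch of an equivalent expression, not part of the original answer:

from pyspark.sql import functions as F

# Sketch: date_format with the 'yyyyMM' pattern yields the same string as
# concat(year(month), lpad(month(month), 2, '0'))
df2 = df.withColumn("month", F.to_timestamp(F.lit("2018-01-01")))
df2.select(F.date_format("month", "yyyyMM").alias("monthid")).show()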