PySpark string column breakup based on values


I have a pyspark dataframe like this,

+---+-----+
| id| info|
+---+-----+
|  1|--XX-|
|  2|XX--X|
+---+-----+

The info column encodes information for each month starting from 2018-01, one character per month.

I would like to split the dataframe on each character and create a melted version of it, with each month added as a row.

The expected data looks like this,

+---+-------+----+
| id|monthid|info|
+---+-------+----+
|  1| 201801|   -|
|  1| 201802|   -|
|  1| 201803|   X|
|  1| 201804|   X|
|  1| 201805|   -|
|  2| 201801|   X|
|  2| 201802|   X|
|  2| 201803|   -|
|  2| 201804|   -|
|  2| 201805|   X|
+---+-------+----+

My data is huge and I want to do this avoiding any loops. I couldn't think of a way to do this easily in PySpark.

With the wonderful solution from @wwnde, I'm able to generate the results for the example data I shared above. But things get tricky when the info string is more than 12 characters long.

See the sample data and expected results below,

+---+-------------+
| id|         info|
+---+-------------+
|  1|--XX---------|
|  2|        XX--X|
+---+-------------+

Expected Result,

+---+-------+----+
| id|monthid|info|
+---+-------+----+
|  1| 201801|   -|
|  1| 201802|   -|
|  1| 201803|   X|
|  1| 201804|   X|
|  1| 201805|   -|
|  1| 201806|   -|
|  1| 201807|   -|
|  1| 201808|   -|
|  1| 201809|   -|
|  1| 201810|   -|
|  1| 201811|   -|
|  1| 201812|   -|
|  1| 201901|   -|
|  1| 201902|   X|
|  2| 201801|   X|
|  2| 201802|   X|
|  2| 201803|   -|
|  2| 201804|   -|
|  2| 201805|   X|
+---+-------+----+

Basically, the monthid column should follow calendar months.

CodePudding user response:

from pyspark.sql.functions import col, lit, posexplode, regexp_replace, split

(
    # Replace each character in info with itself followed by a comma,
    # use the comma to split it into an array,
    # then posexplode the array
    df.select('id', posexplode(split(regexp_replace(col('info'), r'(?<=.{1})', ','), ',')))
    # Compute the monthid by adding 201801 to pos, and drop the empty trailing item
    .withColumn('pos', lit(201801) + col('pos'))
    .filter(col('col') != "")
    # Rename columns
    .withColumnRenamed("pos", "monthid")
    .withColumnRenamed("col", "info")
).show()

+---+-------+----+
| id|monthid|info|
+---+-------+----+
|  1| 201801|   -|
|  1| 201802|   -|
|  1| 201803|   X|
|  1| 201804|   X|
|  1| 201805|   -|
|  2| 201801|   X|
|  2| 201802|   X|
|  2| 201803|   -|
|  2| 201804|   -|
|  2| 201805|   X|
+---+-------+----+
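To see what the `regexp_replace`/`split` combination is doing without spinning up a Spark session, here is a plain-Python sketch using the `re` module (the regex semantics are the same; `split_chars` is a hypothetical helper name, not part of the answer's code). The zero-width lookbehind matches after every character, so a comma gets inserted after each one; splitting on commas then yields one array item per character, plus a trailing empty string, which is why the Spark code filters out empty items:

```python
import re

def split_chars(s):
    # Insert a comma after every character (the lookbehind is zero-width,
    # so nothing is consumed), then split on commas.
    with_commas = re.sub(r'(?<=.{1})', ',', s)
    # Drop the empty item produced by the comma after the last character.
    return [c for c in with_commas.split(',') if c != '']

print(split_chars('--XX-'))  # ['-', '-', 'X', 'X', '-']
```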

CodePudding user response:

Working from the answer @wwnde provided, I was able to extend it for my full use case of following calendar months.

The key was to use add_months with the month and pos columns, followed by a bit of postprocessing. The code goes like this,

from pyspark.sql import functions as F

(
    # Replace each character in info with itself followed by a comma,
    # use the comma to split it into an array,
    # then posexplode the array
    df.select('id', F.posexplode(F.split(F.regexp_replace(F.col('info'), r'(?<=.{1})', ','), ',')))
    # Create a dummy column with the start date as a timestamp and remove empty items from col
    .withColumn("month", F.to_timestamp(F.lit("2018-01-01")))
    .filter(F.col('col') != "")
    # Use add_months to shift month by pos, giving the actual calendar month
    .withColumn('month', F.expr("add_months(month, pos)"))
    # Extract year and month, left-padding month with 0 to handle 1-digit months (1-9)
    .withColumn('monthid', F.concat(F.year('month'), F.lpad(F.month('month'), 2, '0')))
    # Rename columns and drop the helper columns
    .withColumnRenamed("col", "info")
    .drop("month", "pos")
).show()


+---+----+-------+
| id|info|monthid|
+---+----+-------+
|  1|   -| 201801|
|  1|   -| 201802|
|  1|   X| 201803|
|  1|   X| 201804|
|  1|   -| 201805|
|  1|   -| 201806|
|  1|   -| 201807|
|  1|   -| 201808|
|  1|   -| 201809|
|  1|   -| 201810|
|  1|   -| 201811|
|  1|   -| 201812|
|  1|   -| 201901|
|  1|   X| 201902|
|  2|   X| 201801|
|  2|   X| 201802|
|  2|   -| 201803|
|  2|   -| 201804|
|  2|   X| 201805|
+---+----+-------+
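For anyone who wants to sanity-check the calendar arithmetic that add_months performs here, this is a plain-Python sketch of the same computation (`monthid` is a hypothetical helper, not part of the Spark code): starting from 2018-01, an offset of pos months rolls over into later years instead of just incrementing the last two digits past 12.

```python
def monthid(pos, start_year=2018, start_month=1):
    # Count months since year 0, add the offset, then split back
    # into year and 0-based month.
    total = (start_year * 12 + (start_month - 1)) + pos
    year, month = divmod(total, 12)
    return f"{year}{month + 1:02d}"

print(monthid(0))   # 201801
print(monthid(11))  # 201812
print(monthid(12))  # 201901  (rolls over into the next year)
print(monthid(13))  # 201902
```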