Home > database >  Function to explode year range in pyspark
Function to explode year range in pyspark

Time:08-25

I have this dataframe in spark:

from pyspark.sql.types import StructType,StructField, StringType, IntegerType
data2 = [("a","2010 - 2012"),
    ("b","from 2020",)
  ]

schema = StructType([ \
    StructField("product",StringType(),True), \
    StructField("reportingYears",StringType(),True)
  ])
 
df = spark.createDataFrame(data=data2,schema=schema)
df.printSchema()
df.display()

which essentially prints this:

 ------- -------------- 
|product|reportingYears|
 ------- -------------- 
|      a|   2010 - 2012|
|      b|     from 2020|
|      c|          2010|
 ------- -------------- 

And need a function to explode both of those dateranges as:

 ------- -------------- 
|product|reportingYears|
 ------- -------------- 
|      a|          2010|
|      a|          2011|
|      a|          2012|
|      b|          2020|
|      b|          2021|
|      b|          2022|
|      c|          2010|
 ------- -------------- 

Not sure if there is something in pyspark that can do that, but would need something like python function. Thanks

edit: the dataset also has values in the reportingyears that have only one year with no range.

CodePudding user response:

You can create start and end years and use sequence function to create an array of the years which can be exploded.

data_sdf. \
    withColumn('repyr_split', func.split('reporting_years', '-')). \
    withColumn('start_yr', 
               func.when(func.col('reporting_years').like('from%'), func.substring('reporting_years', -4, 4).cast('int')).
               otherwise(func.trim(func.col('repyr_split')[0]).cast('int'))
               ). \
    withColumn('end_yr', 
               func.when(func.col('reporting_years').like('from%'), func.year(func.current_date())).
               otherwise(func.trim(func.expr('repyr_split[size(repyr_split) - 1]')).cast('int'))
               ). \
    withColumn('yr_seq', func.expr('sequence(start_yr, end_yr, 1)')). \
    selectExpr('product', 'explode(yr_seq) as reporting_years'). \
    show()

#  ------- --------------- 
# |product|reporting_years|
#  ------- --------------- 
# |      a|           2010|
# |      a|           2011|
# |      a|           2012|
# |      b|           2020|
# |      b|           2021|
# |      b|           2022|
# |      c|           2010|
#  ------- --------------- 

The data before explode looks like the following

 ------- --------------- -------------- -------- ------ ------------------ 
|product|reporting_years|   repyr_split|start_yr|end_yr|            yr_seq|
 ------- --------------- -------------- -------- ------ ------------------ 
|      a|    2010 - 2012|[2010 ,  2012]|    2010|  2012|[2010, 2011, 2012]|
|      b|      from 2020|   [from 2020]|    2020|  2022|[2020, 2021, 2022]|
|      c|           2010|        [2010]|    2010|  2010|            [2010]|
 ------- --------------- -------------- -------- ------ ------------------ 
  • Related