I have this dataframe in spark:
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
data2 = [("a","2010 - 2012"),
("b","from 2020",)
]
schema = StructType([ \
StructField("product",StringType(),True), \
StructField("reportingYears",StringType(),True)
])
df = spark.createDataFrame(data=data2,schema=schema)
df.printSchema()
df.display()
which essentially prints this:
------- --------------
|product|reportingYears|
------- --------------
| a| 2010 - 2012|
| b| from 2020|
| c| 2010|
------- --------------
And need a function to explode both of those dateranges as:
------- --------------
|product|reportingYears|
------- --------------
| a| 2010|
| a| 2011|
| a| 2012|
| b| 2020|
| b| 2021|
| b| 2022|
| c| 2010|
------- --------------
Not sure if there is something in pyspark that can do that, but would need something like python function. Thanks
edit: the dataset also has values in the reportingyears that have only one year with no range.
CodePudding user response:
You can create start and end years and use sequence
function to create an array of the years which can be exploded.
data_sdf. \
withColumn('repyr_split', func.split('reporting_years', '-')). \
withColumn('start_yr',
func.when(func.col('reporting_years').like('from%'), func.substring('reporting_years', -4, 4).cast('int')).
otherwise(func.trim(func.col('repyr_split')[0]).cast('int'))
). \
withColumn('end_yr',
func.when(func.col('reporting_years').like('from%'), func.year(func.current_date())).
otherwise(func.trim(func.expr('repyr_split[size(repyr_split) - 1]')).cast('int'))
). \
withColumn('yr_seq', func.expr('sequence(start_yr, end_yr, 1)')). \
selectExpr('product', 'explode(yr_seq) as reporting_years'). \
show()
# ------- ---------------
# |product|reporting_years|
# ------- ---------------
# | a| 2010|
# | a| 2011|
# | a| 2012|
# | b| 2020|
# | b| 2021|
# | b| 2022|
# | c| 2010|
# ------- ---------------
The data before explode
looks like the following
------- --------------- -------------- -------- ------ ------------------
|product|reporting_years| repyr_split|start_yr|end_yr| yr_seq|
------- --------------- -------------- -------- ------ ------------------
| a| 2010 - 2012|[2010 , 2012]| 2010| 2012|[2010, 2011, 2012]|
| b| from 2020| [from 2020]| 2020| 2022|[2020, 2021, 2022]|
| c| 2010| [2010]| 2010| 2010| [2010]|
------- --------------- -------------- -------- ------ ------------------