Exploding JSON array in Pyspark when source changes frequently


Below is my input schema.

root
 |-- Date: string (nullable = true)
 |-- tax: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ID: string (nullable = true)
 |    |    |-- hst: double (nullable = true)
 |    |    |-- gst: double (nullable = true)

My required output is as follows.

+-----------+---+-----+----+
|       DATE| ID|  hst| gst|
+-----------+---+-----+----+
|14-Jun-2022|  1|100.0|12.0|
|14-Jun-2022|  2| 80.0| 6.0|
+-----------+---+-----+----+
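For reference, here is a minimal sketch that builds a DataFrame matching this schema; the sample values are assumed from the required output above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One date with two tax entries, matching the required output above.
sourceDF = spark.createDataFrame(
    [("14-Jun-2022", [("1", 100.0, 12.0), ("2", 80.0, 6.0)])],
    "Date string, tax array<struct<ID: string, hst: double, gst: double>>",
)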

I am using the code below to explode the array and select columns from the exploded result.

from pyspark.sql.functions import col, explode

df = sourceDF.withColumn("exploded_column", explode(col("tax")))
explodeddf = df.select(col("DATE"), col("exploded_column.ID"), col("exploded_column.hst"), col("exploded_column.gst"))

This gives me the output I want, but I would like to know whether there is a more robust way to achieve it, since my source DataFrame's schema changes frequently.

Thank you.

CodePudding user response:

If the exploded column remains a struct, you can use ".*" to get all of its fields.

from pyspark.sql.functions import col, explode

df = sourceDF.withColumn("exploded_column", explode(col("tax")))
explodeddf = df.select(col("DATE"), col("exploded_column.*"))
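With the sample data above, this should reproduce the required table, and if a new field is later added to the tax struct it will show up as an extra column without any code change.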

You can also generate the varargs passed to select dynamically, as below. It behaves like '*', but you can add a filter to drop any columns you do not want to include.

# Select DATE first, then every remaining top-level column, built dynamically.
df = sourceDF.select(col("DATE"), *[col(column) for column in sourceDF.columns if column != "DATE"])

I haven't tried combining the two, but it might be possible.
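For what it's worth, here is an untested sketch of that combination (sourceDF and the column names are taken from the question): explode the array, dynamically keep every original column except tax, and expand the struct with ".*" so the select keeps working as the source schema changes.

from pyspark.sql.functions import col, explode

df = sourceDF.withColumn("exploded_column", explode(col("tax")))
explodeddf = df.select(
    # Keep every original column except the exploded array itself...
    *[col(c) for c in sourceDF.columns if c != "tax"],
    # ...and expand whatever fields the struct currently contains.
    col("exploded_column.*"),
)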
