Below is my input schema.
root
|-- Date: string (nullable = true)
|-- tax: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ID: string (nullable = true)
| | |-- hst: double (nullable = true)
| | |-- gst: double (nullable = true)
My required output is as below.
----------- --------- ------------------ ----------------------
| DATE | ID |hst | gst |
----------- --------- ------------------ ----------------------
|14-Jun-2022| 1 | 100.0| 12.0|
|14-Jun-2022| 2 | 80.0| 6.0|
----------- --------- ------------------ ----------------------
I am using the below code to explode and select the columns from the exploded result.
df = sourceDF.withColumn("exploded_column", explode(col("tax")))
explodeddf = testdf.select(col("DATE"),col("exploded_column.ID"),col("exploded_column.hst"),col("exploded_column.gst"))
I was able to get the output. But I want to know if there is any other way to achieve this if my source dataframe changes frequently.
Thank you.
CodePudding user response:
If the exploded column remains a struct you can use ".*" to get all the fields.
df = sourceDF.withColumn("exploded_column", explode(col("tax")))
explodeddf = testdf.select(col("DATE"),col("exploded_column.*"))
You can dynamically generate varargs to pass to select with the following, it will generate columns. It's similar to '*' but can also be used with a filter to remove columns you wish to not include
df = sourceDF.select( col("DATE"), *[col(column) for column in sourceDF.columns if column != "DATE"] )
I haven't tried combining the two but it might be possible.