Home > front end >  Pyspark Struct columns: weird behavior after exploding
Pyspark Struct columns: weird behavior after exploding

Time:11-10

I am working with a nested json file that contains arrays on several levels that I need to explode. I have noticed that the values in a column I created have changed after I exploded an array which is one the same level as the value I wanted to store in said column. Here is an example to visualize:

jsonDF = jsonDF.withColumn("values_level1", explode("data.values"))
               .withColumn("name_level1", col("values_level1.name"))

Until here everything is fine, column "name_level1" contains certain values that I want to filter for at a later stage. The problem begins when I continue with

.withColumn("values_level2", explode("values_level1.values"))

because now the values I had in column "name_level1" have changed and I no longer find the values I want to filter for later.

Is this something expectable which I just havent understood yet conceptually (I am rather new to Spark...)? Can I somehow "conserve" the original values in "name_level1"?

Thanks a lot in advance!

edit: I am adding a picture that might help to understand my problem: enter image description here

Why does "name" change when I explode "values"?

CodePudding user response:

I talked to a colleague and he provided a solution that worked. Using

.withColumn("values_level2", explode_outer("values_level1.values"))

keeps the values I need in column "name_level1"

  • Related