Move deeply nested fields one level up in pyspark dataframe-CodePudding

I have a pyspark dataframe created from XML. Because of the way XML is structured I have an extra, unnecessary level of nesting in the schema of the dataframe.

The schema of my current dataframe:

root
|-- a: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- movies: struct (nullable = true)
|    |    |    |-- movie: array (nullable = true)
|    |    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |    |-- b: string (nullable = true)
|    |    |    |    |    |-- c: string (nullable = true)
|    |    |    |    |    |-- d: integer (nullable = true)
|    |    |    |    |    |-- e: string (nullable = true)
|    |    |-- f: string (nullable = true)
|    |    |-- g: string (nullable = true)

I'm trying to replace the movies struct with the movie array underneath it as follows:

root
|-- a: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- movies: array (nullable = true)
|    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |-- b: string (nullable = true)
|    |    |    |    |-- c: string (nullable = true)
|    |    |    |    |-- d: integer (nullable = true)
|    |    |    |    |-- e: string (nullable = true)
|    |    |-- f: string (nullable = true)
|    |    |-- g: string (nullable = true)

The closest I've gotten was using:

from pyspark.sql import functions as F

df.withColumn("a", F.transform('a', lambda x: x.withField("movies_new", F.col("a.movies.movie"))))

which results in the following schema:

root
|-- a: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- movies: struct (nullable = true)
|    |    |    |-- movie: array (nullable = true)
|    |    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |    |-- b: string (nullable = true)
|    |    |    |    |    |-- c: string (nullable = true)
|    |    |    |    |    |-- d: integer (nullable = true)
|    |    |    |    |    |-- e: string (nullable = true)
|    |    |-- f: string (nullable = true)
|    |    |-- g: string (nullable = true)
|    |    |-- movies_new: array (nullable = true)
|    |    |    |-- element: array (containsNull = true)
|    |    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |    |-- b: string (nullable = true)
|    |    |    |    |    |-- c: string (nullable = true)
|    |    |    |    |    |-- d: integer (nullable = true)
|    |    |    |    |    |-- e: string (nullable = true)

I understand why this is happening, but thought if I never extracted the nested array out of 'a' that it might not become an array of an array.

Any suggestions?

CodePudding user response：

The logic is:

Explode array "a".
Recompute new struct as (movies.movie, f, g)
Collect "a" back as array.

df = df.withColumn("a", F.explode("a"))
df = df.withColumn("a", F.struct( \
                                 df.a.movies.getField("movie").alias("movies"), \
                                 df.a.f.alias("f"), \
                                 df.a.g.alias("g")))
df = df.select(F.collect_list("a").alias("a"))

The full working code:

import pyspark.sql.functions as F

df = spark.createDataFrame(data=[
    [[(([("b1", "c1", "d1", "e1")],), "f1", "g1")]]
], schema="a array<struct<movies struct<movie array<struct<b string, c string, d string, e string>>>, f string, g string>>")

df.printSchema()
# df.show(truncate=False)

df = df.withColumn("a", F.explode("a"))
df = df.withColumn("a", F.struct( \
                                 df.a.movies.getField("movie").alias("movies"), \
                                 df.a.f.alias("f"), \
                                 df.a.g.alias("g")))
df = df.select(F.collect_list("a").alias("a"))

df.printSchema()
# df.show(truncate=False)

Output schema before:

root
 |-- a: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- movies: struct (nullable = true)
 |    |    |    |-- movie: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- b: string (nullable = true)
 |    |    |    |    |    |-- c: string (nullable = true)
 |    |    |    |    |    |-- d: string (nullable = true)
 |    |    |    |    |    |-- e: string (nullable = true)
 |    |    |-- f: string (nullable = true)
 |    |    |-- g: string (nullable = true)

Output schema after:

root
 |-- a: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- movies: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- b: string (nullable = true)
 |    |    |    |    |-- c: string (nullable = true)
 |    |    |    |    |-- d: string (nullable = true)
 |    |    |    |    |-- e: string (nullable = true)
 |    |    |-- f: string (nullable = true)
 |    |    |-- g: string (nullable = true)