I have a pyspark dataframe created from XML. Because of the way XML is structured I have an extra, unnecessary level of nesting in the schema of the dataframe.
The schema of my current dataframe:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- movies: struct (nullable = true)
| | | |-- movie: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- b: string (nullable = true)
| | | | | |-- c: string (nullable = true)
| | | | | |-- d: integer (nullable = true)
| | | | | |-- e: string (nullable = true)
| | |-- f: string (nullable = true)
| | |-- g: string (nullable = true)
I'm trying to replace the movies struct with the movie array underneath it as follows:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- movies: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- b: string (nullable = true)
| | | | |-- c: string (nullable = true)
| | | | |-- d: integer (nullable = true)
| | | | |-- e: string (nullable = true)
| | |-- f: string (nullable = true)
| | |-- g: string (nullable = true)
The closest I've gotten was using:
from pyspark.sql import functions as F
df.withColumn("a", F.transform('a', lambda x: x.withField("movies_new", F.col("a.movies.movie"))))
which results in the following schema:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- movies: struct (nullable = true)
| | | |-- movie: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- b: string (nullable = true)
| | | | | |-- c: string (nullable = true)
| | | | | |-- d: integer (nullable = true)
| | | | | |-- e: string (nullable = true)
| | |-- f: string (nullable = true)
| | |-- g: string (nullable = true)
| | |-- movies_new: array (nullable = true)
| | | |-- element: array (containsNull = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- b: string (nullable = true)
| | | | | |-- c: string (nullable = true)
| | | | | |-- d: integer (nullable = true)
| | | | | |-- e: string (nullable = true)
I understand why this is happening, but thought if I never extracted the nested array out of 'a' that it might not become an array of an array.
Any suggestions?
CodePudding user response:
The logic is:
- Explode array "a".
- Recompute new struct as (movies.movie, f, g)
- Collect "a" back as array.
df = df.withColumn("a", F.explode("a"))
df = df.withColumn("a", F.struct( \
df.a.movies.getField("movie").alias("movies"), \
df.a.f.alias("f"), \
df.a.g.alias("g")))
df = df.select(F.collect_list("a").alias("a"))
The full working code:
import pyspark.sql.functions as F
df = spark.createDataFrame(data=[
[[(([("b1", "c1", "d1", "e1")],), "f1", "g1")]]
], schema="a array<struct<movies struct<movie array<struct<b string, c string, d string, e string>>>, f string, g string>>")
df.printSchema()
# df.show(truncate=False)
df = df.withColumn("a", F.explode("a"))
df = df.withColumn("a", F.struct( \
df.a.movies.getField("movie").alias("movies"), \
df.a.f.alias("f"), \
df.a.g.alias("g")))
df = df.select(F.collect_list("a").alias("a"))
df.printSchema()
# df.show(truncate=False)
Output schema before:
root
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- movies: struct (nullable = true)
| | | |-- movie: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- b: string (nullable = true)
| | | | | |-- c: string (nullable = true)
| | | | | |-- d: string (nullable = true)
| | | | | |-- e: string (nullable = true)
| | |-- f: string (nullable = true)
| | |-- g: string (nullable = true)
Output schema after:
root
|-- a: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- movies: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- b: string (nullable = true)
| | | | |-- c: string (nullable = true)
| | | | |-- d: string (nullable = true)
| | | | |-- e: string (nullable = true)
| | |-- f: string (nullable = true)
| | |-- g: string (nullable = true)