Flatten dataframe with nested struct ArrayType using pyspark


I have a dataframe with this schema

root
  |-- AUTHOR_ID: integer (nullable = false)
  |-- NAME: string (nullable = true)
  |-- Books: array (nullable = false)
  |    |-- element: struct (containsNull = false)
  |    |    |-- BOOK_ID: integer (nullable = false)
  |    |    |-- Chapters: array (nullable = true) 
  |    |    |    |-- element: struct (containsNull = true)
  |    |    |    |    |-- NAME: string (nullable = true)
  |    |    |    |    |-- NUMBER_PAGES: integer (nullable = true)

How can I flatten all the columns into one level with PySpark?

CodePudding user response:

Using the inline function:

df2 = (df.selectExpr("AUTHOR_ID", "NAME", "inline(Books)")   # explode Books into BOOK_ID, Chapters columns
       .selectExpr("*", "inline(Chapters)")                  # explode Chapters into NAME, NUMBER_PAGES columns
       .drop("Chapters")                                     # drop the intermediate array column
       )
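Note that the result has two columns named NAME (the author's and the chapter's). If that is a problem, here is a sketch of one way to disambiguate them while flattening; the AUTHOR_NAME and CHAPTER_NAME aliases are my own choice, not from the question:

df2 = (df.selectExpr("AUTHOR_ID", "NAME as AUTHOR_NAME", "inline(Books)")  # rename the author column first
       .selectExpr("AUTHOR_ID", "AUTHOR_NAME", "BOOK_ID", "inline(Chapters)")  # Chapters is dropped by not selecting it
       .withColumnRenamed("NAME", "CHAPTER_NAME")                          # the remaining NAME comes from the chapter struct
       )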

Or using explode:

from pyspark.sql import functions as F

df2 = (df.withColumn("Books", F.explode("Books"))       # one row per book; Books becomes a struct
       .select("*", "Books.*")                           # promote BOOK_ID and Chapters to top level
       .withColumn("Chapters", F.explode("Chapters"))    # one row per chapter; Chapters becomes a struct
       .select("*", "Chapters.*")                        # promote NAME and NUMBER_PAGES to top level
       .drop("Books", "Chapters")                        # drop the intermediate struct columns
       )
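For reference, a minimal self-contained reproduction of the inline approach, assuming the schema from the question; the sample values and the SparkSession setup are made up for illustration:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Nested Rows let Spark infer the array-of-struct schema from the question
df = spark.createDataFrame([
    Row(AUTHOR_ID=1,
        NAME="Alice",
        Books=[Row(BOOK_ID=10,
                   Chapters=[Row(NAME="Intro", NUMBER_PAGES=12),
                             Row(NAME="Basics", NUMBER_PAGES=30)])]),
])

flat = (df.selectExpr("AUTHOR_ID", "NAME", "inline(Books)")
        .selectExpr("*", "inline(Chapters)")
        .drop("Chapters"))

# One row per chapter, with AUTHOR_ID, NAME, BOOK_ID, NAME, NUMBER_PAGES
flat.show(truncate=False)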