Pyspark - filter dataframe based on nested structs


Let's suppose that we have the following DataFrame schema:

root
 |-- AUTHOR_ID: integer (nullable = false)
 |-- NAME: string (nullable = true)
 |-- Books: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- BOOK_ID: integer (nullable = false)
 |    |    |-- Chapters: array (nullable = true) 
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- NAME: string (nullable = true)
 |    |    |    |    |-- NUMBER_PAGES: integer (nullable = true)
  • How can we find the authors that have books with NUMBER_PAGES < 100?

Thanks

CodePudding user response:

According to your data structure, the total number of pages for a given BOOK_ID is the sum of NUMBER_PAGES over its chapters.
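
For a quick, self-contained test, here is a minimal sketch that builds a DataFrame matching the posted schema. The authors, books and page counts are made-up sample values, and the nullability flags will not match exactly, but that does not affect the filtering:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample rows matching the posted schema (values are made up).
data = [
    (1, "Alice", [
        (101, [("Intro", 10), ("Basics", 50)]),       # book 101: 60 pages in total
        (102, [("Advanced", 80), ("Appendix", 40)]),  # book 102: 120 pages in total
    ]),
    (2, "Bob", [
        (201, [("Part 1", 70), ("Part 2", 90)]),      # book 201: 160 pages in total
    ]),
]

ddl = (
    "AUTHOR_ID int, NAME string, "
    "Books array<struct<BOOK_ID:int, Chapters:array<struct<NAME:string, NUMBER_PAGES:int>>>>"
)

df = spark.createDataFrame(data, ddl)
df.printSchema()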

You can use the aggregate function to compute the total number of pages of each book, then filter with the exists function:

from pyspark.sql import functions as F

df1 = df.filter(
    F.exists(
        "Books",
        lambda x: F.aggregate(
            x["Chapters"],                       # chapters of the current book
            F.lit(0),
            lambda a, b: a + b["NUMBER_PAGES"],  # accumulate each chapter's page count
        ) < F.lit(100),
    )
)
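
Applied to the hypothetical sample data sketched above, only authors with at least one book totalling fewer than 100 pages should remain:

# With the made-up sample data, only AUTHOR_ID 1 has a book
# (BOOK_ID 101, 10 + 50 = 60 pages) under the 100-page threshold.
df1.select("AUTHOR_ID", "NAME").show()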

For Spark < 3.1, these higher-order functions are not exposed in the Python API, so you need to call aggregate and exists through expr:

df1 = df.filter(
    F.expr("exists(Books, x -> aggregate(x.Chapters, 0, (a, b) -> a + b.NUMBER_PAGES) < 100)")
)
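
If you want to sanity-check the per-book totals before filtering, a small debugging sketch like the following can help; the BOOK_TOTALS column name is just an illustration:

# Compute a (BOOK_ID, TOTAL_PAGES) struct per book so the sums can be eyeballed.
df.withColumn(
    "BOOK_TOTALS",
    F.expr(
        "transform(Books, x -> named_struct("
        "'BOOK_ID', x.BOOK_ID, "
        "'TOTAL_PAGES', aggregate(x.Chapters, 0, (a, b) -> a + b.NUMBER_PAGES)))"
    ),
).select("AUTHOR_ID", "NAME", "BOOK_TOTALS").show(truncate=False)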