Let's suppose that we have the following DataFrame schema:
root
|-- AUTHOR_ID: integer (nullable = false)
|-- NAME: string (nullable = true)
|-- Books: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- BOOK_ID: integer (nullable = false)
| | |-- Chapters: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- NAME: string (nullable = true)
| | | | |-- NUMBER_PAGES: integer (nullable = true)
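For reference, a small DataFrame close to this schema can be built as follows (the author, book, and chapter values are made up, and nullability flags may differ slightly):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up rows: (AUTHOR_ID, NAME, Books=[(BOOK_ID, Chapters=[(NAME, NUMBER_PAGES), ...]), ...])
data = [
    (1, "Alice", [(10, [("Intro", 30), ("Main", 50)])]),  # book totals 80 pages
    (2, "Bob", [(20, [("Intro", 60), ("Main", 90)])]),    # book totals 150 pages
]
ddl = (
    "AUTHOR_ID INT, NAME STRING, "
    "Books ARRAY<STRUCT<BOOK_ID: INT, "
    "Chapters: ARRAY<STRUCT<NAME: STRING, NUMBER_PAGES: INT>>>>"
)
df = spark.createDataFrame(data, ddl)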
- How can I find the authors that have books with NUMBER_PAGES < 100?
Thanks
CodePudding user response:
According to your data structure, the total number of pages for a given book
is the sum of NUMBER_PAGES over its chapters.
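If you want to sanity-check those per-book totals first (Spark 3.1+; the TOTAL_PAGES column name below is just illustrative), you can materialize them with transform and aggregate:
from pyspark.sql import functions as F

# One total per book, in the same order as the Books array
df_totals = df.withColumn(
    "TOTAL_PAGES",
    F.transform(
        "Books",
        lambda book: F.aggregate(
            book["Chapters"], F.lit(0), lambda acc, ch: acc + ch["NUMBER_PAGES"]
        ),
    ),
)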
To answer the question, you can use the aggregate
function to compute the number of pages for each book, then use filter with the exists
function:
from pyspark.sql import functions as F

# Keep the authors having at least one book whose chapter page counts sum to less than 100
df1 = df.filter(
    F.exists(
        "Books",
        lambda x: F.aggregate(
            x["Chapters"], F.lit(0), lambda acc, ch: acc + ch["NUMBER_PAGES"]
        ) < F.lit(100),
    )
)
For Spark <3.1, you need to use expr
to call the aggregate and exists SQL functions:
df1 = df.filter(
    F.expr("exists(Books, x -> aggregate(x.Chapters, 0, (acc, ch) -> acc + ch.NUMBER_PAGES) < 100)")
)
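Either version should return the same rows. With the made-up sample data above, only the author whose single book totals 30 + 50 = 80 pages (< 100) is kept:
df1.select("AUTHOR_ID", "NAME").show()
# +---------+-----+
# |AUTHOR_ID| NAME|
# +---------+-----+
# |        1|Alice|
# +---------+-----+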