Suppose we have the following two tables:
 --------- --------
|AUTHOR_ID| NAME   |
 --------- --------
|   102   | Camus  |
|   103   | Hugo   |
 --------- --------

 --------- -------- ------------
|AUTHOR_ID| BOOK_ID| BOOK_NAME  |
 --------- -------- ------------
|   102   |   1    | Etranger   |
|   103   |   2    | Miserables |
 --------- -------- ------------
I want to join the two tables to get a DataFrame with the following schema:
root
 |-- AUTHOR_ID: integer
 |-- NAME: string
 |-- BOOK_LIST: array
 |    |-- element: struct
 |    |    |-- BOOK_ID: integer
 |    |    |-- BOOK_NAME: string
I'm using PySpark. Thanks in advance.
CodePudding user response:
A simple join followed by a group by should do the job:
from pyspark.sql import functions as F

result = (
    df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
    .groupBy("AUTHOR_ID", "NAME")
    .agg(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")).alias("BOOK_LIST"))
)
In the aggregation, collect_list gathers the (BOOK_ID, BOOK_NAME) structs into one array per author.
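If you want to sanity-check the shape of the result without spinning up a Spark session, the same group-and-collect logic can be sketched in plain Python. The sample rows here are assumptions based on the tables in the question (author IDs 102 and 103):

```python
from collections import defaultdict

# Sample rows mirroring the two tables in the question (assumed IDs)
authors = [(102, "Camus"), (103, "Hugo")]
books = [(102, 1, "Etranger"), (103, 2, "Miserables")]

# Group the (BOOK_ID, BOOK_NAME) "structs" by AUTHOR_ID,
# the way collect_list does inside the Spark aggregation
book_lists = defaultdict(list)
for author_id, book_id, book_name in books:
    book_lists[author_id].append({"BOOK_ID": book_id, "BOOK_NAME": book_name})

# Left join: every author appears once, with its (possibly empty) book list
result = [
    {"AUTHOR_ID": aid, "NAME": name, "BOOK_LIST": book_lists[aid]}
    for aid, name in authors
]
```

Each element of result matches one row of the Spark output: scalar AUTHOR_ID and NAME columns plus a BOOK_LIST array of structs.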