Suppose we have the following two tables:
 --------- --------
|AUTHOR_ID| NAME   |
 --------- --------
|   102   | Camus  |
|   103   | Hugo   |
 --------- --------

 --------- -------- ------------
|AUTHOR_ID| BOOK_ID| BOOK_NAME  |
 --------- -------- ------------
|   102   |   1    | Etranger   |
|   103   |   2    | Miserables |
 --------- -------- ------------
I want to join the two tables to get a DataFrame with the following schema:
root
 |-- AUTHOR_ID: integer
 |-- NAME: string
 |-- BOOK_LIST: array
 |    |-- element: struct
 |    |    |-- BOOK_ID: integer
 |    |    |-- BOOK_NAME: string
I'm using PySpark. Thanks in advance.
CodePudding user response:
A simple join followed by a group by should do the job:
from pyspark.sql import functions as F

result = (
    df_authors.join(df_books, on=["AUTHOR_ID"], how="left")
    .groupBy("AUTHOR_ID", "NAME")
    .agg(F.collect_list(F.struct("BOOK_ID", "BOOK_NAME")).alias("BOOK_LIST"))
)
In the aggregation, collect_list gathers the (BOOK_ID, BOOK_NAME) structs into one array per author.
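If you want to sanity-check the shape of the result without spinning up a Spark session, the same group-and-collect logic can be sketched in plain Python. The sample rows here are assumptions based on the tables in the question (author IDs 102 and 103):

```python
from collections import defaultdict

# Sample rows mirroring the two tables in the question (assumed IDs)
authors = [(102, "Camus"), (103, "Hugo")]
books = [(102, 1, "Etranger"), (103, 2, "Miserables")]

# Group the (BOOK_ID, BOOK_NAME) "structs" by AUTHOR_ID,
# the way collect_list does inside the Spark aggregation
book_lists = defaultdict(list)
for author_id, book_id, book_name in books:
    book_lists[author_id].append({"BOOK_ID": book_id, "BOOK_NAME": book_name})

# Left join: every author appears once, with its (possibly empty) book list
result = [
    {"AUTHOR_ID": aid, "NAME": name, "BOOK_LIST": book_lists[aid]}
    for aid, name in authors
]
```

Each element of result matches one row of the Spark output: scalar AUTHOR_ID and NAME columns plus a BOOK_LIST array of structs.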