Combine two dataframes with structs in pyspark-CodePudding

I have two dataframes (A and B) with the following schema

 root
     |-- AUTHOR_ID: integer (nullable = false)
     |-- NAME: string (nullable = true)
     |-- Books: array (nullable = false)
     |    |-- element: struct (containsNull = false)
     |    |    |-- BOOK_ID: integer (nullable = false)
     |    |    |-- Chapters: array (nullable = true) 
     |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |-- NAME: string (nullable = true)
     |    |    |    |    |-- NUMBER_PAGES: integer (nullable = true)

What is the best and the cleaning way to combine the two dataframes and each item will be as a struct field in a new colmun, to get that as a result :

 --------- --------  ------------
|AUTHOR_ID| A         B          |     
 --------- --------   -----------|
|  1      | {}      |   {}       |   keep the nested structs in the new column
|         |         |            |

CodePudding user response：

Maybe the best way is to add some new columns with one simple join

Note: we suppose that the 2 dataframes df_A and df_B have the same schema

  columns = df_A.columns

  b_ = df_B.withColumnRenamed('AUTHOR_ID', 'ref_id')          # to retain null fields when no item match
  return df_A.withColumn('A', f.struct(columns))\
          .select('AUTHOR_ID', 'A')\
          .join(b_, df_A['AUTHOR_ID'] == b_.ref_id, 'left')\
          .withColumn('B', f.when(f.col('ref_id').isNotNull(), f.struct(*[columns])).otherwise(f.lit(None)))\
          .select('AUTHOR_ID', 'A', 'B')