Combine two dataframes with structs in pyspark


I have two dataframes (A and B) with the following schema

 root
     |-- AUTHOR_ID: integer (nullable = false)
     |-- NAME: string (nullable = true)
     |-- Books: array (nullable = false)
     |    |-- element: struct (containsNull = false)
     |    |    |-- BOOK_ID: integer (nullable = false)
     |    |    |-- Chapters: array (nullable = true) 
     |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |-- NAME: string (nullable = true)
     |    |    |    |    |-- NUMBER_PAGES: integer (nullable = true)

What is the best and cleanest way to combine the two dataframes so that each author's row from A and from B becomes a struct field in a new column, giving a result like this:

     ---------  --------  --------
    |AUTHOR_ID |    A    |    B    |
     ---------  --------  --------
    |    1     |   {}    |   {}    |    keep the nested structs in the new columns

CodePudding user response:

Maybe the best way is to add the new columns with one simple join.

  • Note: we assume that the two dataframes df_A and df_B have the same schema.

      from pyspark.sql import functions as f

      columns = df_A.columns

      # rename the join key so B's fields stay null when no author matches
      b_ = df_B.withColumnRenamed('AUTHOR_ID', 'ref_id')

      result = (df_A.withColumn('A', f.struct(*columns))
                    .select('AUTHOR_ID', 'A')
                    .join(b_, f.col('AUTHOR_ID') == b_['ref_id'], 'left')
                    .withColumn('B', f.when(f.col('ref_id').isNotNull(),
                                            f.struct(*columns))
                                      .otherwise(f.lit(None)))
                    .select('AUTHOR_ID', 'A', 'B'))
    