Home > database >  Pyspark stack to fill existing column
Pyspark stack to fill existing column

Time:09-21

I have a data frame like below

col1, col2 ... coln, key, value

I got the key, value after exploding a col with format map<string, string>

Now I want to stack col2 to coln and insert them in the key, value columns. I want the previous values in key, value cols to remain as it is.

How to achieve this? One solution I can think of is pivot the key, value column and then create the stack expression. Anything better and faster than this?

Edit - Input input

Output

Output

CodePudding user response:

You can add the new elements within the map column before exploding it. It will be easier to process.

Here's an example using a sample data similar to your sample's structure.

data_sdf.show()

#  --- --- --- --- ------------------------ 
# |id |c1 |c2 |c3 |c4                      |
#  --- --- --- --- ------------------------ 
# |abc|1  |2  |3  |{key_b -> 5, key_a -> 4}|
#  --- --- --- --- ------------------------ 

col_list = ['c1', 'c2', 'c3']

data_sdf. \
    withColumn('col2structarr', 
               func.array(*[func.struct(func.lit(k).alias('key'), func.col(k).alias('value'))
                            for k in col_list]
                          )
               ). \
    withColumn('map2structarr', func.map_entries('c4')). \
    withColumn('structarrs_merged', func.array_union('map2structarr', 'col2structarr')). \
    withColumn('mergedstructarr2map', func.map_from_entries('structarrs_merged')). \
    selectExpr('id', 'explode(mergedstructarr2map)'). \
    show(truncate=False)

#  --- ----- ----- 
# |id |key  |value|
#  --- ----- ----- 
# |abc|key_b|5    |
# |abc|key_a|4    |
# |abc|c1   |1    |
# |abc|c2   |2    |
# |abc|c3   |3    |
#  --- ----- ----- 
  • a map column is like a dictionary and has key-value pairs. So, when it is converted to array of structs, it generates 2 fields within the structs - key, value.
  • map2structarr - convert the map column to array of structs, so that we can use array_union for the new element addition.
  • col2structarr - using the aforementioned info on map columns and array of structs, we can create an array of structs for the columns that we need to add in the map column.
  • structarrs_merged - merge the 2 arrays created above (1 from map column, and 1 from the columns we want to add) using array_join. The function requires the arrays to be of same data type.
  • mergedstructarr2map - an array of strut can be converted to map type column using map_from_entries.
  • use explode to create key and value columns from the final map type column.

Here is a snapshot of all the individual columns created above (before the explode)

 --- --- --- --- ------------------------ --------------------------- ------------------------ --------------------------------------------------- --------------------------------------------------- 
|id |c1 |c2 |c3 |c4                      |col2structarr              |map2structarr           |structarrs_merged                                  |mergedstructarr2map                                |
 --- --- --- --- ------------------------ --------------------------- ------------------------ --------------------------------------------------- --------------------------------------------------- 
|abc|1  |2  |3  |{key_b -> 5, key_a -> 4}|[{c1, 1}, {c2, 2}, {c3, 3}]|[{key_b, 5}, {key_a, 4}]|[{key_b, 5}, {key_a, 4}, {c1, 1}, {c2, 2}, {c3, 3}]|{key_b -> 5, key_a -> 4, c1 -> 1, c2 -> 2, c3 -> 3}|
 --- --- --- --- ------------------------ --------------------------- ------------------------ --------------------------------------------------- --------------------------------------------------- 

root
 |-- id: string (nullable = true)
 |-- c1: long (nullable = true)
 |-- c2: long (nullable = true)
 |-- c3: long (nullable = true)
 |-- c4: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- col2structarr: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- key: string (nullable = false)
 |    |    |-- value: long (nullable = true)
 |-- map2structarr: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- key: string (nullable = false)
 |    |    |-- value: long (nullable = true)
 |-- structarrs_merged: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- key: string (nullable = false)
 |    |    |-- value: long (nullable = true)
 |-- mergedstructarr2map: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
  • Related