I have a data frame like below
col1, col2 ... coln, key, value
I got the key, value after exploding a col with format map<string, string>
Now I want to stack col2 to coln and insert them in the key, value columns. I want the previous values in key, value cols to remain as it is.
How to achieve this? One solution I can think of is pivot the key, value column and then create the stack expression. Anything better and faster than this?
Output
CodePudding user response:
You can add the new elements within the map column before exploding it. It will be easier to process.
Here's an example using a sample data similar to your sample's structure.
data_sdf.show()
# --- --- --- --- ------------------------
# |id |c1 |c2 |c3 |c4 |
# --- --- --- --- ------------------------
# |abc|1 |2 |3 |{key_b -> 5, key_a -> 4}|
# --- --- --- --- ------------------------
col_list = ['c1', 'c2', 'c3']
data_sdf. \
withColumn('col2structarr',
func.array(*[func.struct(func.lit(k).alias('key'), func.col(k).alias('value'))
for k in col_list]
)
). \
withColumn('map2structarr', func.map_entries('c4')). \
withColumn('structarrs_merged', func.array_union('map2structarr', 'col2structarr')). \
withColumn('mergedstructarr2map', func.map_from_entries('structarrs_merged')). \
selectExpr('id', 'explode(mergedstructarr2map)'). \
show(truncate=False)
# --- ----- -----
# |id |key |value|
# --- ----- -----
# |abc|key_b|5 |
# |abc|key_a|4 |
# |abc|c1 |1 |
# |abc|c2 |2 |
# |abc|c3 |3 |
# --- ----- -----
- a map column is like a dictionary and has key-value pairs. So, when it is converted to array of structs, it generates 2 fields within the structs -
key
,value
. map2structarr
- convert the map column to array of structs, so that we can usearray_union
for the new element addition.col2structarr
- using the aforementioned info on map columns and array of structs, we can create an array of structs for the columns that we need to add in the map column.structarrs_merged
- merge the 2 arrays created above (1 from map column, and 1 from the columns we want to add) usingarray_join
. The function requires the arrays to be of same data type.mergedstructarr2map
- an array of strut can be converted to map type column usingmap_from_entries
.- use
explode
to createkey
andvalue
columns from the final map type column.
Here is a snapshot of all the individual columns created above (before the explode
)
--- --- --- --- ------------------------ --------------------------- ------------------------ --------------------------------------------------- ---------------------------------------------------
|id |c1 |c2 |c3 |c4 |col2structarr |map2structarr |structarrs_merged |mergedstructarr2map |
--- --- --- --- ------------------------ --------------------------- ------------------------ --------------------------------------------------- ---------------------------------------------------
|abc|1 |2 |3 |{key_b -> 5, key_a -> 4}|[{c1, 1}, {c2, 2}, {c3, 3}]|[{key_b, 5}, {key_a, 4}]|[{key_b, 5}, {key_a, 4}, {c1, 1}, {c2, 2}, {c3, 3}]|{key_b -> 5, key_a -> 4, c1 -> 1, c2 -> 2, c3 -> 3}|
--- --- --- --- ------------------------ --------------------------- ------------------------ --------------------------------------------------- ---------------------------------------------------
root
|-- id: string (nullable = true)
|-- c1: long (nullable = true)
|-- c2: long (nullable = true)
|-- c3: long (nullable = true)
|-- c4: map (nullable = true)
| |-- key: string
| |-- value: long (valueContainsNull = true)
|-- col2structarr: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- key: string (nullable = false)
| | |-- value: long (nullable = true)
|-- map2structarr: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- key: string (nullable = false)
| | |-- value: long (nullable = true)
|-- structarrs_merged: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- key: string (nullable = false)
| | |-- value: long (nullable = true)
|-- mergedstructarr2map: map (nullable = true)
| |-- key: string
| |-- value: long (valueContainsNull = true)