Home > Enterprise >  How to parse array column in the spark that contains the list of key-> value pair
How to parse array column in the spark that contains the list of key-> value pair

Time:07-13

I have a dataframe that contains five columns (A1,B1,C1,D1,E1)

Column E1 is an Array that contains a list [[a1 -> V1, a2-> V2, a3 -> V3]]

Could you please guide me how can I get the data as format below in PySpark or Spark SQL:

A1 B1 C1 D1 a1 a2 a3

Note that column E1 has special symbol "->" between key and value.

Dataframe's schema:

root 
|-- A1: integer (nullable = true) 
|-- B1: string (nullable = true) 
|-- C1: string (nullable = true) 
|-- D1: long (nullable = true) 
|-- E1: array (nullable = true) 
|   |-- element: map (containsNull = true) 
|   |   |-- key: string 
|   |   |-- value: string (valueContainsNull = true) 

CodePudding user response:

You could go through every map in your array and select elements accordingly. Python's enumerate may be of use.

df = df.select(
    'A1', 'B1', 'C1', 'D1',
    *[F.col('E1')[i][c].alias(c) for i, c in enumerate(['a1', 'a2', 'a3'])]
)

Full test:

from pyspark.sql import functions as F

# Creating example DF
df = spark.createDataFrame(
    [(1, '1', '2', 2, 'a1', 'V1', 'a2', 'V2', 'a3', 'V3'),
     (3, '3', '4', 4, 'a1', 'V3', 'a2', 'V4', 'a3', 'V5')],
    ['A1', 'B1', 'C1', 'D1', 'K1', 'V1', 'K2', 'V2', 'K3', 'V3'])
df = df.select(
    'A1', 'B1', 'C1', 'D1',
    F.array(
        F.create_map('K1', 'V1'),
        F.create_map('K2', 'V2'),
        F.create_map('K3', 'V3'),
    ).alias('E1')
)
df.show(truncate=0)
#  --- --- --- --- ------------------------------------ 
# |A1 |B1 |C1 |D1 |E1                                  |
#  --- --- --- --- ------------------------------------ 
# |1  |1  |2  |2  |[{a1 -> V1}, {a2 -> V2}, {a3 -> V3}]|
# |3  |3  |4  |4  |[{a1 -> V3}, {a2 -> V4}, {a3 -> V5}]|
#  --- --- --- --- ------------------------------------ 

# Script
df = df.select(
    'A1', 'B1', 'C1', 'D1',
    *[F.col('E1')[i][c].alias(c) for i, c in enumerate(['a1', 'a2', 'a3'])]
)
df.show(truncate=0)
#  --- --- --- --- --- --- --- 
# |A1 |B1 |C1 |D1 |a1 |a2 |a3 |
#  --- --- --- --- --- --- --- 
# |1  |1  |2  |2  |V1 |V2 |V3 |
# |3  |3  |4  |4  |V3 |V4 |V5 |
#  --- --- --- --- --- --- --- 
  • Related