I have a dataframe that contains five columns (A1, B1, C1, D1, E1).
Column E1 is an array of maps that looks like [[a1 -> V1, a2 -> V2, a3 -> V3]].
Could you please guide me on how to get the data into the format below, in PySpark or Spark SQL:
A1 B1 C1 D1 a1 a2 a3
Note that column E1 shows the special symbol "->" between each key and value.
Dataframe's schema:
root
|-- A1: integer (nullable = true)
|-- B1: string (nullable = true)
|-- C1: string (nullable = true)
|-- D1: long (nullable = true)
|-- E1: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
CodePudding user response:
You could go through every map in your array and select elements accordingly. Since each array element here is a single-entry map, you can index the array by position and the map by key; Python's enumerate pairs each key with its position.
df = df.select(
    'A1', 'B1', 'C1', 'D1',
    *[F.col('E1')[i][c].alias(c) for i, c in enumerate(['a1', 'a2', 'a3'])]
)
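Since the question also mentions Spark SQL: the same positional lookup translates directly to SQL bracket syntax, where E1[0] accesses the first array element and ['a1'] accesses the map key. A minimal sketch, assuming the dataframe is registered under the hypothetical view name tbl:
df.createOrReplaceTempView('tbl')  # 'tbl' is an illustrative name
spark.sql("""
    SELECT A1, B1, C1, D1,
           E1[0]['a1'] AS a1,
           E1[1]['a2'] AS a2,
           E1[2]['a3'] AS a3
    FROM tbl
""").show(truncate=False)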
Full test:
from pyspark.sql import functions as F
# Creating example DF
df = spark.createDataFrame(
    [(1, '1', '2', 2, 'a1', 'V1', 'a2', 'V2', 'a3', 'V3'),
     (3, '3', '4', 4, 'a1', 'V3', 'a2', 'V4', 'a3', 'V5')],
    ['A1', 'B1', 'C1', 'D1', 'K1', 'V1', 'K2', 'V2', 'K3', 'V3'])
df = df.select(
    'A1', 'B1', 'C1', 'D1',
    F.array(
        F.create_map('K1', 'V1'),
        F.create_map('K2', 'V2'),
        F.create_map('K3', 'V3'),
    ).alias('E1')
)
df.show(truncate=0)
# +---+---+---+---+------------------------------------+
# |A1 |B1 |C1 |D1 |E1                                  |
# +---+---+---+---+------------------------------------+
# |1  |1  |2  |2  |[{a1 -> V1}, {a2 -> V2}, {a3 -> V3}]|
# |3  |3  |4  |4  |[{a1 -> V3}, {a2 -> V4}, {a3 -> V5}]|
# +---+---+---+---+------------------------------------+
# Script
df = df.select(
    'A1', 'B1', 'C1', 'D1',
    *[F.col('E1')[i][c].alias(c) for i, c in enumerate(['a1', 'a2', 'a3'])]
)
df.show(truncate=0)
# +---+---+---+---+---+---+---+
# |A1 |B1 |C1 |D1 |a1 |a2 |a3 |
# +---+---+---+---+---+---+---+
# |1  |1  |2  |2  |V1 |V2 |V3 |
# |3  |3  |4  |4  |V3 |V4 |V5 |
# +---+---+---+---+---+---+---+
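One caveat: the positional lookup assumes the maps always arrive in the same order (a1 first, a2 second, a3 third). If that order can vary, a hedged alternative is to merge the single-entry maps into one map per row and look values up by key. This sketch starts from the dataframe that still has the array column E1; it assumes Spark 2.4+ (for the aggregate and map_concat SQL functions) and that each key occurs at most once across the array, and E1_map is just an illustrative column name:
from pyspark.sql import functions as F

# Merge the array of single-entry maps into one map per row.
# The empty map() is cast so map_concat sees matching map<string,string> types.
merged = df.withColumn(
    'E1_map',
    F.expr("aggregate(E1, cast(map() as map<string,string>), (acc, m) -> map_concat(acc, m))")
)
merged.select(
    'A1', 'B1', 'C1', 'D1',
    *[F.col('E1_map')[k].alias(k) for k in ['a1', 'a2', 'a3']]
).show(truncate=0)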