How can I convert an array column (whose elements are structs) in a JSON dataset into multiple columns, using Python, Spark, or pandas? The data is structured in this form:
root
|-- items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- idAccount: long (nullable = true)
| | |-- infractionType: string (nullable = true)
| | |-- responseTime: string (nullable = true)
| | |-- status: string (nullable = true)
| | |-- transactionCode: string (nullable = true)
I'm expecting some kind of this:
id | idAccount |
---|---|
value | value |
value | value |
CodePudding user response:
In Spark SQL, you can access an item in an ArrayType or MapType column by using getItem (a method on Column, not on the functions module). For example, to get the id of the first item, you can use df.select(func.col('items').getItem(0).getItem('id'))
CodePudding user response:
An array of structs can be exploded into columns using the inline SQL function. Here's an example of how it works.
data_sdf = spark.createDataFrame([([(1234, 2345, 3456), (4321, 5432, 6543)],)],
'items array<struct<id: int, id_acc: int, foo: int>>'
)
# +----------------------------------------+
# |items                                   |
# +----------------------------------------+
# |[{1234, 2345, 3456}, {4321, 5432, 6543}]|
# +----------------------------------------+
# root
# |-- items: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- id: integer (nullable = true)
# | | |-- id_acc: integer (nullable = true)
# | | |-- foo: integer (nullable = true)
# explode and create new columns using struct fields - using `inline`
data_sdf. \
selectExpr('inline(items)'). \
show()
# +----+------+----+
# |  id|id_acc| foo|
# +----+------+----+
# |1234| 2345|3456|
# |4321| 5432|6543|
# +----+------+----+
You can further select() just the required fields after the explosion.