The dataframe:
from pyspark.sql import functions as F
df = spark.createDataFrame([([(1, 2), (3, 4)],)], 'col_name array<struct<c1:int,c2:int>>')
df.show()
# +----------------+
# |        col_name|
# +----------------+
# |[{1, 2}, {3, 4}]|
# +----------------+
df.printSchema()
# root
#  |-- col_name: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- c1: integer (nullable = true)
#  |    |    |-- c2: integer (nullable = true)
I explode the array (the result is a column of type struct<c1:int,c2:int>), and then I select every struct field (which takes a second select):
df = df.select(
    F.explode('col_name')
).select(
    [f'col.{c}' for c in ('c1', 'c2')]
)
df.show()
# +---+---+
# | c1| c2|
# +---+---+
# |  1|  2|
# |  3|  4|
# +---+---+
df.printSchema()
# root
#  |-- c1: integer (nullable = true)
#  |-- c2: integer (nullable = true)
I know I can shorten the second select to just 'col.*', but I would still have two selects.
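For reference, a sketch of that shortened form (applied to the original DataFrame, still two selects):
df.select(F.explode('col_name')).select('col.*')  # 'col.*' expands every field of the exploded struct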
Question. Is there a method to select struct fields right after the explode with only 1 select?
As the result of the explode has schema struct<c1:int,c2:int>, I thought this would work...
df = df.select(
    [F.explode('col_name')[c] for c in ('c1', 'c2')]
)
AnalysisException: No such struct field c1 in col
CodePudding user response:
Use the magic inline function, which explodes an array of structs into one row per element with a column per struct field:
df.selectExpr('inline(col_name)').show()
+---+---+
| c1| c2|
+---+---+
|  1|  2|
|  3|  4|
+---+---+
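If you prefer the DataFrame API over selectExpr, the same generator is available through F.expr, and, assuming Spark 3.4 or later, also directly as F.inline:
df.select(F.expr('inline(col_name)')).show()  # works on earlier Spark versions as well
df.select(F.inline('col_name')).show()        # assumes Spark 3.4+, where inline was added to the Python API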