How to chain explode and struct field selection?

Time:06-29

The dataframe:

from pyspark.sql import functions as F
df = spark.createDataFrame([([(1, 2), (3, 4)],)], 'col_name array<struct<c1:int,c2:int>>')

df.show()
# +----------------+
# |        col_name|
# +----------------+
# |[{1, 2}, {3, 4}]|
# +----------------+

df.printSchema()
# root
#  |-- col_name: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- c1: integer (nullable = true)
#  |    |    |-- c2: integer (nullable = true)

I explode the array, which gives a column of type struct<c1:int,c2:int>, and then select every struct field. This takes two selects:

df = df.select(
    F.explode('col_name')
).select(
    [f'col.{c}' for c in ('c1', 'c2')]
)
df.show()
# +---+---+
# | c1| c2|
# +---+---+
# |  1|  2|
# |  3|  4|
# +---+---+

df.printSchema()
# root
#  |-- c1: integer (nullable = true)
#  |-- c2: integer (nullable = true)

I know I can shorten the second select to just 'col.*', but I would still have two selects.

Question: is there a way to select the struct fields right after the explode, using only one select?

Since the result of the explode has type struct<c1:int,c2:int>, I thought this would work, but it raises an AnalysisException:

df = df.select(
    [F.explode('col_name')[c] for c in ('c1', 'c2')]
)

AnalysisException: No such struct field c1 in col

Answer:

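Use the inline generator function. It explodes an array of structs into one row per element and one column per struct field, so a single select is enough: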

df.selectExpr('inline(col_name)').show()

+---+---+
| c1| c2|
+---+---+
|  1|  2|
|  3|  4|
+---+---+