Extract first fields from struct columns into a dictionary


I need to create a dictionary from a Spark DataFrame's schema, which is of type pyspark.sql.types.StructType.

The code needs to walk the entire StructType, find only those StructField elements whose data type is itself a StructType, and build a dictionary that maps the name of each such parent StructField (the key) to the name of its first nested/child StructField (the value).

Example schema (StructType):

root
|-- field_1: int
|-- field_2: int
|-- field_3: struct
|    |-- date: date
|    |-- timestamp: timestamp
|-- field_4: int

Desired result:

{"field_3": "date"}

CodePudding user response:

You can use a dictionary comprehension that iterates over the schema's fields. Iterating a StructType yields its StructField objects, and indexing a StructType with an integer returns the field at that position.

{x.name: x.dataType[0].name for x in df.schema if x.dataType.typeName() == 'struct'}

Test #1

df = spark.createDataFrame([], 'field_1 int, field_2 int, field_3 struct<date:date,timestamp:timestamp>, field_4 int')

df.printSchema()
# root
#  |-- field_1: integer (nullable = true)
#  |-- field_2: integer (nullable = true)
#  |-- field_3: struct (nullable = true)
#  |    |-- date: date (nullable = true)
#  |    |-- timestamp: timestamp (nullable = true)
#  |-- field_4: integer (nullable = true)

{x.name: x.dataType[0].name for x in df.schema if x.dataType.typeName() == 'struct'}
# {'field_3': 'date'}

Test #2

df = spark.createDataFrame([], 'field_1 int, field_2 struct<col_int:int,col_long:long>, field_3 struct<date:date,timestamp:timestamp>')

df.printSchema()
# root
#  |-- field_1: integer (nullable = true)
#  |-- field_2: struct (nullable = true)
#  |    |-- col_int: integer (nullable = true)
#  |    |-- col_long: long (nullable = true)
#  |-- field_3: struct (nullable = true)
#  |    |-- date: date (nullable = true)
#  |    |-- timestamp: timestamp (nullable = true)

{x.name: x.dataType[0].name for x in df.schema if x.dataType.typeName() == 'struct'}
# {'field_2': 'col_int', 'field_3': 'date'}