How to rename the first level keys of struct with PySpark in Azure Databricks?

Time:10-13

I would like to rename the keys of the first level objects inside my payload.

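import json
from pyspark.sql.functions import *
# 'shape' is included so the sample data matches the schema printed below
ds = {'Fruits': {'apple': {'color': 'red', 'shape': 'round'}, 'mango': {'color': 'green'}}, 'Vegetables': None}
# spark.read.json expects JSON strings, so serialize the dict first
df = spark.read.json(sc.parallelize([json.dumps(ds)]))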
df.printSchema()
"""
root
 |-- Fruits: struct (nullable = true)
 |    |-- apple: struct (nullable = true)
 |    |    |-- color: string (nullable = true)
 |    |    |-- shape: string (nullable = true)
 |    |-- mango: struct (nullable = true)
 |    |    |-- color: string (nullable = true)
 |-- Vegetables: string (nullable = true)
"""

Desired output:

root
 |-- Fruits: struct (nullable = true)
 |    |-- APPLE: struct (nullable = true)
 |    |    |-- color: string (nullable = true)
 |    |    |-- shape: string (nullable = true)
 |    |-- MANGO: struct (nullable = true)
 |    |    |-- color: string (nullable = true)
 |-- Vegetables: string (nullable = true)

In this case I would like to rename the keys in the first level to uppercase.

If I had a map type I could use transform_keys:

df.select(transform_keys("Fruits", lambda k, _: upper(k)).alias("data_upper")).display()

Unfortunately, I have a struct type.

AnalysisException: cannot resolve 'transform_keys(Fruits, lambdafunction(upper(x_18), x_18, y_19))' due to argument data type mismatch: argument 1 requires map type, however, 'Fruits' is of struct<apple:struct<color:string,shape:string>,mango:struct<color:string>> type.;

I'm using Databricks runtime 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12).

CodePudding user response:

The function you tried to use (transform_keys) is for map type columns. Your column type is struct.

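You could use withField. Because Spark resolves field names case-insensitively by default (spark.sql.caseSensitive is false), withField('APPLE', ...) should replace the existing apple field rather than add a second one, which is why the resulting schema contains only the uppercase names.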

from pyspark.sql import functions as F
ds = spark.createDataFrame([], 'Fruits struct<apple:struct<color:string,shape:string>,mango:struct<color:string>>, Vegetables string')
ds.printSchema()
# root
#  |-- Fruits: struct (nullable = true)
#  |    |-- apple: struct (nullable = true)
#  |    |    |-- color: string (nullable = true)
#  |    |    |-- shape: string (nullable = true)
#  |    |-- mango: struct (nullable = true)
#  |    |    |-- color: string (nullable = true)
#  |-- Vegetables: string (nullable = true)

ds = ds.withColumn('Fruits', F.col('Fruits').withField('APPLE', F.col('Fruits.apple')))
ds = ds.withColumn('Fruits', F.col('Fruits').withField('MANGO', F.col('Fruits.mango')))

ds.printSchema()
# root
#  |-- Fruits: struct (nullable = true)
#  |    |-- APPLE: struct (nullable = true)
#  |    |    |-- color: string (nullable = true)
#  |    |    |-- shape: string (nullable = true)
#  |    |-- MANGO: struct (nullable = true)
#  |    |    |-- color: string (nullable = true)
#  |-- Vegetables: string (nullable = true)

You can also recreate the struct, but then you need to list every field you want to keep, because any field you leave out is dropped.

ds = ds.withColumn('Fruits', F.struct(
    F.col('Fruits.apple').alias('APPLE'),
    F.col('Fruits.mango').alias('MANGO'),
))

ds.printSchema()
# root
#  |-- Fruits: struct (nullable = true)
#  |    |-- APPLE: struct (nullable = true)
#  |    |    |-- color: string (nullable = true)
#  |    |    |-- shape: string (nullable = true)
#  |    |-- MANGO: struct (nullable = true)
#  |    |    |-- color: string (nullable = true)
#  |-- Vegetables: string (nullable = true)
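
Both approaches above hard-code the field names. If the struct has many keys, the new names can be derived from the existing schema instead. The following is only a sketch under some assumptions: the helper names uppercase_first_level_keys and uppercase_struct_column are made up for illustration, and it relies on Spark casting one struct type to another by field position (same number of fields, compatible types), which should hold here since only the names change.

```python
import copy


def uppercase_first_level_keys(schema_json, column):
    """Return a copy of a schema dict (the shape produced by
    df.schema.jsonValue()) in which the first-level field names of the
    struct column `column` are upper-cased. Nested levels are untouched."""
    out = copy.deepcopy(schema_json)
    for field in out["fields"]:
        ftype = field["type"]
        # Struct types are dicts; primitive types are plain strings.
        if field["name"] == column and isinstance(ftype, dict) and ftype.get("type") == "struct":
            for sub in ftype["fields"]:
                sub["name"] = sub["name"].upper()
    return out


def uppercase_struct_column(df, column):
    """Hypothetical helper: cast `column` to the same struct type with
    upper-cased first-level field names."""
    # Imported lazily so the pure-Python helper above works without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType

    new_schema = StructType.fromJson(
        uppercase_first_level_keys(df.schema.jsonValue(), column)
    )
    return df.withColumn(column, F.col(column).cast(new_schema[column].dataType))
```

With the DataFrame from the answer this would be called as ds = uppercase_struct_column(ds, 'Fruits'). Replacing .upper() with any other string function gives a different renaming rule without touching the rest.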