Home > Software design >  withField in Spark SQL
withField in Spark SQL

Time:08-04

In PySpark, we can use withField:

from pyspark.sql import functions as F

df = spark.createDataFrame([((1,),)], ['mystruct'])
df = df.withColumn('mystruct', F.col('mystruct').withField('n', F.lit(3)))

df.printSchema()
# root
#  |-- mystruct: struct (nullable = true)
#  |    |-- _1: long (nullable = true)
#  |    |-- n: integer (nullable = false)

How to use this in Spark SQL?

I've tried looking at the logical plan and applying the result to expr:

df.explain(True)
# == Parsed Logical Plan ==
# 'Project [update_fields('mystruct, WithField(n, 3)) AS mystruct#84]
#  - LogicalRDD [mystruct#82], false
# ...

df = spark.createDataFrame([((1,),)], ['mystruct'])
df = df.withColumn('mystruct', F.expr("update_fields(mystruct, WithField(n, 3))"))

Undefined function: update_fields. This function is neither a built-in/temporary function, nor a persistent function that is qualified as spark_catalog.default.update_fields.; line 1 pos 0

CodePudding user response:

At this time, there no such support for SQL API.

You'll have to recreate the whole struct in order to update it, using either named_struct or struct functions.

To reference existing fields, you can use mystruct.* as pointed in comments by @samkart:

df.createOrReplaceTempView("df")

df = spark.sql("SELECT struct(mystruct.*, 3 as n) as mystruct FROM df")

df.printSchema()
# root
#  |-- mystruct: struct (nullable = false)
#  |    |-- _1: long (nullable = true)
#  |    |-- n: integer (nullable = false)

Note that you're looking at the logical plan, if you look at the physical plan you'll see the sql expression being applied:

# == Physical Plan ==
# *(1) Project [if (isnull(mystruct#0)) null else named_struct(_1, mystruct#0._1, n, 3) AS mystruct#2]
#  - *(1) Scan ExistingRDD[mystruct#0]
  • Related