Home > OS >  How to concatenate nested json in Apache Spark
How to concatenate nested json in Apache Spark

Time:11-15

Can someone let me know where I'm going wrong with my attempt to concatenate a nested JSON field.

I'm using the following code:

df = (df
       .withColumn("ingestion_date", current_timestamp())
       .withColumn("name", concat(col("name.forename"), 
             lit(" "), col("name.surname"))))
     )

Schema:

root  
|-- driverRef: string (nullable = true)  
|-- number: integer (nullable = true)  
|-- code: string (nullable = true)  
|-- forename: string (nullable = true)  
|-- surname: string (nullable = true)  
|-- dob: date (nullable = true)

As you can see, I'm trying to concatenate forname & surname, so as to provide a full name in the name field. At the present the data looks like the following:

enter image description here

After concatenating the 'name' field there should be one single value e.g. the 'name' field would just show Lewis Hamilton, and like wise for the other values in the 'name' field.

My code produces the following error:

Can't extract value from name#6976: need struct type but got string

CodePudding user response:

It would seem that you have a dataframe that contains a name column containing a json with two values: forename and surname, just like this {"forename": "Lewis", "surname" : "Hamilton"}.

That column, in spark, has a string type. That explains the error you obtain. You could only do name.forename if name were of type struct with a field called forename. That what spark means by need struct type but got string.

You just need to tell spark that this string column is a JSON and how to parse it.

from pyspark.sql.types import StructType, StringType, StructField
from pyspark.sql import functions as f

# initializing data
df = spark.range(1).withColumn('name',
       f.lit('{"forename": "Lewis", "surname" : "Hamilton"}'))
df.show(truncate=False)
 --- --------------------------------------------- 
|id |name                                         |
 --- --------------------------------------------- 
|0  |{"forename": "Lewis", "surname" : "Hamilton"}|
 --- --------------------------------------------- 

And parsing that JSON:

json_schema = StructType([
    StructField('forename', StringType()),
    StructField('surname', StringType())
])

df\
    .withColumn('s', f.from_json(f.col('name'), json_schema))\
    .withColumn("name", f.concat_ws(" ", f.col("s.forename"), f.col("s.surname")))\
    .show()
 --- -------------- ----------------- 
| id|          name|                s|
 --- -------------- ----------------- 
|  0|Lewis Hamilton|{Lewis, Hamilton}|
 --- -------------- ----------------- 

You may than get rid of s with drop, it contains the parsed struct.

  • Related