Extracting value from json from spark table gives SyntaxError or keyType should be DataType error


Let's suppose I have a dataframe org like the one below:

id |raw
123|{"inn":"123", "prof": "tkie"}

I need to extract all the values from the JSON in column raw into new columns. How do I do that?

I tried:

org.withColumn('inn', from_json($"raw", MapType(StringType, StringType))).withColumn('inn', col('searchcard'.getItem('inn')))

got error:

File "", line 1 org.withColumn('inn', from_json($"raw", MapType(StringType, StringType))).withColumn('inn', col('searchcard'.getItem('inn'))) ^ SyntaxError: invalid syntax

And:

org.withColumn('inn', from_json("raw", MapType(StringType, StringType))).withColumn('inn', col('searchcard'.getItem('inn')))

error:

AssertionError                            Traceback (most recent call last)
in ()
----> 1 org.withColumn('inn', from_json("raw", MapType(StringType, StringType))).withColumn('inn', col('searchcard'.getItem('inn')))

/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/types.py in __init__(self, keyType, valueType, valueContainsNull)
    342         False
    343         """
--> 344         assert isinstance(keyType, DataType), "keyType should be DataType"
    345         assert isinstance(valueType, DataType), "valueType should be DataType"
    346         self.keyType = keyType

AssertionError: keyType should be DataType

CodePudding user response:

The get_json_object method did the trick!

org.withColumn('inn',get_json_object(org.raw, '$.inn')).show(1)
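
For reference, here is a minimal sketch (assuming the same org dataframe from the question) that uses get_json_object with one JSONPath expression per field, pulling both inn and prof into their own columns:

import pyspark.sql.functions as F

# Each JSONPath expression extracts one field from the raw JSON string
org.withColumn('inn', F.get_json_object(org.raw, '$.inn')) \
   .withColumn('prof', F.get_json_object(org.raw, '$.prof')) \
   .show(1)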

CodePudding user response:

Your code contains several problems:

  • $"raw" is used with Scala API, in Pyspark use col("raw") or directly as string "raw"
  • When using StringType or any other type, in python you need to add parentheses StringType()
  • getItem is a Column method but your calling it in string ('searchcard'.getItem('inn'))

Here's the same code corrected, as a complete working example using the from_json function:

import pyspark.sql.functions as F
from pyspark.sql.types import StringType, MapType

org = spark.createDataFrame([
    (123, '{"inn":"123", "prof": "tkie"}')
], ["id", "raw"])

org.withColumn(
    'raw',
    F.from_json("raw", MapType(StringType(), StringType()))
).select(
    'id',
    F.col('raw').getItem('inn').alias('inn')
).show()

#+---+---+
#| id|inn|
#+---+---+
#|123|123|
#+---+---+
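
If the JSON keys are known ahead of time, another option (a sketch assuming raw always holds the two string fields inn and prof) is to pass an explicit StructType schema to from_json and expand every field into its own column at once:

import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Explicit schema: each declared field becomes a typed column after parsing
schema = StructType([
    StructField("inn", StringType()),
    StructField("prof", StringType()),
])

org.withColumn("parsed", F.from_json("raw", schema)) \
   .select("id", "parsed.*") \
   .show()

#+---+---+----+
#| id|inn|prof|
#+---+---+----+
#|123|123|tkie|
#+---+---+----+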