I have a file with many JSON records in it. Each record contains a Struct ('Properties'), and within each of those there is a String that looks like this:
'meterDetails: "@{meterName=Read Operations; meterCategory=Storage; meterSubCategory=General Block Blob; unitOfMeasure=100000000}"'
Note that the values are not enclosed in quotes.
I want to treat this column (meterDetails) as another Struct in my DF, since all structs will be flattened eventually.
Proceeding with defining a schema, removing the @ with regexp_replace('col', '@', ''), and using from_json with that schema resulted in a new column in JSON format, but with all NULL values.
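For reference, here is a minimal sketch of that first attempt (the schema below is my assumption of what was defined); it also shows why from_json comes back NULL:

from pyspark.sql.functions import col, from_json, regexp_replace
from pyspark.sql.types import StringType, StructField, StructType

# Assumed schema for the from_json attempt
schema = StructType([
    StructField("meterName", StringType()),
    StructField("meterCategory", StringType()),
    StructField("meterSubCategory", StringType()),
    StructField("unitOfMeasure", StringType()),
])

# After stripping '@', the value is still '{meterName=Read Operations; ...}',
# which is not valid JSON (unquoted keys and values, '=' instead of ':'),
# so from_json yields NULL for every row
df = df.withColumn(
    "meterDetailsStruct",
    from_json(regexp_replace(col("meterDetails"), "@", ""), schema),
)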
Splitting the column with split(col("meterDetails"), ";") turns it into an Array, but upon conversion to JSON it is back to all NULL values.
Question:
I'm clearly misunderstanding the @{..} structure passed by this API. In Spark, should I convert this string to an object that will natively result in a Struct?
CodePudding user response:
To transform your dataset into an operable JSON file, I would first use regexp_extract to get whatever is within @{ and }; that is done through:
from pyspark.sql.functions import col, expr, regexp_extract, split

df1 = df1.withColumn("data", regexp_extract(col("meterDetails"), "\\{(.*)\\}", 1))
Now, data looks as below:
meterName=Read Operations; meterCategory=Storage; meterSubCategory=General Block Blob; unitOfMeasure=100000000
Next, I would split on the '; ' separator (the trailing space keeps it out of the keys) and transform your data into a named_struct:
df1 = df1.withColumn("data", split(col("data"), "; "))
df1 = df1.withColumn(
    "data",
    expr("transform(data, x -> named_struct('key', split(x, '=')[0], 'value', split(x, '=')[1]))")
)
Now, data looks as:
[{meterName, Read Operations}, {meterCategory, Storage}, {meterSubCategory, General Block Blob}, {unitOfMeasure, 100000000}]
where col("data")[0].key gives meterName, for example.
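For instance, to verify that access pattern on the DataFrame from the previous step:

df1.select(
    col("data")[0].key.alias("first_key"),     # meterName
    col("data")[0].value.alias("first_value")  # Read Operations
).show(truncate=False)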
To go further, let's say you want another column that extracts only the keys (note: key and value are hard-coded from the previous step):
df1 = df1.withColumn("otherData", expr("transform(data, x -> x.key)"))
The result:
[meterName, meterCategory, meterSubCategory, unitOfMeasure]
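As an optional extra step (my own addition, not part of the question; map_from_entries requires Spark 2.4+), the array of key/value entries can be collapsed into a single map column so that fields become addressable by name:

# Collapse the array of {key, value} structs into one map column
df1 = df1.withColumn("dataMap", expr("map_from_entries(data)"))
df1.select(col("dataMap")["meterName"]).show(truncate=False)  # Read Operations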
I hope this is what you are looking for, good luck!
CodePudding user response:
Somehow I gravitate towards the str_to_map function, but then you will need to transform the map into a struct. I'm not sure this is the best way, but I'd do it like this.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [("@{meterName=Read Operations; meterCategory=Storage; meterSubCategory=General Block Blob; unitOfMeasure=100000000}",)],
    ['meterDetails'])
df.show(truncate=0)
# +-----------------------------------------------------------------------------------------------------------------+
# |meterDetails                                                                                                     |
# +-----------------------------------------------------------------------------------------------------------------+
# |@{meterName=Read Operations; meterCategory=Storage; meterSubCategory=General Block Blob; unitOfMeasure=100000000}|
# +-----------------------------------------------------------------------------------------------------------------+
Script:
# Convert the string to a map
df = df.withColumn(
    "meterDetails",
    F.expr("str_to_map(TRIM(BOTH '@{}' FROM meterDetails), '; ', '=')")
)

# Convert the map to a struct: serialize the map to JSON, infer the
# schema from the JSON strings, then parse the JSON back with that schema
df = df.withColumn("meterDetails", F.to_json("meterDetails"))
json_schema = spark.read.json(df.rdd.map(lambda r: r.meterDetails)).schema
df = df.withColumn("meterDetails", F.from_json("meterDetails", json_schema))
df.show(truncate=0)
# +---------------------------------------------------------+
# |meterDetails                                             |
# +---------------------------------------------------------+
# |{Storage, Read Operations, General Block Blob, 100000000}|
# +---------------------------------------------------------+
df.printSchema()
# root
#  |-- meterDetails: struct (nullable = true)
#  |    |-- meterCategory: string (nullable = true)
#  |    |-- meterName: string (nullable = true)
#  |    |-- meterSubCategory: string (nullable = true)
#  |    |-- unitOfMeasure: string (nullable = true)
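And since the goal was to flatten all structs eventually, the named fields can now be promoted to top-level columns with a plain select on the struct:

df = df.select("meterDetails.*")
df.show(truncate=0)
# +-------------+---------------+------------------+-------------+
# |meterCategory|meterName      |meterSubCategory  |unitOfMeasure|
# +-------------+---------------+------------------+-------------+
# |Storage      |Read Operations|General Block Blob|100000000    |
# +-------------+---------------+------------------+-------------+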