I have a file with many JSON records in it. Each record contains a Struct ('Properties'), and within each of those there is a String that looks like this:
'meterDetails: "@{meterName=Read Operations; meterCategory=Storage; meterSubCategory=General Block Blob; unitOfMeasure=100000000}"'
Note that the values are not enclosed in quotes.
I want to treat this column (meterDetails) as another Struct in my DF, since all structs will be flattened eventually.
Proceeding with defining a schema, removing the @ with regexp_replace('col', '@', ''), and using from_json with that schema resulted in a new column in JSON format, but with all NULL values.
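For reference, here is a minimal sketch of that first attempt (the schema below is my assumption of what was defined); it also shows why from_json comes back NULL:

from pyspark.sql.functions import col, from_json, regexp_replace
from pyspark.sql.types import StringType, StructField, StructType

# Assumed schema for the from_json attempt
schema = StructType([
    StructField("meterName", StringType()),
    StructField("meterCategory", StringType()),
    StructField("meterSubCategory", StringType()),
    StructField("unitOfMeasure", StringType()),
])

# After stripping '@', the value is still '{meterName=Read Operations; ...}',
# which is not valid JSON (unquoted keys and values, '=' instead of ':'),
# so from_json yields NULL for every row
df = df.withColumn(
    "meterDetailsStruct",
    from_json(regexp_replace(col("meterDetails"), "@", ""), schema),
)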
Splitting the column with split(col("meterDetails"), ";") turns it into an Array, but upon conversion to JSON it is back to all NULL values.
Question:
I'm clearly misunderstanding the @{..} structure passed by this API. In Spark, should I convert this string to an object that will natively result in a Struct?
CodePudding user response:
To transform your dataset into an operable JSON file, I would first use regexp_extract to get whatever is within @{ and }; that is done through:
from pyspark.sql.functions import col, expr, regexp_extract, split

df1 = df1.withColumn("data", regexp_extract(col("meterDetails"), "\\{(.*)\\}", 1))
Now, data looks as below:
meterName=Read Operations; meterCategory=Storage; meterSubCategory=General Block Blob; unitOfMeasure=100000000
Next, I would split on the '; ' separator (the trailing space keeps it out of the keys) and transform your data into a named_struct:
df1 = df1.withColumn("data", split(col("data"), "; "))
df1 = df1.withColumn(
    "data",
    expr("transform(data, x -> named_struct('key', split(x, '=')[0], 'value', split(x, '=')[1]))")
)
Now, data looks as:
[{meterName, Read Operations}, {meterCategory, Storage}, {meterSubCategory, General Block Blob}, {unitOfMeasure, 100000000}]
where col("data")[0].key gives meterName, for example.
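For instance, to verify that access pattern on the DataFrame from the previous step:

df1.select(
    col("data")[0].key.alias("first_key"),     # meterName
    col("data")[0].value.alias("first_value")  # Read Operations
).show(truncate=False)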
To go further, let's say you want another column that extracts only the keys (note: key and value are hard-coded from the previous step):
df1 = df1.withColumn("otherData", expr("transform(data, x -> x.key)"))
The result:
[meterName, meterCategory, meterSubCategory, unitOfMeasure]
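As an optional extra step (my own addition, not part of the question; map_from_entries requires Spark 2.4+), the array of key/value entries can be collapsed into a single map column so that fields become addressable by name:

# Collapse the array of {key, value} structs into one map column
df1 = df1.withColumn("dataMap", expr("map_from_entries(data)"))
df1.select(col("dataMap")["meterName"]).show(truncate=False)  # Read Operations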
I hope this is what you are looking for, good luck!
CodePudding user response:
Somehow I gravitate towards the str_to_map function, but then you will need to transform the map into a struct. I'm not sure this is the best way, but I'd do it like this.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [("@{meterName=Read Operations; meterCategory=Storage; meterSubCategory=General Block Blob; unitOfMeasure=100000000}",)],
    ['meterDetails'])
df.show(truncate=0)
# +-----------------------------------------------------------------------------------------------------------------+
# |meterDetails                                                                                                     |
# +-----------------------------------------------------------------------------------------------------------------+
# |@{meterName=Read Operations; meterCategory=Storage; meterSubCategory=General Block Blob; unitOfMeasure=100000000}|
# +-----------------------------------------------------------------------------------------------------------------+
Script:
# Convert the string to a map
df = df.withColumn(
    "meterDetails",
    F.expr("str_to_map(TRIM(BOTH '@{}' FROM meterDetails), '; ', '=')")
)

# Convert the map to a struct: serialize the map to JSON, infer the
# schema from the JSON strings, then parse the JSON back with that schema
df = df.withColumn("meterDetails", F.to_json("meterDetails"))
json_schema = spark.read.json(df.rdd.map(lambda r: r.meterDetails)).schema
df = df.withColumn("meterDetails", F.from_json("meterDetails", json_schema))
df.show(truncate=0)
# +---------------------------------------------------------+
# |meterDetails                                             |
# +---------------------------------------------------------+
# |{Storage, Read Operations, General Block Blob, 100000000}|
# +---------------------------------------------------------+
df.printSchema()
# root
#  |-- meterDetails: struct (nullable = true)
#  |    |-- meterCategory: string (nullable = true)
#  |    |-- meterName: string (nullable = true)
#  |    |-- meterSubCategory: string (nullable = true)
#  |    |-- unitOfMeasure: string (nullable = true)
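And since the goal was to flatten all structs eventually, the named fields can now be promoted to top-level columns with a plain select on the struct:

df = df.select("meterDetails.*")
df.show(truncate=0)
# +-------------+---------------+------------------+-------------+
# |meterCategory|meterName      |meterSubCategory  |unitOfMeasure|
# +-------------+---------------+------------------+-------------+
# |Storage      |Read Operations|General Block Blob|100000000    |
# +-------------+---------------+------------------+-------------+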