How to create a count of nested JSON objects in a DataFrame row using Spark/Scala

Time:05-19

I have a column full of JSON object strings that look like this:

"steps":{
    "step_1":{
        "conditions":{
            "complete_by":"2022-05-17",
            "requirement":100
        },
        "status":"eligible",
        "type":"buy"
    },
    "step_2":{
        "conditions":{
            "complete_by":"2022-05-27",
            "requirement":100
        },
        "status":"eligible",
        "type":"buy"
    }
}

In the steps object, there can be any number of steps (within reason).

My question is: how would I create another DataFrame column that counts the number of steps in the JSON string of each row?

I'm using Spark/Scala, so I started a UDF along these lines:

def jsonCount(steps: String): Int = ??? // this is the part I can't work out

val jsonCountUDF = udf(jsonCount _)

val stepDF = stepData.withColumn("NumberOfSteps", jsonCountUDF(col("steps")))

This is where I get stuck. I'd like to loop through each row in the steps column and count the step objects in each steps JSON string. Does anyone have experience with a similar task, or know of a function that would simplify this?
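The counting itself is simple once the string is parsed: a JSON object's length is its number of top-level keys. A minimal plain-Python sketch of that logic, before wiring it into a UDF (the sample string and the `json_count` name are illustrative):

```python
import json

# One complete JSON document with a top-level "steps" object.
sample = ('{"steps": {'
          '"step_1": {"status": "eligible", "type": "buy"},'
          '"step_2": {"status": "eligible", "type": "buy"}}}')

def json_count(json_string):
    # len() of a parsed JSON object is its number of keys,
    # so counting the entries under "steps" is a one-liner.
    return len(json.loads(json_string)["steps"])

print(json_count(sample))  # 2
```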

CodePudding user response:

#make some data (renamed from str to avoid shadowing the Python builtin)
json_str = "{\"steps\":{ \"step_1\":{\"conditions\":{ \"complete_by\":\"2022-05-17\", \"requirement\":100} }  , \"step_2\":{  \"status\":\"eligible\", \"type\":\"buy\"   }  }}"

#imports needed for the udf
import json
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

#implement a function to return the count
def jsonCount(jsonString):
    json_obj = json.loads(jsonString)
    return len(json_obj["steps"])

#define the udf
JSONCount = udf(jsonCount, IntegerType())

#create sample dataframe
df = spark.createDataFrame([[json_str]], ["json"])

#run udf on dataframe
df.select(df.json, JSONCount(df.json).alias("StepCount")).show()

+--------------------+---------+
|                json|StepCount|
+--------------------+---------+
|{"steps":{ "step_...|        2|
+--------------------+---------+

CodePudding user response:

You can select that sub-struct and take the number of its columns (note this is read from the DataFrame schema, so it is one value computed up front):

  val stepSize = df.select($"steps.*").columns.size

then add it to your df:

val df_steps = df.withColumn("NumberOfSteps", lit(stepSize))
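One caveat with the schema-based count: `columns.size` gives every row the same `NumberOfSteps`, which only matches when all rows contain the same step keys. When the number of steps varies per row, the count has to come from each string individually. A plain-Python sketch of that per-row behaviour (the sample rows and the `count_steps` helper are illustrative):

```python
import json

# Two rows with different numbers of steps.
rows = [
    '{"steps": {"step_1": {"type": "buy"}, "step_2": {"type": "buy"}}}',
    '{"steps": {"step_1": {"type": "buy"}}}',
]

def count_steps(json_string):
    # Count the keys under "steps"; rows without a "steps" key count as 0.
    return len(json.loads(json_string).get("steps", {}))

print([count_steps(r) for r in rows])  # [2, 1]
```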