I have a column full of JSON object strings that look like this:
"steps":{
"step_1":{
"conditions":{
"complete_by":"2022-05-17",
"requirement":100
},
"status":"eligible",
"type":"buy"
},
"step_2":{
"conditions":{
"complete_by":"2022-05-27",
"requirement":100
},
"status":"eligible",
"type":"buy"
}
}
In the steps object, there can be any number of steps (within reason).
My question is: how would I create another DataFrame column that counts the number of steps in each row's JSON string?
I'm using Spark/Scala, so I created a UDF with the below:
def jsonCount(json: String): Int = ??? //this is the part I can't figure out
val jsonCountUDF = udf(jsonCount _)
val stepDF = stepData.withColumn("NumberOfSteps", jsonCountUDF(col("steps")))
This is where I get stuck. I'd like to loop through each row in the steps column and count the step objects inside each steps JSON string. Does anyone have experience with a similar task or know of a function to simplify this?
CodePudding user response:
#imports needed for the udf
import json
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

#make some data
json_str = "{\"steps\":{ \"step_1\":{\"conditions\":{ \"complete_by\":\"2022-05-17\", \"requirement\":100} } , \"step_2\":{ \"status\":\"eligible\", \"type\":\"buy\" } }}"

#implement a function to return the count of keys under "steps"
def jsonCount(jsonString):
    json_obj = json.loads(jsonString)
    return len(json_obj["steps"])

#define the udf
JSONCount = udf(jsonCount, IntegerType())

#create a sample dataframe
df = spark.createDataFrame([[json_str]], ["json"])

#run the udf on the dataframe
df.select(df.json, JSONCount(df.json).alias("StepCount")).show()
+--------------------+---------+
|                json|StepCount|
+--------------------+---------+
|{"steps":{ "step_...|        2|
+--------------------+---------+
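Since the question mentions Spark/Scala, a rough Scala equivalent of the same UDF might look like the sketch below. It assumes the json4s library that ships with Spark is on the classpath, and it reuses the stepData DataFrame and steps column names from the question; adjust to your actual schema.
import org.apache.spark.sql.functions.{col, udf}
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

//count the keys under "steps"; returns 0 if the field is missing or not an object
val jsonCountUDF = udf { (jsonString: String) =>
  parse(jsonString) \ "steps" match {
    case JObject(fields) => fields.size
    case _               => 0
  }
}

val stepDF = stepData.withColumn("NumberOfSteps", jsonCountUDF(col("steps")))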
CodePudding user response:
You can try to select that sub-struct and take its column count (this assumes steps has already been parsed into a struct column rather than left as a plain JSON string).
val stepSize = df.select($"steps.*").columns.size
Then add it to your df:
val df_steps = df.withColumn("NumberOfSteps", lit(stepSize))
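Note that stepSize is computed once from the schema, so every row gets the same count. If the count can vary per row and the column still holds raw JSON strings, a per-row variant is sketched below; it assumes Spark 3.1+ (for json_object_keys) and that each value is wrapped as {"steps": {...}} like the sample in the first answer. If the column already holds just the inner object, drop the get_json_object step.
import org.apache.spark.sql.functions.expr

//extract the "steps" object as a string, list its keys, and count them for each row
val df_steps = df.withColumn(
  "NumberOfSteps",
  expr("size(json_object_keys(get_json_object(steps, '$.steps')))")
)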