I have a column in my DataFrame which contains a list of JSONs, but its type is String. I need to run explode
on this column, so first I need to convert it into a list. I couldn't find many references to this use case.
Sample data:
columnName: "[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}, {...}]"
The above is how the data looks; the fields are not fixed (the JSON at index 0 might have some fields, while the one at index 1 has different ones). The list can also contain more nested JSONs or some extra fields. I am currently using this -
"""explode(split(regexp_replace(regexp_replace(colName, '(\\\},)','}},'), '(\\\[|\\\])',''), "},")) as colName"""
where I am just replacing "}," with "}},", then removing "[" and "]", and then calling split on "}," - but this approach doesn't work since there are nested JSONs.
How can I extract the array from the string?
CodePudding user response:
You can try this way:
// Initial DataFrame
df.show(false)
+----------------------------------------------------------------------+
|columnName                                                            |
+----------------------------------------------------------------------+
|[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}]|
+----------------------------------------------------------------------+
df.printSchema()
root
|-- columnName: string (nullable = true)
// toArray is a user-defined function that parses an array of JSON objects which is present as a string
import org.json.JSONArray
import org.apache.spark.sql.functions.{col, explode, udf}

val toArray = udf { (data: String) =>
  val jsonArray = new JSONArray(data)
  var arr: Array[String] = Array()
  // Convert each JSON object in the array back to its string form
  val objects = (0 until jsonArray.length).map(x => jsonArray.getJSONObject(x))
  objects.foreach { elem =>
    arr :+= elem.toString
  }
  arr
}
// Using the udf and exploding the resultant array
val df1 = df.withColumn("columnName",explode(toArray(col("columnName"))))
df1.show(false)
+-----------------------------------------------------+
|columnName                                           |
+-----------------------------------------------------+
|{"other":7,"name":"a","info":{"grade":"b","age":"1"}}|
|{"random":"x"}                                       |
+-----------------------------------------------------+
df1.printSchema()
root
|-- columnName: string (nullable = true)
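Note: depending on your Spark version, you may be able to build the same df1 without a UDF. A minimal sketch, assuming your version's from_json accepts an ArrayType(StringType) schema (each array element then comes back as its raw JSON text):
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType}

// Parse the string as an array of raw JSON strings, then explode it
val df1NoUdf = df.withColumn(
  "columnName",
  explode(from_json(col("columnName"), ArrayType(StringType)))
)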
// Parsing the json string by obtaining the schema dynamically
val schema = spark.read.json(df1.select("columnName").rdd.map(x => x(0).toString)).schema
val df2 = df1.withColumn("columnName",from_json(col("columnName"),schema))
df2.show(false)
+---------------+
|columnName     |
+---------------+
|[[1, b], a, 7,]|
|[,,, x]        |
+---------------+
df2.printSchema()
root
|-- columnName: struct (nullable = true)
| |-- info: struct (nullable = true)
| | |-- age: string (nullable = true)
| | |-- grade: string (nullable = true)
| |-- name: string (nullable = true)
| |-- other: long (nullable = true)
| |-- random: string (nullable = true)
// Extracting all the fields from the json
df2.select(col("columnName.*")).show(false)
+------+----+-----+------+
|info  |name|other|random|
+------+----+-----+------+
|[1, b]|a   |7    |null  |
|null  |null|null |x     |
+------+----+-----+------+
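Since columnName is now a struct, the nested info fields can also be selected directly with dot notation, for example:
// Flatten the nested info struct as well
df2.select(
  col("columnName.name").as("name"),
  col("columnName.other").as("other"),
  col("columnName.info.age").as("age"),
  col("columnName.info.grade").as("grade")
).show(false)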
Edit:
You can try this way if you can use the get_json_object function:
// Get the list of columns dynamically
val columns = spark.read.json(df1.select("columnName").rdd.map(x => x(0).toString)).columns
// Define an empty array of Column type and use the get_json_object function to extract the columns
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, get_json_object}

var extract_columns: Array[Column] = Array()
columns.foreach { column =>
  extract_columns :+= get_json_object(col("columnName"), "$." + column).as(column)
}
df1.select(extract_columns: _*).show(false)
+-----------------------+----+-----+------+
|info                   |name|other|random|
+-----------------------+----+-----+------+
|{"grade":"b","age":"1"}|a   |7    |null  |
|null                   |null|null |x     |
+-----------------------+----+-----+------+
Please note that the info column here is not of struct type; it is still a JSON string. You may have to follow a similar approach to extract the columns of the nested JSON, as sketched below.
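For example, a minimal sketch (reusing df1 from above) that extracts the nested info fields with JSON path expressions:
// get_json_object supports nested paths, so fields inside info can be
// pulled straight out of the original JSON string
df1.select(
  get_json_object(col("columnName"), "$.name").as("name"),
  get_json_object(col("columnName"), "$.info.age").as("age"),
  get_json_object(col("columnName"), "$.info.grade").as("grade")
).show(false)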