Spark dataframe map root key with elements of array of another column of string type-CodePudding

Actually I am stuck in a problem where I have a dataframe with 2 columns having schema

    scala> df1.printSchema
    root
     |-- actions: string (nullable = true)
     |-- event_id: string (nullable = true)

actions column actually contains as array of objects but it's type is string and hence I can't use explode here

Sample data :

------------------------------------------------------------------------------------------------------------------
| event_id |                                        actions                                                       |
------------------------------------------------------------------------------------------------------------------
|    1     | [{"name": "Vijay", "score": 843},{"name": "Manish", "score": 840}, {"name": "Mayur", "score": 930}]  |
------------------------------------------------------------------------------------------------------------------

There are some other keys present in each object of actions, but for simplicity I have taken 2 here.

I want to convert this to below format

OUTPUT :-

---------------------------------------
| event_id | name      |    score      |
---------------------------------------
|   1      | Vijay     |    843        |
---------------------------------------
|   2      | Manish    |    840        |
---------------------------------------
|   3      | Mayur     |    930        |
---------------------------------------

how can I do this with spark dataframe?

I have tried to read actions column using

val df2= spark.read.option("multiline",true).json(df1.rdd.map(row => row.getAs[String]("actions")))

but here I am not able to map event_id with each line.

CodePudding user response：

You can do this by using the from_json function. This function has 2 inputs:

A column from which we want to read json string from
A schema with which to parse the json string

That would look something like this:

import spark.implicits._
import org.apache.spark.sql.types._

// Reading in your data
val df = spark.read.option("sep", ";").option("header", "true").csv("./csvWithJson.csv")

df.show(false)
 -------- ---------------------------------------------------------------------------------------------------                                                                                                                                                                   
|event_id|actions                                                                                            |                                                                                                                                                                  
 -------- ---------------------------------------------------------------------------------------------------                                                                                                                                                                   
|1       |[{"name": "Vijay", "score": 843},{"name": "Manish", "score": 840}, {"name": "Mayur", "score": 930}]|                                                                                                                                                                  
 -------- --------------------------------------------------------------------------------------------------- 


// Creating the necessary schema for the from_json function
val actionsSchema = ArrayType(
  new StructType()
    .add("name", StringType)
    .add("score", IntegerType)
  )

// Parsing the json string into our schema, exploding the column to make one row
// per json object in the array and then selecting the wanted columns,
// unwrapping the parsedActions column into separate columns
val parsedDf = df
  .withColumn("parsedActions",explode(from_json(col("actions"), actionsSchema)))
  .drop("actions")
  .select("event_id", "parsedActions.*")

parsedDf.show(false)
 -------- ------ -----                                                                                                                                                                                                                                                          
|event_id|  name|score|                                                                                                                                                                                                                                                         
 -------- ------ -----                                                                                                                                                                                                                                                          
|       1| Vijay|  843|                                                                                                                                                                                                                                                         
|       1|Manish|  840|                                                                                                                                                                                                                                                         
|       1| Mayur|  930|                                                                                                                                                                                                                                                         
 -------- ------ -----

Hope this helps!