Home > Back-end >  Spark dataframe map root key with elements of array of another column of string type
Spark dataframe map root key with elements of array of another column of string type

Time:12-08

Actually I am stuck in a problem where I have a dataframe with 2 columns having schema

    scala> df1.printSchema
    root
     |-- actions: string (nullable = true)
     |-- event_id: string (nullable = true)

actions column actually contains as array of objects but it's type is string and hence I can't use explode here

Sample data :

------------------------------------------------------------------------------------------------------------------
| event_id |                                        actions                                                       |
------------------------------------------------------------------------------------------------------------------
|    1     | [{"name": "Vijay", "score": 843},{"name": "Manish", "score": 840}, {"name": "Mayur", "score": 930}]  |
------------------------------------------------------------------------------------------------------------------

There are some other keys present in each object of actions, but for simplicity I have taken 2 here.

I want to convert this to below format

OUTPUT :-

---------------------------------------
| event_id | name      |    score      |
---------------------------------------
|   1      | Vijay     |    843        |
---------------------------------------
|   2      | Manish    |    840        |
---------------------------------------
|   3      | Mayur     |    930        |
---------------------------------------

how can I do this with spark dataframe?

I have tried to read actions column using

val df2= spark.read.option("multiline",true).json(df1.rdd.map(row => row.getAs[String]("actions")))

but here I am not able to map event_id with each line.

CodePudding user response:

You can do this by using the from_json function. This function has 2 inputs:

  • A column from which we want to read json string from
  • A schema with which to parse the json string

That would look something like this:

import spark.implicits._
import org.apache.spark.sql.types._

// Reading in your data
val df = spark.read.option("sep", ";").option("header", "true").csv("./csvWithJson.csv")

df.show(false)
 -------- ---------------------------------------------------------------------------------------------------                                                                                                                                                                   
|event_id|actions                                                                                            |                                                                                                                                                                  
 -------- ---------------------------------------------------------------------------------------------------                                                                                                                                                                   
|1       |[{"name": "Vijay", "score": 843},{"name": "Manish", "score": 840}, {"name": "Mayur", "score": 930}]|                                                                                                                                                                  
 -------- --------------------------------------------------------------------------------------------------- 


// Creating the necessary schema for the from_json function
val actionsSchema = ArrayType(
  new StructType()
    .add("name", StringType)
    .add("score", IntegerType)
  )

// Parsing the json string into our schema, exploding the column to make one row
// per json object in the array and then selecting the wanted columns,
// unwrapping the parsedActions column into separate columns
val parsedDf = df
  .withColumn("parsedActions",explode(from_json(col("actions"), actionsSchema)))
  .drop("actions")
  .select("event_id", "parsedActions.*")

parsedDf.show(false)
 -------- ------ -----                                                                                                                                                                                                                                                          
|event_id|  name|score|                                                                                                                                                                                                                                                         
 -------- ------ -----                                                                                                                                                                                                                                                          
|       1| Vijay|  843|                                                                                                                                                                                                                                                         
|       1|Manish|  840|                                                                                                                                                                                                                                                         
|       1| Mayur|  930|                                                                                                                                                                                                                                                         
 -------- ------ ----- 

Hope this helps!

  • Related