Actually I am stuck in a problem where I have a dataframe with 2 columns having schema
scala> df1.printSchema
root
|-- actions: string (nullable = true)
|-- event_id: string (nullable = true)
actions column actually contains as array of objects but it's type is string and hence I can't use explode here
Sample data :
------------------------------------------------------------------------------------------------------------------
| event_id | actions |
------------------------------------------------------------------------------------------------------------------
| 1 | [{"name": "Vijay", "score": 843},{"name": "Manish", "score": 840}, {"name": "Mayur", "score": 930}] |
------------------------------------------------------------------------------------------------------------------
There are some other keys present in each object of actions, but for simplicity I have taken 2 here.
I want to convert this to below format
OUTPUT :-
---------------------------------------
| event_id | name | score |
---------------------------------------
| 1 | Vijay | 843 |
---------------------------------------
| 2 | Manish | 840 |
---------------------------------------
| 3 | Mayur | 930 |
---------------------------------------
how can I do this with spark dataframe?
I have tried to read actions column using
val df2= spark.read.option("multiline",true).json(df1.rdd.map(row => row.getAs[String]("actions")))
but here I am not able to map event_id with each line.
CodePudding user response:
You can do this by using the from_json
function.
This function has 2 inputs:
- A column from which we want to read json string from
- A schema with which to parse the json string
That would look something like this:
import spark.implicits._
import org.apache.spark.sql.types._
// Reading in your data
val df = spark.read.option("sep", ";").option("header", "true").csv("./csvWithJson.csv")
df.show(false)
-------- ---------------------------------------------------------------------------------------------------
|event_id|actions |
-------- ---------------------------------------------------------------------------------------------------
|1 |[{"name": "Vijay", "score": 843},{"name": "Manish", "score": 840}, {"name": "Mayur", "score": 930}]|
-------- ---------------------------------------------------------------------------------------------------
// Creating the necessary schema for the from_json function
val actionsSchema = ArrayType(
new StructType()
.add("name", StringType)
.add("score", IntegerType)
)
// Parsing the json string into our schema, exploding the column to make one row
// per json object in the array and then selecting the wanted columns,
// unwrapping the parsedActions column into separate columns
val parsedDf = df
.withColumn("parsedActions",explode(from_json(col("actions"), actionsSchema)))
.drop("actions")
.select("event_id", "parsedActions.*")
parsedDf.show(false)
-------- ------ -----
|event_id| name|score|
-------- ------ -----
| 1| Vijay| 843|
| 1|Manish| 840|
| 1| Mayur| 930|
-------- ------ -----
Hope this helps!