Home > Back-end >  How to explode a one row Spark dataframe in JSON format to multiple rows?
How to explode a one row Spark dataframe in JSON format to multiple rows?

Time:12-01

Context: I'm learning PySpark and I am trying to run a sentiment analysis on tweets. After loading the data (that is in JSON format), I want to store it in a Spark Dataframe for preprocessing (removing uncessary symbols/words). Currently, I'm using an intermediate step that I want to eliminate: I'm loading the JSON into a pandas DataFrame and then to the spark Dataframe and it all works well.

However, when loading the JSON directly onto a PySpark DataFrame, all the data is stored in one row only.

How I'm loading the data:

df = spark.read.json("dbfs:/FileStore/tables/json_twitter.json").select("full_text")

The df is constituted by only one row and one column (full_text) with the following format:

{"0": "Hello", "1": "Tweet","2": "Bye"}

How can I efficiently turn this into a "normal" dataframe, having one row for each word?

Thank you

CodePudding user response:

If the value inside fulltext is a string you may first convert it to a map type using from_json example

from pyspark.sql import functions as F
from pyspark.sql import types as T

df = df.withColumn("fulltext",F.from_json("fulltext",T.MapType(T.StringType(),T.StringType())))

before applying the explode function to split the values into multiple rows eg:

from pyspark.sql import functions as F
from pyspark.sql import types as T

df = df.select(F.explode("fulltext"))

df.show(truncate=False)
 --- ----- 
|key|value|
 --- ----- 
|0  |Hello|
|1  |Tweet|
|2  |Bye  |
 --- ----- 

Edit 1

If the value inside fulltext is a struct, you may first

  1. cast it to a string using cast
  2. replace extra character braces using regexp_replace
  3. split the string by comma using split
  4. exploding the split value to get the desired rows using explode

eg

from pyspark.sql import functions as F
from pyspark.sql import types as T

df = df.withColumn("fulltext",F.col("fulltext").cast("string"))
df.printSchema() # only for debugging purposes
df.show() # only for debugging purposes

df = df.withColumn("fulltext",F.explode(F.split(F.regexp_replace("fulltext","\\{|\\}",""),",")))

df.show() # only for debugging purposes
root
 |-- fulltext: string (nullable = false)

 ------------------- 
|           fulltext|
 ------------------- 
|{Hello, Tweet, Bye}|
 ------------------- 

 -------- 
|fulltext|
 -------- 
|   Hello|
|   Tweet|
|     Bye|
 -------- 
  • Related