How to explode a one row Spark dataframe in JSON format to multiple rows?-CodePudding

Context: I'm learning PySpark and I am trying to run a sentiment analysis on tweets. After loading the data (that is in JSON format), I want to store it in a Spark Dataframe for preprocessing (removing uncessary symbols/words). Currently, I'm using an intermediate step that I want to eliminate: I'm loading the JSON into a pandas DataFrame and then to the spark Dataframe and it all works well.

However, when loading the JSON directly onto a PySpark DataFrame, all the data is stored in one row only.

How I'm loading the data:

df = spark.read.json("dbfs:/FileStore/tables/json_twitter.json").select("full_text")

The df is constituted by only one row and one column (full_text) with the following format:

{"0": "Hello", "1": "Tweet","2": "Bye"}

How can I efficiently turn this into a "normal" dataframe, having one row for each word?

Thank you

CodePudding user response：

If the value inside fulltext is a string you may first convert it to a map type using from_json example

from pyspark.sql import functions as F
from pyspark.sql import types as T

df = df.withColumn("fulltext",F.from_json("fulltext",T.MapType(T.StringType(),T.StringType())))

before applying the explode function to split the values into multiple rows eg:

from pyspark.sql import functions as F
from pyspark.sql import types as T

df = df.select(F.explode("fulltext"))

df.show(truncate=False)

 --- ----- 
|key|value|
 --- ----- 
|0  |Hello|
|1  |Tweet|
|2  |Bye  |
 --- -----

Edit 1

If the value inside fulltext is a struct, you may first

cast it to a string using cast
replace extra character braces using regexp_replace
split the string by comma using split
exploding the split value to get the desired rows using explode

from pyspark.sql import functions as F
from pyspark.sql import types as T

df = df.withColumn("fulltext",F.col("fulltext").cast("string"))
df.printSchema() # only for debugging purposes
df.show() # only for debugging purposes

df = df.withColumn("fulltext",F.explode(F.split(F.regexp_replace("fulltext","\\{|\\}",""),",")))

df.show() # only for debugging purposes

root
 |-- fulltext: string (nullable = false)

 ------------------- 
|           fulltext|
 ------------------- 
|{Hello, Tweet, Bye}|
 ------------------- 

 -------- 
|fulltext|
 -------- 
|   Hello|
|   Tweet|
|     Bye|
 --------