Context: I'm learning PySpark and trying to run a sentiment analysis on tweets. After loading the data (which is in JSON format), I want to store it in a Spark DataFrame for preprocessing (removing unnecessary symbols/words). Currently, I use an intermediate step that I want to eliminate: I load the JSON into a pandas DataFrame and then convert that to a Spark DataFrame, which works well.
However, when I load the JSON directly into a PySpark DataFrame, all the data is stored in a single row.
How I'm loading the data:
df = spark.read.json("dbfs:/FileStore/tables/json_twitter.json").select("full_text")
The df consists of only one row and one column (full_text) with the following format:
{"0": "Hello", "1": "Tweet","2": "Bye"}
How can I efficiently turn this into a "normal" dataframe, having one row for each word?
Thank you
Answer:
If the value inside fulltext is a string, you may first convert it to a map type using from_json. Example:
from pyspark.sql import functions as F
from pyspark.sql import types as T
df = df.withColumn("fulltext",F.from_json("fulltext",T.MapType(T.StringType(),T.StringType())))
Then apply the explode function to split the values into multiple rows, e.g.:
from pyspark.sql import functions as F
from pyspark.sql import types as T
df = df.select(F.explode("fulltext"))
df.show(truncate=False)
+---+-----+
|key|value|
+---+-----+
|0  |Hello|
|1  |Tweet|
|2  |Bye  |
+---+-----+
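To sanity-check what from_json followed by explode should produce, the same reshaping can be sketched in plain Python, outside Spark. This is only an illustration of the expected rows, not Spark code: the helper name explode_json_map is hypothetical, and the sample string is the one shown in the question.

```python
import json

# Sample value copied from the question: one JSON object stored as a string.
raw = '{"0": "Hello", "1": "Tweet","2": "Bye"}'


def explode_json_map(text):
    """Parse a JSON object string and return one (key, value) pair per entry.

    Hypothetical helper, for illustration only.
    """
    mapping = json.loads(text)    # plays the role of from_json with a MapType schema
    return list(mapping.items())  # plays the role of explode on a map column


rows = explode_json_map(raw)
for key, value in rows:
    print(key, value)
```

Each pair corresponds to one row of the key/value table above.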
Edit 1
If the value inside fulltext is a struct, you may first:
- cast it to a string using cast
- remove the surrounding braces using regexp_replace
- split the string by comma using split
- explode the split values to get the desired rows using explode
e.g.:
from pyspark.sql import functions as F
from pyspark.sql import types as T
df = df.withColumn("fulltext",F.col("fulltext").cast("string"))
df.printSchema() # only for debugging purposes
df.show() # only for debugging purposes
df = df.withColumn("fulltext",F.explode(F.split(F.regexp_replace("fulltext","\\{|\\}",""),",\\s*"))) # split on commas plus any following whitespace
df.show() # only for debugging purposes
root
 |-- fulltext: string (nullable = false)
+-------------------+
|           fulltext|
+-------------------+
|{Hello, Tweet, Bye}|
+-------------------+
+--------+
|fulltext|
+--------+
|   Hello|
|   Tweet|
|     Bye|
+--------+
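The string clean-up steps for the struct case can also be mirrored in plain Python to see what each one does. This is a rough sketch, not Spark code: the helper name split_struct_text is hypothetical, and the input is assumed to be the string form of the struct shown in the output above.

```python
import re

# Assumed string form of the struct after cast("string"), as in the output above.
struct_as_text = "{Hello, Tweet, Bye}"


def split_struct_text(text):
    """Strip the braces, split on commas, and trim each piece.

    Hypothetical helper, for illustration only.
    """
    without_braces = re.sub(r"\{|\}", "", text)  # same pattern as the regexp_replace call
    return [piece.strip() for piece in without_braces.split(",")]


words = split_struct_text(struct_as_text)
print(words)
```

One caveat with this approach: any tweet whose text itself contains a comma or a brace would be split or altered as well, so it is only safe when the values are known not to contain those characters.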