I have JSON data that I want to represent in tabular form and later write out to a different format (Parquet).
Schema
root
|-- : string (nullable = true)
sample data
+----------------------------------------------+
|{"deviceTypeId":"A2A","deviceId":"123","geo...|
|{"deviceTypeId":"A2B","deviceId":"456","geo...|
+----------------------------------------------+
Expected Output
+------------+--------+---+
|deviceTypeId|deviceId|...|
+------------+--------+---+
|         A2A|     123|   |
|         A2B|     456|   |
+------------+--------+---+
I tried splitting the string, but this doesn't seem like an efficient approach
split_col = split(df_explode[''], ',')
And then extracted the columns, but the split items still carry fragments of the original string (braces, quotes, key names).
df_1 = df_explode.withColumn('deviceId',split_col.getItem(1))
# df_1 = df_explode.withColumn('deviceTypeId',split_col.getItem(0))
printOutput(df_1)
I'm looking for a better way to solve this problem
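To see why splitting on commas is fragile even outside Spark, here is a minimal plain-Python sketch (the `geo` value with an embedded comma is a made-up example): a naive split cuts through values that contain commas, while a real JSON parser handles them correctly.

```python
import json

raw = '{"deviceTypeId":"A2A","deviceId":"123","geo":"US,CA"}'

# Naive comma split: the "US,CA" value is cut in half,
# and every piece still carries braces, quotes and key names.
parts = raw.split(',')

# Proper JSON parsing recovers the structure intact.
record = json.loads(raw)
```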
CodePudding user response:
The explode function only works on arrays. In your case the column holds a JSON string, so you should use the from_json function.
Please refer to from_json in pyspark.sql.functions.
CodePudding user response:
I was able to do it using the from_json function.
#Convert json column to multiple columns
schema = getSchema()
dfJSON = df_explode.withColumn("jsonData",from_json(col(''),schema)) \
.select("jsonData.*")
dfJSON.printSchema()
dfJSON.limit(100).toPandas()
We need to create a schema (a StructType) that from_json will use to parse the JSON data.
from pyspark.sql.types import StructType, StructField, StringType

def getSchema():
    schema = StructType([
        StructField('deviceTypeId', StringType()),
        StructField('deviceId', StringType()),
        ...
    ])
    return schema
The column name is an empty string in this data (see the schema above), which is why col('') is used to reference it.