extract multiple columns from a json string

Time:12-07

I have JSON data that I want to represent in tabular form and later write to a different format (Parquet).

Schema

root
|-- : string (nullable = true)

sample data

 ----------------------------------------------
|{"deviceTypeId":"A2A","deviceId":"123","geo...|
|{"deviceTypeId":"A2B","deviceId":"456","geo...|
 ----------------------------------------------

Expected Output

 ------------ -------- ---
|deviceTypeId|deviceId|...|
 ------------ -------- ---
|         A2A|     123|   |
|         A2B|     456|   |
 ------------ -------- ---

I tried splitting the string, but this doesn't seem like an efficient approach:

split_col = split(df_explode[''], ',')

I then extract the columns, but the values still contain parts of the initial string (the JSON keys and quotes).

df_1 = df_explode.withColumn('deviceId',split_col.getItem(1))
# df_1 = df_explode.withColumn('deviceTypeId',split_col.getItem(0))
printOutput(df_1)

I'm looking for a better way to solve this problem

CodePudding user response:

The explode function only works on arrays.

In your case the column holds a JSON string, so you should use the from_json function.

Please refer to from_json in pyspark.sql.functions.

CodePudding user response:

I was able to do it using the from_json function.

#Convert json column to multiple columns
schema = getSchema()
dfJSON = df_explode.withColumn("jsonData",from_json(col(''),schema)) \
                   .select("jsonData.*")
dfJSON.printSchema()
dfJSON.limit(100).toPandas()

We need to create a schema (StructType) that from_json will use to parse the JSON data.

def getSchema():
    schema = StructType([ 
                StructField('deviceTypeId', StringType()),
                StructField('deviceId', StringType()),
                ...
                ])
    return schema

The column name is an empty string in this data, which is why col('') is used to reference it.
