How to transpose JSON structs and arrays in PySpark-CodePudding

I have the following Json file that I'm reading into a dataframe.

{
  "details": {
    "box": [
      {
        "Touchdowns": "123",
        "field": "Texans"
      },
      {
        "Touchdowns": "456",
        "field": "Ravens"
      }
    ]
  },
  "name": "Team"
}

How could I manipulate this to get the following output?

Team	Touchdowns
Texans	123
Ravens	456

I'm struggling a bit with whether I need to pivot/transpose the data or if there is a more elegant approach.

CodePudding user response：

Read the multiline json into spark

df = spark.read.json('/path/to/scores.json',multiLine=True)

Schema

df:pyspark.sql.dataframe.DataFrame
    details:struct
        box:array
           element:struct
               Touchdowns:string
               field:string
     name:string

All of the info you want is in the first row, so get that and drill down to details and box and make that your new dataframe.

spark.createDataFrame(df.first()['details']['box']).withColumnRenamed('field','Team').show()

Output

 ---------- ------ 
|Touchdowns|  Team|
 ---------- ------ 
|       123|Texans|
|       456|Ravens|
 ---------- ------

CodePudding user response：

You can use the inline function.

df = spark.read.load(json_file_path, format='json', multiLine=True)
df = df.selectExpr('inline(details.box)').withColumnRenamed('field', 'Team')
df.show(truncate=False)

CodePudding user response：

You can try using a rdd to get the values of box list.

Input JSON

jsonstr="""{
  "details": {
    "box": [
      {
        "Touchdowns": "123",
        "field": "Texans"
      },
      {
        "Touchdowns": "456",
        "field": "Ravens"
      }
    ]
  },
  "name": "Team"
}"""

Now convert it to an rdd using the keys of dictionary as below -

import json
box_rdd = sc.parallelize(json.loads(jsonstr)['details']['box'])
box_rdd.collect()

Output - [{'Touchdowns': '123', 'field': 'Texans'},
 {'Touchdowns': '456', 'field': 'Ravens'}]

Finally create the dataframe with this box_rdd as below -

from pyspark.sql.types import *
schema = StructType([StructField('Touchdowns', StringType(), True), StructField('field', StringType(), True)])

df = spark.createDataFrame(data=box_rdd,schema=schema)
df.show()

 ---------- ------ 
|Touchdowns| field|
 ---------- ------ 
|       123|Texans|
|       456|Ravens|
 ---------- ------