Array of JSON to Dataframe in pyspark

I'm having trouble reading a JSON file and getting its data into DataFrame columns.

{
   "sample":[
      {
         "value":"Red",
         "id":"1"
      },
      {
         "value":"green",
         "id":"2"
      },
      {
         "value":"orange",
         "id":"3"
      }
   ],
   "scientific_names":"Buxus microphylla",
   "gender":"bushes",
   "examples":"Oleander"
}

I would like to get this JSON object into a DataFrame like this:

+------------------+-------------------+--------+----------+
| sample           | scientific_names  | gender | examples |
+------------------+-------------------+--------+----------+
| Red,green,orange | Buxus microphylla | bushes | Oleander |
+------------------+-------------------+--------+----------+

Can anyone help me, please? Thank you!

CodePudding user response:

You can pass an explicit schema when reading the file and then replace sample with just its value field, which gives an array of the values:

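# a.json
# {...} your full json sample

from pyspark.sql import SparkSession, functions as F, types as T

# Reuse the active session if there is one (e.g. in a notebook or the pyspark shell)
spark = SparkSession.builder.getOrCreate()

# Explicit schema: 'sample' is an array of {id, value} structs
schema = T.StructType([
    T.StructField('sample', T.ArrayType(T.StructType([
        T.StructField('id', T.StringType()),
        T.StructField('value', T.StringType())
    ]))),
    T.StructField('scientific_names', T.StringType()),
    T.StructField('gender', T.StringType()),
    T.StructField('examples', T.StringType()),
])

(spark
    .read
    # multiLine=True because the JSON object spans several lines
    .json('a.json', schema=schema, multiLine=True)
    # 'sample.value' on an array of structs yields an array of the 'value' fields
    .withColumn('sample', F.col('sample.value'))
    .show(10, False)
)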

# Output
# +--------------------+-----------------+------+--------+
# |sample              |scientific_names |gender|examples|
# +--------------------+-----------------+------+--------+
# |[Red, green, orange]|Buxus microphylla|bushes|Oleander|
# +--------------------+-----------------+------+--------+