I'm having trouble reading items from a JSON file and getting the data out of it into columns. The file looks like this:
{
  "sample": [
    {
      "value": "Red",
      "id": "1"
    },
    {
      "value": "green",
      "id": "2"
    },
    {
      "value": "orange",
      "id": "3"
    }
  ],
  "scientific_names": "Buxus microphylla",
  "gender": "bushes",
  "examples": "Oleander"
}
I would like to get this JSON object into a dataframe like this:
+------------------+-------------------+--------+----------+
| sample           | scientific_names  | gender | examples |
+------------------+-------------------+--------+----------+
| Red,green,orange | Buxus microphylla | bushes | Oleander |
+------------------+-------------------+--------+----------+
Can anyone help me, please? Thank you!
CodePudding user response:
You can simply pass a proper schema and select the value field of the sample column:
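# a.json
# {...} your full json sample
from pyspark.sql import SparkSession
from pyspark.sql import functions as F, types as T

spark = SparkSession.builder.getOrCreate()

# Explicit schema: `sample` is an array of {id, value} structs
schema = T.StructType([
    T.StructField('sample', T.ArrayType(T.StructType([
        T.StructField('id', T.StringType()),
        T.StructField('value', T.StringType())
    ]))),
    T.StructField('scientific_names', T.StringType()),
    T.StructField('gender', T.StringType()),
    T.StructField('examples', T.StringType()),
])

(spark
    .read
    .json('a.json', schema=schema, multiLine=True)
    # Selecting a field of an array of structs yields an array of that field
    .withColumn('sample', F.col('sample.value'))
    .show(10, False)
)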
# Output
# +--------------------+-----------------+------+--------+
# |sample              |scientific_names |gender|examples|
# +--------------------+-----------------+------+--------+
# |[Red, green, orange]|Buxus microphylla|bushes|Oleander|
# +--------------------+-----------------+------+--------+
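Note that the output above keeps sample as an array, while the question asked for a single comma-separated string. In Spark, wrapping the selection in F.concat_ws(',', F.col('sample.value')) should give that string, since concat_ws accepts an array-of-strings column. The plain-Python sketch below shows the same reshaping without a Spark session, which can be handy for checking the expected row on a small file. The helper name flatten_record and the inline RAW string are assumptions made for illustration; they are not part of the original answer.

```python
import json

# Inline copy of the sample document from the question (stands in for a.json).
RAW = """
{
  "sample": [
    {"value": "Red", "id": "1"},
    {"value": "green", "id": "2"},
    {"value": "orange", "id": "3"}
  ],
  "scientific_names": "Buxus microphylla",
  "gender": "bushes",
  "examples": "Oleander"
}
"""


def flatten_record(record, sep=","):
    """Hypothetical helper: collapse the 'sample' array into one joined string.

    Mirrors what selecting sample.value and then joining with a separator
    does on the Spark side; all other keys are copied unchanged.
    """
    values = [item["value"] for item in record.get("sample", [])]
    flat = dict(record)  # shallow copy so the input is not mutated
    flat["sample"] = sep.join(values)
    return flat


row = flatten_record(json.loads(RAW))
print(row["sample"])
```

The resulting dict has one entry per target column, so it lines up with the table requested in the question.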