I have this simple JSON file:
{"adas":{"parkAssist":{"rear":{"alarm":false,"muted":false},"front":{"alarm":false,"muted":false}},"lane":{"keepAssist":{"right":false,"left":false}}}}
But when I try to read it like this:
spark.read.option("inferSchema", "true") \
.option("multiline", "true") \
.json("///myfile.json") \
.first() \
.asDict()
I get:
{"adas":{"parkAssist":{"rear":{"alarm":false,"muted":false},"front":{"alarm":false,"muted":false}},"lane":{"keepAssist":{"alarm":false,"muted":false}}}}
This is wrong because the fields of adas_lane_keepAssist are not correct: they should be right and left, not alarm and muted.
If I change one of the adas_lane_keepAssist values to true in the source JSON, then the mapping is correct...
I also thought that inferSchema might be the root of the problem, so I defined a custom_schema:
from pyspark.sql.types import BooleanType, StructField, StructType

custom_schema = StructType([
    StructField("adas", StructType([
        StructField("parkAssist", StructType([
            StructField("rear", StructType([
                StructField("alarm", BooleanType(), True),
                StructField("muted", BooleanType(), True)
            ])),
            StructField("front", StructType([
                StructField("alarm", BooleanType(), True),
                StructField("muted", BooleanType(), True)
            ]))
        ])),
        StructField("lane", StructType([
            StructField("keepAssist", StructType([
                StructField("right", BooleanType(), True),
                StructField("left", BooleanType(), True)
            ]))
        ]))
    ]))
])
and read it like this:
spark.read.schema(custom_schema) \
.option("multiline", "true") \
.json("///myfile.json") \
.first() \
.asDict()
And I get the same wrong result:
{"adas":{"parkAssist":{"rear":{"alarm":false,"muted":false},"front":{"alarm":false,"muted":false}},"lane":{"keepAssist":{"alarm":false,"muted":false}}}}
The funny thing is that if I change the order of the fields in my custom_schema like this:
custom_schema = StructType([
    StructField("adas", StructType([
        StructField("lane", StructType([
            StructField("keepAssist", StructType([
                StructField("right", BooleanType(), True),
                StructField("left", BooleanType(), True)
            ]))
        ])),
        StructField("parkAssist", StructType([
            StructField("rear", StructType([
                StructField("alarm", BooleanType(), True),
                StructField("muted", BooleanType(), True)
            ])),
            StructField("front", StructType([
                StructField("alarm", BooleanType(), True),
                StructField("muted", BooleanType(), True)
            ]))
        ]))
    ]))
])
Now every field of adas_parkAssist_front/rear is wrong:
{"adas":{"lane":{"keepAssist":{"right":false,"left":false}}, "parkAssist":{"rear":{"right":false,"left":false},"front":{"right":false,"left":false}}}}
Is this a limitation of PySpark?
CodePudding user response:
It's very strange to me too. I tried first, head and collect, but they all returned the same distorted structure. When I printed the schema before calling them, it was correct. So the problem seems to be that first, head and collect do not work correctly with nested structs...
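To pin down exactly which nested keys get swapped, the collected result can be compared with the source JSON by key path. The following is only a minimal sketch in plain Python: key_paths is a hypothetical helper (not part of PySpark), and the collected dict is copied by hand from the output reported in the question. In a real session it would presumably come from something like df.first().asDict(recursive=True).

```python
import json


def key_paths(value, prefix=""):
    """Return the set of dotted key paths found in a nested dict."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= key_paths(child, path)
    return paths


# The source JSON from the question.
source = json.loads(
    '{"adas":{"parkAssist":{"rear":{"alarm":false,"muted":false},'
    '"front":{"alarm":false,"muted":false}},'
    '"lane":{"keepAssist":{"right":false,"left":false}}}}'
)

# The distorted structure reported in the question, copied by hand
# (assumption: this stands in for the dict returned by the collected Row).
collected = {
    "adas": {
        "parkAssist": {
            "rear": {"alarm": False, "muted": False},
            "front": {"alarm": False, "muted": False},
        },
        "lane": {"keepAssist": {"alarm": False, "muted": False}},
    }
}

# Key paths present in the file but absent from the collected row, and vice versa.
missing = sorted(key_paths(source) - key_paths(collected))
unexpected = sorted(key_paths(collected) - key_paths(source))
print("missing:", missing)
print("unexpected:", unexpected)
```

For the data above, this reports the two keepAssist keys (left and right) as missing and alarm and muted under keepAssist as unexpected, which matches the mismatch described in the question.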
Looking for a workaround, I transformed the whole schema (which was correct after reading the JSON file) to a map type.
from pyspark.sql import functions as F

df = spark.read.json(r"path\test_file.json")
# Rebuild the nested struct as nested maps, level by level.
df = df.withColumn('adas', F.create_map(
    F.lit('lane'), F.create_map(
        F.lit('keepAssist'), F.create_map(
            F.lit('left'), F.col('adas.lane.keepAssist.left'),
            F.lit('right'), F.col('adas.lane.keepAssist.right')
        )
    ),
    F.lit('parkAssist'), F.create_map(
        F.lit('front'), F.create_map(
            F.lit('alarm'), F.col('adas.parkAssist.front.alarm'),
            F.lit('muted'), F.col('adas.parkAssist.front.muted')
        ),
        F.lit('rear'), F.create_map(
            F.lit('alarm'), F.col('adas.parkAssist.rear.alarm'),
            F.lit('muted'), F.col('adas.parkAssist.rear.muted')
        )
    )
))
print(df.head().asDict())
# {'adas': {'lane': {'keepAssist': {'left': False, 'right': False}}, 'parkAssist': {'rear': {'alarm': False, 'muted': False}, 'front': {'alarm': False, 'muted': False}}}}
CodePudding user response:
My Spark version was 3.1.1.
After updating it to 3.2.0, the custom nested schema was read as expected.
Thanks!
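If upgrading right away is not possible, a small guard can at least flag sessions that might be affected. This is only a sketch: version_tuple and may_distort_nested_rows are hypothetical helpers, and the 3.2.0 threshold is an assumption based solely on the versions mentioned in this thread, not on a confirmed fix version. In a real session the version string could be taken from spark.version.

```python
import re


def version_tuple(version):
    """Turn a version string such as '3.1.1' into a tuple of ints (3, 1, 1)."""
    parts = []
    for piece in version.split(".")[:3]:
        # Keep only the leading digits, so suffixes like '0-SNAPSHOT' still parse.
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


# Assumption: 3.2.0 is used as the threshold only because it is the version
# reported to work in this thread.
ASSUMED_FIXED_IN = (3, 2, 0)


def may_distort_nested_rows(spark_version):
    """Return True if the version is older than the assumed fixed version."""
    return version_tuple(spark_version) < ASSUMED_FIXED_IN


print(may_distort_nested_rows("3.1.1"))  # True
print(may_distort_nested_rows("3.2.0"))  # False
```

When the guard returns True, the create_map workaround from the other answer could be applied before calling first, head or collect.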