I'm currently working on a Scala job that ingests data from JSON files into Hive tables, but I've encountered some files that contain a row/entry with an invalid format. Here's an example:
[{"name":"John", "age":30, "address":"15 yemen road Yemen"},
{"name":"John", "age":30, "address":"",15 yemen road Yemen"}]
The address in the second entry is what causes the failure, and the idea is to just drop that row. I already tried adding DROPMALFORMED mode, but it still isn't working.
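For reference, the read looks roughly like this; the path and Hive table name are placeholders, and multiLine is set because the array spans several lines:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("JsonToHive")
  .enableHiveSupport()
  .getOrCreate()

// The whole file is one JSON array, so multiLine is required;
// DROPMALFORMED is the mode that doesn't seem to help here.
val df = spark.read
  .option("multiLine", "true")
  .option("mode", "DROPMALFORMED")
  .json("/data/input/people.json") // placeholder path

df.write.mode("overwrite").saveAsTable("staging.people") // placeholder table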
CodePudding user response:
You may want to remove the square brackets and turn your input into NDJSON (newline-delimited JSON) format, one JSON object per line:
{"name":"John", "age":30, "address":"15 yemen road Yemen"}
{"name":"John", "age":30, "address":"",15 yemen road Yemen"}
With that input, Spark's DROPMALFORMED mode will remove only the bad line, whereas with the current array input it removes the whole array as a single malformed record.
"Loads a JSON file (one object per line) and returns the result as a DataFrame"