I'm currently working on a Scala job that ingests data from JSON files into Hive tables, but I've encountered some files that contain a row/entry with an invalid format. Here's an example:
[{"name":"John", "age":30, "address":"15 yemen road Yemen"},
{"name":"John", "age":30, "address":"",15 yemen road Yemen"}]
The address in the second entry is what causes the failure, and the idea is to just drop that row. I already tried adding DROPMALFORMED mode, but it still isn't working.
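For reference, the read looks roughly like this; the path and Hive table name are placeholders, and multiLine is set because the array spans several lines:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("JsonToHive")
  .enableHiveSupport()
  .getOrCreate()

// The whole file is one JSON array, so multiLine is required;
// DROPMALFORMED is the mode that doesn't seem to help here.
val df = spark.read
  .option("multiLine", "true")
  .option("mode", "DROPMALFORMED")
  .json("/data/input/people.json") // placeholder path

df.write.mode("overwrite").saveAsTable("staging.people") // placeholder table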
CodePudding user response:
You may want to remove the square brackets and turn your input into NDJSON (newline-delimited JSON) format, one JSON object per line:
{"name":"John", "age":30, "address":"15 yemen road Yemen"}
{"name":"John", "age":30, "address":"",15 yemen road Yemen"}
With that input, Spark's DROPMALFORMED mode will remove only the bad line, whereas with the current array input it removes the whole array as a single malformed record.
"Loads a JSON file (one object per line) and returns the result as a DataFrame"