parsing large JSON with java/GSON, can't read the JSON structure

Time:09-22

I'm trying to parse a large (about 10 GB) JSON database dump from MusicBrainz.org, using Java and Gson.

The JSON file has the structure shown below. There is no '[' or ']' to indicate an array of objects, and no ',' between the objects. I don't know why, but that is how the file is.

{
    "id": "d0ab06e1-751a-414b-a976-da72670391b1",
    "name": "Arcing Wires",
    "sort-name": "Arcing Wires"
}
{
    "id": "6f0c2c16-dd7e-4268-a484-bc7b2ac78108",
    "name": "Another",
    "sort-name": "Another"
}
{
    "id": "e062b6cd-5506-47b0-afdb-72f4279ec38c",
    "name": "Agent S",
    "sort-name": "Agent S"
}

and this is the code that I'm using:

try (JsonReader jsonReader = new JsonReader(
        new InputStreamReader(
                new FileInputStream(jsonFilePath), StandardCharsets.UTF_8))) {
    Gson gson = new GsonBuilder().create();
    jsonReader.beginArray();
    while (jsonReader.hasNext()) {
        Artist mapped = gson.fromJson(jsonReader, Artist.class);
        //TODO do something with the object
    }
    jsonReader.endArray();
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}

and the class that I mapped is this:

public class Artist {

    @SerializedName("id")
    public String id;
    @SerializedName("name")
    public String name;
    @SerializedName("sort-name")
    public String sortName;

}

the error I'm getting:

Exception in thread "main" java.lang.IllegalStateException: Expected BEGIN_ARRAY but was BEGIN_OBJECT at line 1 column 2 path $
at com.google.gson.stream.JsonReader.beginArray(JsonReader.java:350)
at DBLoader.parse(DBLoader.java:39)
at DBLoader.main(DBLoader.java:23)

I believe Gson expects a different structure from the one I declared, but I don't understand how to describe this kind of JSON, with no commas and no brackets. Any clues? Thanks.

CodePudding user response:

A standard JSON document holds exactly one top-level value (and yes, a single one of your objects would be a valid JSON document on its own). JSON streaming, however, uses various techniques to concatenate multiple JSON elements into a single stream, on the assumption that the stream consumer knows how to parse it. Gson's JsonReader supports a so-called lenient mode, enabled with setLenient, that turns off the "one top-level value only" restriction (and relaxes a few other things that are irrelevant to the question). With lenient mode on, you can read JSON elements one by one. This works for line-delimited JSON and for concatenated JSON values, because in both cases the values are separated by zero or more whitespace characters, which Gson ignores. The more exotic record-separator-delimited and length-prefixed JSON formats are therefore not supported. Your code fails because it assumes that the stream contains a single JSON array, which it does not: it is a stream of elements that does not follow the JSON array syntax.
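Applied to the code in the question, this means dropping beginArray()/endArray(), switching the reader to lenient mode, and looping until END_DOCUMENT. Below is a minimal sketch, assuming the Artist mapping from the question; the class name LenientReadSketch and the forEachArtist helper are made up for illustration:

```java
import com.google.gson.Gson;
import com.google.gson.annotations.SerializedName;
import com.google.gson.stream.JsonReader;
import com.google.gson.stream.JsonToken;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

// Hypothetical helper class, not part of Gson.
final class LenientReadSketch {

    // Same mapping as the Artist class in the question.
    static final class Artist {
        @SerializedName("id") String id;
        @SerializedName("name") String name;
        @SerializedName("sort-name") String sortName;
    }

    // Hands every top-level JSON object to the handler and returns how many were read.
    static int forEachArtist(Reader in, Consumer<Artist> handler) {
        Gson gson = new Gson();
        int count = 0;
        try (JsonReader jsonReader = new JsonReader(in)) {
            // Lenient mode permits more than one top-level value in the stream.
            jsonReader.setLenient(true);
            // No beginArray()/endArray(): the input is not a JSON array.
            while (jsonReader.peek() != JsonToken.END_DOCUMENT) {
                Artist artist = gson.fromJson(jsonReader, Artist.class);
                handler.accept(artist);
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        String sample = "{\"id\":\"1\",\"name\":\"Arcing Wires\",\"sort-name\":\"Arcing Wires\"}\n"
                + "{\"id\":\"2\",\"name\":\"Another\",\"sort-name\":\"Another\"}";
        int read = forEachArtist(new StringReader(sample),
                a -> System.out.println(a.id + " " + a.name));
        System.out.println(read + " artists read");
    }
}
```

For the 10 GB dump, you would pass a reader over the file instead of a StringReader, for example Files.newBufferedReader(path, StandardCharsets.UTF_8), and do the per-artist work inside the handler so that nothing accumulates in memory.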

A simple generic JSON stream helper might look like this. It uses the Stream API because it offers a richer API than Iterator, but it is only meant to illustrate the idea; you can easily adapt it to iterators, callbacks, observable streams, or whatever you like:

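// @UtilityClass comes from Lombok and @WillNotClose from the JSR-305 annotations;
// both are optional and can be removed without changing the behavior.
@UtilityClass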
public final class JsonStreamSupport {

    public static <T> Stream<T> parse(@WillNotClose final JsonReader jsonReader, final Function<? super JsonReader, ? extends T> readElement) {
        final boolean isLenient = jsonReader.isLenient();
        jsonReader.setLenient(true);
        final Spliterator<T> spliterator = new Spliterators.AbstractSpliterator<T>(Long.MAX_VALUE, Spliterator.ORDERED) {
            @Override
            public boolean tryAdvance(final Consumer<? super T> action) {
                try {
                    final JsonToken token = jsonReader.peek();
                    if ( token == JsonToken.END_DOCUMENT ) {
                        return false;
                    }
                    // TODO: read more elements in batch
                    final T element = readElement.apply(jsonReader);
                    action.accept(element);
                    return true;
                } catch ( final IOException ex ) {
                    // IOException is checked; rethrow it unchecked while keeping the cause
                    throw new UncheckedIOException(ex);
                }
            }
        };
        return StreamSupport.stream(spliterator, false)
                .onClose(() -> jsonReader.setLenient(isLenient));
    }

}

And then:

JsonStreamSupport.<Artist>parse(jsonReader, jr -> gson.fromJson(jr, Artist.class))
        .forEach(System.out::println);

Output (assuming Artist has Lombok-generated toString()):

Artist(id=d0ab06e1-751a-414b-a976-da72670391b1, name=Arcing Wires, sortName=Arcing Wires)
Artist(id=6f0c2c16-dd7e-4268-a484-bc7b2ac78108, name=Another, sortName=Another)
Artist(id=e062b6cd-5506-47b0-afdb-72f4279ec38c, name=Agent S, sortName=Agent S)
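One detail worth noting about the helper above: its onClose handler, which restores the reader's previous lenient setting, only runs when the stream is closed, and a terminal operation such as forEach does not close a stream. The JDK-only sketch below illustrates that behavior with a stand-in stream (OnCloseSketch and its flag are hypothetical names, not part of the answer's code):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Stream;

// Hypothetical demo class; the flag stands in for "lenient setting restored".
final class OnCloseSketch {

    // Returns { handler ran after plain forEach, handler ran after try-with-resources }.
    static boolean[] demo() {
        AtomicBoolean restored = new AtomicBoolean(false);

        // Stand-in for the stream returned by JsonStreamSupport.parse(...).
        Stream<String> first = Stream.of("a", "b").onClose(() -> restored.set(true));
        first.forEach(s -> { });   // terminal operation; does not close the stream
        boolean afterForEach = restored.get();

        try (Stream<String> second = Stream.of("a", "b").onClose(() -> restored.set(true))) {
            second.forEach(s -> { });
        }                          // close() runs the onClose handler here
        boolean afterTry = restored.get();

        return new boolean[] { afterForEach, afterTry };
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println(result[0]); // prints false
        System.out.println(result[1]); // prints true
    }
}
```

So if restoring the original setting matters to you, wrap the call to parse in a try-with-resources block.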

How many bytes does JSON streaming actually save, to justify its use by the service you're consuming? I don't know.

CodePudding user response:

It looks like the JSONL (JSON Lines) format, where every line is a valid JSON value.
If each object in your file sits on a single line, you can read the file line by line and convert each line to an object. I think it will work.
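The sample in the question spreads each object over several lines, so a plain line-by-line read would not see complete objects there. As a fallback that is a different technique from reading lines, here is a minimal JDK-only sketch that cuts the input into top-level {...} chunks by tracking brace depth and string literals. The class and method names are hypothetical, and it assumes the top-level values are all objects:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

// Hypothetical helper; assumes every top-level value is a JSON object.
final class ConcatenatedJsonSplitter {

    // Calls handler once per top-level {...} object and returns the number of objects.
    static int split(Reader in, Consumer<String> handler) {
        StringBuilder current = new StringBuilder();
        int depth = 0;            // nesting level of { and [
        boolean inString = false; // true while inside a "..." literal
        boolean escaped = false;  // true right after a backslash inside a string
        int count = 0;
        try {
            int c;
            while ((c = in.read()) != -1) {
                char ch = (char) c;
                if (depth == 0 && ch != '{') {
                    continue; // skip whitespace between objects
                }
                current.append(ch);
                if (inString) {
                    if (escaped) {
                        escaped = false;
                    } else if (ch == '\\') {
                        escaped = true;
                    } else if (ch == '"') {
                        inString = false;
                    }
                } else if (ch == '"') {
                    inString = true;
                } else if (ch == '{' || ch == '[') {
                    depth++;
                } else if (ch == '}' || ch == ']') {
                    depth--;
                    if (depth == 0) { // a complete top-level object has been read
                        handler.accept(current.toString());
                        current.setLength(0);
                        count++;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        String sample = "{\n  \"name\": \"Arcing Wires\"\n}\n{\n  \"name\": \"Another\"\n}";
        int found = split(new StringReader(sample), chunk -> System.out.println(chunk));
        System.out.println(found + " objects found");
    }
}
```

Each emitted string is a complete JSON object and can then be passed to gson.fromJson(chunk, Artist.class). For a large file, wrap the source in a BufferedReader, since the sketch reads one character at a time.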
