I need to split a very large JSON file (20 GB) into multiple smaller JSON files (say the threshold is 100 MB).
The example file layout looks like this:
file.json
[{"name":"Joe", "Place":"Denver", "phone_number":["980283", "980284", "980285"]},{"name":"kruger", "Place":"boston",
"phone_number":["980281", "980282", "980283"]},{"name":"Dan", "Place":"Texas","phone_number":["980286", "980287", "980286"]}, {"name":"Kyle", "Place":"Newyork", "phone_number":["980282", "980288", "980289"]}]
The output should look like this:
file1:
[{"name":"Joe", "Place":"Denver", "phone_number":["980283", "980284", "980285"]}, {"name":"kruger", "Place":"boston", "phone_number":["980281", "980282", "980283"]}]
file2:
[{"name":"Dan", "Place":"Texas","phone_number":["980286", "980287", "980286"]}, {"name":"Kyle", "Place":"Newyork", "phone_number":["980282", "980288", "980289"]}]
What is the best way to achieve this? Should I opt for a shell command or Python?
CodePudding user response:
The Python module json-stream can do this, with a few caveats, which I'll get to later.
You'll have to implement the visitor pattern.
import json_stream

def visitor(item, path):
    print(f"{item} at path {path}")

with open('mylargejsonfile.json', 'r') as f:
    json_stream.visit(f, visitor)
This visitor function gets called for each complete JSON value encountered, depth-first so to speak: each complete JSON element (number, string, array, object) invokes the callback. It is up to you to decide where to pause processing and write your partial file out.
One thing to look out for: if your input file is a single top-level element (such as one dictionary), you will have to change the output format to something else, because a single element can't be split across files. An absurd example would be trying to split a JSON file like this into two separate files:
{ "top" : [1,2,3] }
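If pulling in json-stream is not an option, the same streaming idea can be sketched with only the standard library: `json.JSONDecoder.raw_decode` pulls one complete element at a time out of a growing buffer, so the top-level array can be read incrementally and re-chunked under a byte threshold. The function and file names, the threshold, and the chunk size below are all illustrative, not from the question:

```python
import json

def split_json_array(in_path, out_prefix, max_bytes=100 * 1024 * 1024,
                     chunk_size=1 << 20):
    """Split a file holding one top-level JSON array into several smaller
    JSON array files of roughly max_bytes each (never splitting an item)."""
    decoder = json.JSONDecoder()
    file_no = 0
    batch = []        # serialized items destined for the current output file
    batch_bytes = 2   # account for the enclosing "[" and "]"

    def flush():
        nonlocal file_no, batch, batch_bytes
        if not batch:
            return
        file_no += 1
        with open(f"{out_prefix}{file_no}.json", "w") as out:
            out.write("[" + ", ".join(batch) + "]")
        batch, batch_bytes = [], 2

    with open(in_path) as f:
        buf = ""
        started = False  # have we consumed the opening "[" yet?
        while True:
            chunk = f.read(chunk_size)
            buf = (buf + chunk).lstrip()
            if not started:
                if buf.startswith("["):
                    buf = buf[1:]
                    started = True
                elif chunk:
                    continue  # still waiting for the opening bracket
            # Decode as many complete items as the buffer currently holds.
            while True:
                buf = buf.lstrip()
                if buf.startswith(","):
                    buf = buf[1:].lstrip()
                if buf.startswith("]"):  # end of the top-level array
                    flush()
                    return
                try:
                    item, end = decoder.raw_decode(buf)
                except json.JSONDecodeError:
                    break  # item is incomplete; read another chunk
                buf = buf[end:]
                text = json.dumps(item)
                if batch and batch_bytes + len(text) + 2 > max_bytes:
                    flush()
                batch.append(text)
                batch_bytes += len(text) + 2
            if not chunk:  # EOF
                flush()
                return
```

Each output file is itself a valid JSON array, so the pieces can be loaded independently; the threshold is approximate, since items are never split across files.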
CodePudding user response:
As long as the file is structured that way, with one item per line and no item in the main list that is itself a sub-list, you can just do a basic string replacement with sed. This is fragile, but relatively fast and memory-efficient, since sed is designed for streaming text.
Here is an example modifying "file.json" in-place:
sed -e 's/^\[//g' -e 's/, *$//g' -e 's/\]$//g' -i file.json
Then each line can be written to a separate file using a basic bash loop with read.
To process the input file without modifying it and write the target files, you can do this:
i=1
sed -e 's/^\[//g' -e 's/, *$//g' -e 's/\]$//g' file.json | while read -r line; do
    echo "[$line]" > "file$i"
    i=$((i+1))
done
For the example file, this creates two files: file1 and file2.