I need to split a very large JSON file (20 GB) into multiple smaller JSON files (say the threshold is 100 MB).
The example file layout looks like this:
file.json
[{"name":"Joe", "Place":"Denver", "phone_number":["980283", "980284", "980285"]},{"name":"kruger", "Place":"boston",
"phone_number":["980281", "980282", "980283"]},{"name":"Dan", "Place":"Texas","phone_number":["980286", "980287", "980286"]}, {"name":"Kyle", "Place":"Newyork", "phone_number":["980282", "980288", "980289"]}]
The output should look like this:
file1:
[{"name":"Joe", "Place":"Denver", "phone_number":["980283", "980284", "980285"]}, {"name":"kruger", "Place":"boston", "phone_number":["980281", "980282", "980283"]}]
file2:
[{"name":"Dan", "Place":"Texas","phone_number":["980286", "980287", "980286"]}, {"name":"Kyle", "Place":"Newyork", "phone_number":["980282", "980288", "980289"]}]
What is the best way to achieve this? Should I opt for a shell command or Python?
CodePudding user response:
The Python module json-stream can do this, with a few caveats, which I'll get to later.
You'll have to implement the visitor pattern.
import json_stream

def visitor(item, path):
    print(f"{item} at path {path}")

with open('mylargejsonfile.json', 'r') as f:
    json_stream.visit(f, visitor)
This visitor function gets called for each complete JSON value encountered, depth-first so to speak: each complete JSON element (number, string, array, object) invokes the callback. It is up to you to decide where to pause processing and write your partial file out.
One thing to look out for: if your input file is a single top-level element (such as one dictionary), you will have to change the output format to something else, because a single element can't be split across files. An absurd example would be trying to split a JSON file like this into two separate files:
{ "top" : [1,2,3] }
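If pulling in json-stream is not an option, the same streaming idea can be sketched with only the standard library: `json.JSONDecoder.raw_decode` pulls one complete element at a time out of a growing buffer, so the top-level array can be read incrementally and re-chunked under a byte threshold. The function and file names, the threshold, and the chunk size below are all illustrative, not from the question:

```python
import json

def split_json_array(in_path, out_prefix, max_bytes=100 * 1024 * 1024,
                     chunk_size=1 << 20):
    """Split a file holding one top-level JSON array into several smaller
    JSON array files of roughly max_bytes each (never splitting an item)."""
    decoder = json.JSONDecoder()
    file_no = 0
    batch = []        # serialized items destined for the current output file
    batch_bytes = 2   # account for the enclosing "[" and "]"

    def flush():
        nonlocal file_no, batch, batch_bytes
        if not batch:
            return
        file_no += 1
        with open(f"{out_prefix}{file_no}.json", "w") as out:
            out.write("[" + ", ".join(batch) + "]")
        batch, batch_bytes = [], 2

    with open(in_path) as f:
        buf = ""
        started = False  # have we consumed the opening "[" yet?
        while True:
            chunk = f.read(chunk_size)
            buf = (buf + chunk).lstrip()
            if not started:
                if buf.startswith("["):
                    buf = buf[1:]
                    started = True
                elif chunk:
                    continue  # still waiting for the opening bracket
            # Decode as many complete items as the buffer currently holds.
            while True:
                buf = buf.lstrip()
                if buf.startswith(","):
                    buf = buf[1:].lstrip()
                if buf.startswith("]"):  # end of the top-level array
                    flush()
                    return
                try:
                    item, end = decoder.raw_decode(buf)
                except json.JSONDecodeError:
                    break  # item is incomplete; read another chunk
                buf = buf[end:]
                text = json.dumps(item)
                if batch and batch_bytes + len(text) + 2 > max_bytes:
                    flush()
                batch.append(text)
                batch_bytes += len(text) + 2
            if not chunk:  # EOF
                flush()
                return
```

Each output file is itself a valid JSON array, so the pieces can be loaded independently; the threshold is approximate, since items are never split across files.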
CodePudding user response:
As long as the file is structured that way, with one item per line and no item in the main list that is itself a sub-list, you can just do a basic string replacement with sed. This is fragile, but relatively fast and memory-efficient, since sed is designed for streaming text.
Here is an example modifying "file.json" in-place:
sed -e 's/^\[//g' -e 's/, *$//g' -e 's/\]$//g' -i file.json
Then each line can be written to a separate file using a basic bash loop with read.
To process the input file without modifying it and write the target files, you can do this:
i=1
sed -e 's/^\[//g' -e 's/, *$//g' -e 's/\]$//g' file.json | while read -r line; do
    echo "[$line]" > "file$i"
    i=$((i+1))
done
For the example file, this creates two files: file1 and file2.