How to read a 100 GB file with jq that won't make me run out of memory


I have a 100 GB JSON file, and when I try to read it with jq my computer keeps running out of RAM. Is there a way to read the file while limiting the memory usage, or some other way to read a VERY huge JSON file?

The command I typed: jq 'keys' fileName.json

CodePudding user response:

I posted a related question here: Difference between slurp, null input, and inputs filter

If your file is huge but the documents inside it aren't that big (just many, many smaller ones), jq -n 'inputs' could get you started:

jq -n 'inputs | keys'

Here's an example (with a small file):

$ cat <<JSON | jq -n 'inputs | keys'
{"foo": 21, "bar": "less interesting data"}
{"foo": 42, "bar": "more interesting data"}
JSON
[
  "bar",
  "foo"
]
[
  "bar",
  "foo"
]

This approach will not work if the file contains a single top-level object that is gigabytes in size or has millions of keys.
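Even then, jq's streaming parser (discussed in more detail in the next answer) can recover the top-level keys without loading the whole document, because each streamed event carries the path to a single leaf. A minimal sketch, assuming the top level is a JSON object (with a top-level array you would get indices instead); it still reads every byte, but in bounded memory:

# Emit the first path component of every streamed event, then dedupe.
jq -nc --stream 'inputs | .[0][0]' fileName.json | sort -u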

CodePudding user response:

jq's streaming parser (invoked using the --stream option) can generally handle very, very large files (and even arbitrarily large files provided certain conditions are met), but it is typically very slow and often quite cumbersome.
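To illustrate, if the file consists of one gigantic top-level array, the standard fromstream/truncate_stream idiom re-emits each element as a separate document, so downstream filters only ever see one element at a time (a sketch; huge.json is a placeholder name):

# Strip one level from each streamed path and reassemble the elements.
jq -nc --stream 'fromstream(1 | truncate_stream(inputs))' huge.json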

In practice, I find that tools such as jstream and/or my own jm work very nicely in conjunction with jq when dealing with ginormous files. When used this way, they are both very easy to use, though installation is potentially a bit of a hassle.

Unfortunately, if you know nothing at all about the contents of a JSON file except that jq empty takes too long or fails, then there is no CLI tool that I know of that can produce a useful schema automagically. However, looking at the first few bytes of the file will usually provide enough information to get going. Or you could start with jm count to count the top-level objects, and go from there. jm -s | jq 'keys[]' will give you the list of top-level keys if the top level is a JSON object.
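A quick first look costs almost nothing (a sketch; the byte count is arbitrary):

# Peek at the start of the file to see whether the top level is
# an object, an array, or a stream of documents.
head -c 300 ginormous.json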


Here's an example. Suppose we have determined that the large size of the file ginormous.json is primarily because it consists of a very long top-level array. Then assuming that schema.jq (already mentioned elsewhere on this page) is in the pwd, you have some hope of finding an informative schema by running:

jm ginormous.json |
  jq -n 'include "schema" {search: "."}; schema(inputs)'

CodePudding user response:

One generic way to determine the structure of a very large file containing a single JSON entity would be to run the following query:

jq -nc --stream -f structural-paths.jq huge.json | sort -u

where structural-paths.jq contains:

inputs
| select(length == 2)
| .[0]
| map( if type == "number" then 0 else . end )

Note that the '0's in the output signify that there is at least one valid array index at the corresponding position, not that '0' is actually a valid index at that position.

Note also that for very large files, using jq --stream to process the entire file could be quite slow.

Example:

Given {"a": {"b": [0,1, {"c":2}]}}, the result of the above incantation would be:

["a","b",0,"c"]
["a","b",0]

Top-level structure

If you just want more information about the top-level structure, you could simplify the above jq program to:

inputs | select(length==1)[0][0] | if type == "number" then 0 else . end
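A complete invocation along the same lines as before would be (a sketch):

jq -nc --stream 'inputs
  | select(length == 1)[0][0]
  | if type == "number" then 0 else . end' huge.json | sort -u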

Structure to a given depth

If the command-line sort fails, then you might want to limit the number of paths by considering them only to a certain depth. You might also wish to consider using the command-line uniq instead of sort, especially if the order of keys is of interest. Yet another alternative would be to dispense with the postprocessing using the command-line sort or uniq, and use a jq-defined uniq, as in the following:

def uniq(s):
  foreach s as $x (null;
    if . and $x == .[0] then .[1] = false
    else [$x, true]
    end;
    if .[1] then .[0] else empty end);

def spaths($depth):
  inputs
  | select(length==1)[0][0:$depth]
  | map(if type == "number" then 0 else . end);

uniq(spaths($depth))

A suitable invocation (with the above program saved as structural-paths.jq) would then look like:

jq -nc --argjson depth 3 --stream -f structural-paths.jq huge.json
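With the earlier example {"a": {"b": [0,1, {"c":2}]}} and a depth of 2, this would produce:

["a","b"]
["a"]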

"JSON Pointer" pointers

If you want to convert array path expressions to "JSON Pointer" strings (e.g. for use with jm or jstream), simply append the following to the relevant jq program:

| "/"   join("/")