Home > other >  Explode Complex JSON Object (with Objects and Arrays) with jq
Explode Complex JSON Object (with Objects and Arrays) with jq

Time:10-21

We are receiving data from a customer and it is highly nested (the customer has done this to reduce the amount of duplicated data, thereby reducing the overall size of data transferred).

The data looks like this:

{
    "r1": "ex",
    "r2": "of",
    "r3": "da",
    "ver": 1,
    "noa1": [{
        "col1": 380,
        "noa2": [{
            "aon": [123],
            "obj": {
                "col10": "stuf",
                "col11": "here",
                "aos": ["A",
                    "X"
                ]
            }
        }, {
            "aon": [],
            "obj": {
                "col10": "more",
                "col11": "stuf"
            }
        }, {
            "aon": [456, 789]
        }, {
            "obj": {
                "col10": "anon"
            }
        }]
    }, {
        "col1": 5676,
        "noa2": [{
            "aon": [875],
            "obj": {
                "col10": "does",
                "col11": "noth",
                "aos": ["Y",
                    "Z"
                ]
            }
        }, {
            "obj": {
                "col11": "phew"
            }
        }]
    }]
}

where there are a number of optional nested objects and arrays (around 6 levels deep IRL, but simplified here).

We have to flatten this to a CSV, where each row would be unique containing the data of the JSON object. We would not want any "new rows" created from a cross product join. So the above would look like:

r1|r2|r3|ver|col1|aon|col10|col11|aos
ex|of|da|1  |380 |123|stuf |here |A
ex|of|da|1  |380 |123|stuf |here |X
ex|of|da|1  |380 |   |more |stuf |
ex|of|da|1  |380 |456|     |     |
ex|of|da|1  |380 |789|     |     |
ex|of|da|1  |380 |   | anon|     |
ex|of|da|1  |5676|875|does |noth |Y
ex|of|da|1  |5676|875|does |noth |Z
ex|of|da|1  |5676|   |     |phew |

For brevity, we show this as a table. In reality, we would want this to be a collection of JSON documents (jq -c).

We have been working with jq, but are getting stuck with working with nested arrays that are not objects, like an array of strings. It is causing a cross join to occur, putting values in places that do not exist in the original JSON. We also have trouble with optional nested rows (jq is currently not printing the row if an array is null, even with the ? modifier.

Questions are:

  1. What would be the jq filter for the original JSON?
  2. In general, how would one handle nested objects, arrays, and arrays of string, numbers, etc within jq?

Updated: The following code

.r1 as $r1 | .r2 as $r2 | .r3 as $r3 | .ver as $ver | .noa1[] | . as $noa1 | .noa2[] | . as $noa2 | {$r1, $r2, $r3, $ver, "col1": $noa1.col1, "aon": $noa2.aon }

returns

{"r1":"ex","r2":"of","r3":"da","ver":1,"col1":380,"aon":[123]}
{"r1":"ex","r2":"of","r3":"da","ver":1,"col1":380,"aon":[]}
{"r1":"ex","r2":"of","r3":"da","ver":1,"col1":380,"aon":[456,789]}
{"r1":"ex","r2":"of","r3":"da","ver":1,"col1":380,"aon":null}
{"r1":"ex","r2":"of","r3":"da","ver":1,"col1":5676,"aon":[875]}
{"r1":"ex","r2":"of","r3":"da","ver":1,"col1":5676,"aon":null}

which makes sense. But now when we try to go further and explode the "aon" array of numbers, it gets weird. This code

.r1 as $r1 | .r2 as $r2 | .r3 as $r3 | .ver as $ver | .noa1[] | . as $noa1 | .noa2[] | . as $noa2 | 
{$r1, $r2, $r3, $ver, "col1": $noa1.col1, "aon": $noa2.aon[]? }

loses the rows that have a null value for "aon", which I can understand, kind of (because of the ?). But this code fails

.r1 as $r1 | .r2 as $r2 | .r3 as $r3 | .ver as $ver | .noa1[] | . as $noa1 | .noa2[] | . as $noa2 | 
{$r1, $r2, $r3, $ver, "col1": $noa1.col1, "aon": $noa2.aon[] }

which I also understand, because of the null values. So we try to reference it as a variable:

.r1 as $r1 
| .r2 as $r2 
| .r3 as $r3 
| .ver as $ver 
| .noa1[] | . as $noa1 
| .noa2[] | . as $noa2
| .obj as $obj
| .aon[]? | . as $aon
| {$r1, $r2, $r3, $ver, "col1": $noa1.col1, "aon": $aon, "col10": $obj.col10, "col11": $obj.col11 }

and get this far. But a couple of issues: we've lost some rows where aon is null, and we can't figure out how to access "aos".

CodePudding user response:

I'm not sure whether you want the "aos" field to be present in every generated object. If not, you could do worse than:

jq -c '
  .r1 as $r1
  | .r2 as $r2
  | .r3 as $r3
  | .ver as $ver
  | .noa1[] 
  | .col1 as $col1
  | .noa2[] 
  | .aon as $aon
  | .obj
  | {$r1, $r2, $ver, $col1, $aon, col10, col11} 
      try {aos: .aos[]} catch {} '

If you do, then you could replace the very last {} by {aos: null} or whatever. If you prefer not to use try ... catch then you could replace it by an if statement, or perhaps by:

( (.aos // [null]) | {aos: .[]} )

The key to preserving sanity here is, I think, to avoid creating jq variables except as needed as you "drill down". Also, jq's support for abbreviations such as {$x} and {y} makes things easier to read and review.

  • Related