Home > Enterprise >  JQ - Recursive dedupe & concatenate JSON objects based on a uniquely identifying field
JQ - Recursive dedupe & concatenate JSON objects based on a uniquely identifying field

Time:09-16

I'm trying to dedupe and merge multiple JSON objects together, based on an id field on every JSON object.

In the example below, the values of the "employee.id" and "hobbygroup.id" fields are used to determine whether to dedupe the duplicated value and remove it, or instead concatenate its separate value into another location that preserves the hierarchy.

For example:

{
   "employee":[
      {
         "employee.id":"11",
         "employee.name":"bob",
         "hobbygroup":{
            "hobbygroup.id":"1",
            "hobbygroup.name":"chess",
            "groupmeeting":{
               "groupmeeting.id":"a",
               "groupmeeting.name":"kickoff meeting"
            }
         }
      },
      { # new hobbygroup.id, so this hobbygroup gets concatenated with the hobbygroup "1" above
         "employee.id":"11",
         "employee.name":"bob",
         "hobbygroup":{
            "hobbygroup.id":"2",
            "hobbygroup.name":"boxing",
            "groupmeeting":{
               "groupmeeting.id":"a",
               "groupmeeting.name":"kickoff meeting"
            }
         }
      },
      { # same hobbygroup.id and employee.id as before, but new group meeting, gets concat with groupmeeting "a"
         "employee.id":"11",
         "employee.name":"bob",
         "hobbygroup":{
            "hobbygroup.id":"2",
            "hobbygroup.name":"boxing",
            "groupmeeting":{
               "groupmeeting.id":"b",
               "groupmeeting.name":"second meeting"
            }
         }
      },
      { # duplicate employee.id as well as hobbygroup.id, gets deduped/ignored
         "employee.id":"11",
         "employee.name":"bob",
         "hobbygroup":{
            "hobbygroup.id":"2",
            "hobbygroup.name":"boxing"
         }
      },
      { # new employee.id, gets concatenated with employee 11 into an array
         "employee.id":"12",
         "employee.name":"bill",
         "hobbygroup":{
            "hobbygroup.id":"1",
            "hobbygroup.name":"chess",
            "groupmeeting":{
               "groupmeeting.id":"a",
               "groupmeeting.name":"kickoff meeting"
            }
         }
      }
   ]
}

Should return:

{
   "employee":[
      {
         "employee.id":"11",
         "employee.name":"bob",
         "hobbygroup":[
            {
               "hobbygroup.id":"1",
               "hobbygroup.name":"chess",
               "groupmeeting":{
                  "groupmeeting.id":"a",
                  "groupmeeting.name":"kickoff meeting"
               }
            },
            {
               "hobbygroup.id":"2",
               "hobbygroup.name":"boxing",
               "groupmeeting":[
                  {
                     "groupmeeting.id":"a",
                     "groupmeeting.name":"kickoff meeting"
                  },
                  {
                     "groupmeeting.id":"b",
                     "groupmeeting.name":"second meeting"
                  }
               ]
            }
         ]
      },
      {
         "employee.id":"12",
         "employee.name":"bill",
         "hobbygroup":{
            "hobbygroup.id":"11",
            "hobbygroup.name":"chess",
            "groupmeeting":{
               "groupmeeting.id":"a",
               "groupmeeting.name":"kickoff meeting"
            }
         }
      }
   ]
}

Example Explanation

At the first level, if the value of "employee.id" matches between the objects, then:

  1. Any child fields in the object (employee.name) gets deduped/overwritten.
  2. Any child objects (hobbygroup) get evaluated to see if its corresponding id field (hobbygroup.id) matches.
    • If matching, then steps 1 and 2 are reapplied recursively to this object. (ie, merge any fields, and then look at any children objects of hobbygroup)
    • If not matching, then these 2 hobbygroup objects get concatenated into an array of hobbygroups under the parent object (employee), as "hobbygroups": [{chess...}, {boxing...}]

Potential Inputs

  • Input JSON objects can have an arbitrary number of nesting levels.
  • Each of these objects have a field representing the dedupe key (employee.id & hobbygroup.id in this case). The input keys can all be preemptively renamed to something consistent to make it easier to access, such as "key.id".
  • The solution needs to be generic since there's a recursive requirement; it shouldn't require knowing the names of the fields or how it's nested.

Current approach I'm trying

I've found a couple of threads for merging JSON objects such as jq: recursively merge objects and concatenate arrays, which creates a function similar to the recursively merging multiply operation (*), except that it instead concatenates whenever it runs across an array.

I think a similar approach could work here, along with an additional conditional check against the id (only concatenating if the ids don't match). However, I'm new to JQ and it looks like there may be a few ways to get to the same result, so I might be missing a more simpler declarative solution.

Thanks!

CodePudding user response:

I'm not sure I understand the requirements, especially those regarding recursion, but the following does produce the desired result for the given example (after trivial corrections), and does include a recursive call to the "dedup" filter, so if it does not meet your requirements, hopefully it will provide a suitable framework for a solution.

# input: an array
def coalesce(f):
  (map([f]) | add | unique) as $f
  | add
  | f = $f ;

def dedup:
  if type == "array" and length > 1
  then unique
  | if (.[0] | has("employee.id"))
    then group_by(.["employee.id"])
    | map( coalesce(.hobbygroup) )
      else .
      end
  else .
  end ;

def recursively(f):
  if type == "array" then f | map(recursively(f))
  else .
  end;

recursively( dedup )

CodePudding user response:

Thank you for the clarifications. Still not sure about some edge cases, but this should cover most general cases. It detects parent-child relationships of one level distance if the child object has the same field name with .id added to it, and applies the merger recursively onto the first relationship found (except for the hard-coded top level as it is an array-object, not an object-object relationship). It also removes array brackets if only one child object is left.

def merge_by(f):
  group_by(f) | map(
    [paths[1:3] | select("\(first).id" == last)] as [$p]
    | if $p then first   {($p[0]): (
        map(getpath($p[:1])) | merge_by(getpath($p[1:]))
        | if length == 1 then first else . end
      )} else .[] end // empty
  );
  
.employee |= merge_by(."employee.id")
{
  "employee": [
    {
      "employee.id": "11",
      "employee.name": "bob",
      "hobbygroup": [
        {
          "hobbygroup.id": "1",
          "hobbygroup.name": "chess",
          "groupmeeting": {
            "groupmeeting.id": "a",
            "groupmeeting.name": "kickoff meeting"
          }
        },
        {
          "hobbygroup.id": "2",
          "hobbygroup.name": "boxing",
          "groupmeeting": [
            {
              "groupmeeting.id": "a",
              "groupmeeting.name": "kickoff meeting"
            },
            {
              "groupmeeting.id": "b",
              "groupmeeting.name": "second meeting"
            }
          ]
        }
      ]
    },
    {
      "employee.id": "12",
      "employee.name": "bill",
      "hobbygroup": {
        "hobbygroup.id": "1",
        "hobbygroup.name": "chess",
        "groupmeeting": {
          "groupmeeting.id": "a",
          "groupmeeting.name": "kickoff meeting"
        }
      }
    }
  ]
}

Demo

  • Related