Processing and validating JSON containing duplicate keys

Time:12-20

I'm trying to validate "json" files that we receive, because the source code that generates them has issues that cannot be corrected without a major overhaul. Many of the objects in the "json" are invalid. The example below reuses the same key for port naming.

example invalid json file

[
  {"TimeStamp": "2021-11-28", "Address": { "port": "eth2 present", "port": "eth0 present",  "port": "eth1 present" }},
  {"TimeStamp": "2021-11-29", "CamStatus": 1},
  {"TimeStamp": "2021-11-30", "CamDone": 0}
]

What I am trying to do is first identify which of the rows are invalid. From there, I want to clean them up, if possible.

Using json.load(), I see odd behavior: the invalid json is parsed, but two of the key/value pairs are dropped. Is this expected? I would have expected a ValueError.

import json

with open(r"sample.json") as json_file:
    content = json.load(json_file)
content

Result

[{'TimeStamp': '2021-11-28', 'Address': {'port': 'eth1 present'}},
 {'TimeStamp': '2021-11-29', 'CamStatus': 1},
 {'TimeStamp': '2021-11-30', 'CamDone': 0}]

To identify corrupt rows, I wrote the below using json.loads(), but I am also getting unexpected behavior where the second object is being read as invalid.

with open("sample.json") as json_file:
    for line in json_file:
        try:
            a = json.loads(line)
            print('valid JSON', line)
        except json.JSONDecodeError:
            print('invalid JSON', line)

Output

invalid JSON [
invalid JSON {"TimeStamp": "2021-11-28", "Address": { "port": "eth2 present", "port": "eth0 present",  "port": "eth1 present" }},
invalid JSON {"TimeStamp": "2021-11-29", "CamStatus": 1},
valid JSON   {"TimeStamp": "2021-11-30", "CamDone": 0}
invalid JSON ]

What I am attempting to do is generate a structure like below:

[{'TimeStamp': '2021-11-28', 'Address': {'port0': 'eth1 present', 'port1': 'eth2 present', 'port2': 'eth3 present'}},
 {'TimeStamp': '2021-11-29', 'CamStatus': 1},
 {'TimeStamp': '2021-11-30', 'CamDone': 0}]

Any thoughts, modules, sample code that could help me out?

Answer:

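You can pass the JSON parser an argument, object_pairs_hook, that tells it to build each JSON object with something other than a standard dict. If you pass a multidict constructor in this argument, you end up with a structure that retains all of the information in the original JSON file, even where an object contains duplicate keys (which RFC 8259 discourages, since names within an object should be unique). By default, Python's json module does not raise an error for a repeated key. It keeps only the last value, which is why json.load() silently dropped two of the ports.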
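To make the mechanism concrete, here is a minimal sketch that uses only the standard library. The helper name keep_pairs is made up for this illustration. It returns the raw list of (key, value) pairs that the parser hands to the hook, so nothing is lost:

```python
import json

def keep_pairs(pairs):
    # 'pairs' is the list of (key, value) tuples for one JSON object,
    # in document order, including any repeated keys.
    return list(pairs)

doc = '{"port": "eth2 present", "port": "eth0 present"}'

# Default behavior: the last value for a repeated key wins.
print(json.loads(doc))
# {'port': 'eth0 present'}

# With the hook, every pair survives.
print(json.loads(doc, object_pairs_hook=keep_pairs))
# [('port', 'eth2 present'), ('port', 'eth0 present')]
```

A MultiDict constructor plays the same role as keep_pairs here, but gives you a mapping-like object instead of a plain list.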

Once you do this, you can query the data structure for inner MultiDict objects that contain the same key more than once, and flag the lines on which that occurs.

Here's a simple example that does that for data like what you show. I expect that it would need to be modified for a real world situation, but it demonstrates the basic idea of using the JSON parser in this way, and then flagging invalid key/value pair structures in the result:

data = """
[
  {"TimeStamp": "2021-11-29", "CamStatus": 1},
  {"TimeStamp": "2021-11-29", "CamStatus": 1},
  {"TimeStamp": "2021-11-28", "Address": { "port": "eth2 present", "port": "eth0 present",  "port": "eth1 present" }},
  {"TimeStamp": "2021-11-29", "CamStatus": 1},
  {"TimeStamp": "2021-11-28", "Address": { "port": "eth2 present", "port1": "eth0 present",  "port2": "eth1 present" }},
  {"TimeStamp": "2021-11-28", "Address": { "port": "eth2 present", "port": "eth0 present",  "port2": "eth1 present" }},
  {"TimeStamp": "2021-11-30", "CamDone": 0}
]
"""

import json
from pprint import pprint

from multidict import MultiDict  # third-party package: pip install multidict

r = json.loads(data, object_pairs_hook=MultiDict)

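for i, entry in enumerate(r):
    for key, value in entry.items():
        # Only nested objects (parsed as MultiDict) are checked for duplicates
        if isinstance(value, MultiDict):
            keys = set()
            for key2, value2 in value.items():
                if key2 in keys:
                    # Report the 1-based position of the entry in the array
                    print(f"Line {i + 1} is invalid")
                    break
                else:
                    keys.add(key2)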

Result:

Line 3 is invalid
Line 6 is invalid
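The loop above only inspects objects nested one level inside each entry. As a rough sketch of how the check might be generalized, the hook below records repeated key names at any depth using only the standard library. The names find_duplicate_keys and record_duplicates are hypothetical, chosen for this illustration, and it reports key names rather than entry positions:

```python
import json
from collections import Counter

def find_duplicate_keys(text):
    """Return the key names that are repeated within any single object."""
    duplicates = []

    def record_duplicates(pairs):
        # Called once per JSON object, innermost objects first.
        counts = Counter(key for key, _ in pairs)
        duplicates.extend(key for key, n in counts.items() if n > 1)
        # Fall back to the default "last value wins" dict.
        return dict(pairs)

    json.loads(text, object_pairs_hook=record_duplicates)
    return duplicates

sample = '[{"a": 1, "Address": {"port": "x", "port": "y"}}, {"c": 2}]'
print(find_duplicate_keys(sample))
# ['port']
```

An empty result means no object in the document repeated a key.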

You could walk the resulting structure and build up a new structure from it, dealing with the duplicate keys as you encounter them and thereby "fixing" the incoming data. You could either rename the keys to, in this case, port1, port2, etc., or you could collect the values into lists, like "port": ["eth0 present", "eth2 present"]. You could then write that structure out using json.dump to produce a valid JSON file.

Here's some code that takes the former approach to convert the read structure back into a structure consisting only of plain dict objects. Note that the underscore in the new key names is necessary to avoid clashing with existing names. This could be avoided, but the logic would have to be a bit more complex to allow for collision with existing key names with numbers at the end of them. The use of the '_' is good for the demo as it makes it clear which keys were renamed.

fixed = []
for entry in r:
    new_entry = {}
    for key, value in entry.items():
        if isinstance(value, MultiDict):
            new_entry[key] = {}
            keys = {}
            for key2, value2 in value.items():
                if key2 in keys:
                    keys[key2] += 1
                    next_key = key2 + '_' + str(keys[key2])
                else:
                    keys[key2] = 0
                    next_key = key2
                new_entry[key][next_key] = value2
        else:
            new_entry[key] = value
    fixed.append(new_entry)

pprint(fixed)

Result:

[{'CamStatus': 1, 'TimeStamp': '2021-11-29'},
 {'CamStatus': 1, 'TimeStamp': '2021-11-29'},
 {'Address': {'port': 'eth2 present',
              'port_1': 'eth0 present',
              'port_2': 'eth1 present'},
  'TimeStamp': '2021-11-28'},
 {'CamStatus': 1, 'TimeStamp': '2021-11-29'},
 {'Address': {'port': 'eth2 present',
              'port1': 'eth0 present',
              'port2': 'eth1 present'},
  'TimeStamp': '2021-11-28'},
 {'Address': {'port': 'eth2 present',
              'port2': 'eth1 present',
              'port_1': 'eth0 present'},
  'TimeStamp': '2021-11-28'},
 {'CamDone': 0, 'TimeStamp': '2021-11-30'}]

And here's code that uses the second approach:

fixed = []
for entry in r:
    new_entry = {}
    for key, value in entry.items():
        if isinstance(value, MultiDict):
            new_entry[key] = {}
            keys = {}
            for key2, value2 in value.items():
                if key2 in keys:
                    keys[key2].append(value2)
                    new_entry[key][key2] = keys[key2]
                else:
                    keys[key2] = [value2]
                    new_entry[key][key2] = value2
        else:
            new_entry[key] = value
    fixed.append(new_entry)

pprint(fixed)

Result:

[{'CamStatus': 1, 'TimeStamp': '2021-11-29'},
 {'CamStatus': 1, 'TimeStamp': '2021-11-29'},
 {'Address': {'port': ['eth2 present', 'eth0 present', 'eth1 present']},
  'TimeStamp': '2021-11-28'},
 {'CamStatus': 1, 'TimeStamp': '2021-11-29'},
 {'Address': {'port': 'eth2 present',
              'port1': 'eth0 present',
              'port2': 'eth1 present'},
  'TimeStamp': '2021-11-28'},
 {'Address': {'port': ['eth2 present', 'eth0 present'],
              'port2': 'eth1 present'},
  'TimeStamp': '2021-11-28'},
 {'CamDone': 0, 'TimeStamp': '2021-11-30'}]
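
As an alternative sketch, the renaming can also be done inside the hook itself, which handles any nesting depth and avoids the multidict dependency. The function name rename_duplicates is an assumption made for this example. Like the code above, it does not guard against a renamed key such as port_1 colliding with a key that already exists in the input:

```python
import json

def rename_duplicates(pairs):
    # Build a plain dict, appending _1, _2, ... to repeated keys.
    result = {}
    seen = {}
    for key, value in pairs:
        if key in seen:
            seen[key] += 1
            new_key = f"{key}_{seen[key]}"
        else:
            seen[key] = 0
            new_key = key
        result[new_key] = value
    return result

raw = ('[{"TimeStamp": "2021-11-28", "Address": {"port": "eth2 present", '
       '"port": "eth0 present", "port": "eth1 present"}}]')

cleaned = json.loads(raw, object_pairs_hook=rename_duplicates)

# The result contains only plain dicts, so it serializes to valid JSON.
print(json.dumps(cleaned, indent=2))
```

Because the hook returns ordinary dict objects, the cleaned structure can be passed straight to json.dump to write the repaired file.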