Fixing Invalid JSON with Regex

I have a large list (6,000+) of JSON objects, each containing a "tags" attribute that I need to parse using Python. Unfortunately, the formatting of each list of tags varies wildly, so I'm having to do a lot of cleanup to get them into a format I can actually use. I've been able to fix some of the data, but I'm struggling to come up with an effective regex pattern (or other solution) to fix tags that have the following structure:

[
  {
    "Value": "String",
    "Key": "ITEM_1"
  },
  {
    "Value": "String",
    "Key": "ITEM_2"
  },
  {
    "Value": {
      "First": "String",
      Second: String With Spaces,
      "Third": "String",
      Fourth: email@domain.com,
      "Fifth": String
    }
    "Key": "ITEM_3"
  }
]

The best I've been able to come up with so far is:

tags = re.sub(r'(\w+):', r'"\1":', tags)
tags = re.sub(r': (\w+)', r': "\1"', tags)

Which produces:

[
  {
    "Value": "String",
    "Key": "ITEM_1"
  },
  {
    "Value": "String",
    "Key": "ITEM_2"
  },
  {
    "Value": {
      "First": "String",
      "Second": "String" With Spaces,
      "Third": "String",
      "Fourth": "email"@domain.com,
      "Fifth": "String"
    }
    "Key": "ITEM_3"
  }
]

But that doesn't fix values that contain spaces or non-alphanumeric characters. What would be the best solution to this?

CodePudding user response：

Take a look at the hjson module. It can repair badly formatted JSON files, as simply as this:

$ cat bad.json 
[
  {
    "Value": "String",
    "Key": "ITEM_1"
  },
  {
    "Value": "String",
    "Key": "ITEM_2"
  },
  {
    "Value": {
      "First": "String",
      Second: String With Spaces,
      "Third": "String",
      Fourth: email@domain.com,
      "Fifth": String
    }
    "Key": "ITEM_3"
  }
]

$ python -m hjson.tool -j bad.json
[
  {
    "Value": "String",
    "Key": "ITEM_1"
  },
  {
    "Value": "String",
    "Key": "ITEM_2"
  },
  {
    "Value": {
      "First": "String",
      "Second": "String With Spaces,",
      "Third": "String",
      "Fourth": "email@domain.com,",
      "Fifth": "String"
    },
    "Key": "ITEM_3"
  }
]
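
One thing to watch in the output above: an unquoted Hjson string runs to the end of its line, so the trailing comma ends up inside the value ("String With Spaces,"). Below is a minimal sketch of one way to clean that up after parsing. The helper name strip_trailing_commas is hypothetical, the parsed list is a hand-written stand-in for what the parser would return, and the approach assumes that no value legitimately ends in a comma.

```python
def strip_trailing_commas(node):
    """Recursively remove one trailing comma from every string value."""
    if isinstance(node, dict):
        return {key: strip_trailing_commas(value) for key, value in node.items()}
    if isinstance(node, list):
        return [strip_trailing_commas(item) for item in node]
    if isinstance(node, str) and node.endswith(","):
        return node[:-1].rstrip()
    return node


# Hand-written stand-in for the parsed data shown above. With hjson
# installed, hjson.loads(text) should yield an equivalent structure
# (that call is an assumption and is not exercised here).
parsed = [
    {"Value": "String", "Key": "ITEM_1"},
    {
        "Value": {
            "First": "String",
            "Second": "String With Spaces,",
            "Fourth": "email@domain.com,",
            "Fifth": "String",
        },
        "Key": "ITEM_3",
    },
]

cleaned = strip_trailing_commas(parsed)
print(cleaned[1]["Value"]["Second"])  # String With Spaces
```

Keys and values without a trailing comma pass through unchanged, so the helper can be run over the whole parsed list.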

CodePudding user response：

It would have to be more fully tested, but it appears you could write

rgx = r'(?<!\")\b[^\":\r\n]+(?=:)|(?<=: )[^\",\r\n]*(?=,| *$)'

re.sub(rgx, r'"\g<0>"', tags, flags=re.MULTILINE)

The regular expression can be broken down as follows:

(?<!\")      # negative lookbehind asserts current position is not preceded
             # by a double-quote
\b           # match a word boundary
[^\":\r\n]+  # match 1 or more characters other than those indicated
(?=:)        # positive lookahead asserts the next character is ':'
|            # or 
(?<=: )      # positive lookbehind asserts current position is preceded by ': '
[^\",\r\n]*  # match zero or more characters other than those indicated
(?=,| *$)    # positive lookahead asserts current position is followed
             # by a comma or by zero or more spaces at the end of a line
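
For anyone who wants to try this end to end, here is a rough sketch run against a trimmed copy of the question's sample. Two things in it are my own assumptions and not part of the pattern above. First, with re.MULTILINE in effect the value branch also matches a bare '{' at the end of a line such as "Value": {, so the sketch excludes '{' and '[' from that branch. Second, the sample is missing a comma after the nested object's closing brace, which quoting alone does not repair, so a hypothetical add_missing_commas step inserts it. The helper names are illustrative only.

```python
import json
import re

# Trimmed copy of the question's sample (two objects instead of three).
BAD = '''[
  {
    "Value": "String",
    "Key": "ITEM_1"
  },
  {
    "Value": {
      "First": "String",
      Second: String With Spaces,
      "Third": "String",
      Fourth: email@domain.com,
      "Fifth": String
    }
    "Key": "ITEM_3"
  }
]'''

# Same two branches as the pattern above, except that '{' and '[' are
# excluded from the value branch (assumption, see the note above).
RGX = re.compile(
    r'(?<!")\b[^":\r\n]+(?=:)'          # unquoted key directly before ':'
    r'|(?<=: )[^",\r\n{\[]*(?=,| *$)',  # unquoted value after ': '
    re.MULTILINE,
)


def quote_bare_tokens(text):
    """Wrap every unquoted key or value in double quotes."""
    return RGX.sub(r'"\g<0>"', text)


def add_missing_commas(text):
    """Insert a comma when '}' or ']' is followed by a quoted key on the next line."""
    return re.sub(r'([}\]])(\s*\n\s*")', r'\1,\2', text)


fixed = add_missing_commas(quote_bare_tokens(BAD))
data = json.loads(fixed)
print(data[1]["Value"]["Fourth"])  # email@domain.com
```

This has only been checked against the sample shown; values that themselves contain a comma, a colon, or a double quote would need a different approach.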