Home > Back-end >  Parsing strange JSON-like format
Parsing strange JSON-like format

Time:12-25

I got some strange JSON-like format from Google Cloud OCR. It doesn't have quoted keys, colons, or commas.

text_annotations {
  description: ","
  bounding_poly {
    vertices {
      x: 485
      y: 237
    }
    vertices {
      x: 492
      y: 237
    }
    vertices {
      x: 492
      y: 266
    }
    vertices {
      x: 485
      y: 266
    }
  }
}

Is there any simple way to parse it, or format as JSON?

I've tried adding quotes, colons, and commas by hand, but it's not the best way.

Got these data using python code:

from google.cloud import vision
import io
client = vision.ImageAnnotatorClient()

feature = vision.Feature(
    type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)


with io.open(path, 'rb') as image_file:
    content = image_file.read()

image = vision.Image(content=content)

response = client.text_detection(image=image)
print(response)

CodePudding user response:

Finally i just did as people proposed in comments, and used API to print data in more useful format.

CodePudding user response:

OK, making a custom parser is probably overkill, but what about this potentially easy, quick&dirty hack: if the linebreaks and indentation etc. are stable, could you do a search replace on these (via a tool, because then can be automated and also perform multiple replacements) in the form of replacing "\n bounding_poly" with ",\n "bounding_poly": [", which adds the comma to the previous line, adds the colon to the key, adds quotes to the key, turns its contents into an array (you then too get rid of the "vertices" so they become anonymous objects inside the array as the array's elements). Please be aware of newline encoding differences (\n, \r\n, \r).

Ideally also recognize if one of these search strings matches zero times, or attempt to read the final result as JSON to check for error messages.

Not great, but maybe sufficient? Or has the source more variance or is it more dynamic/unstable?

  • Related