Convert Azure Computer Vision Read response to the Text.MergeSkill input format in Azure Cognitive Search


The raw Read response from Azure Computer Vision looks like this:

{
  "status": "succeeded",
  "createdDateTime": "2021-04-08T21:56:17.6819115 00:00",
  "lastUpdatedDateTime": "2021-04-08T21:56:18.4161316 00:00",
  "analyzeResult": {
    "version": "3.2",
    "readResults": [
      {
        "page": 1,
        "angle": 0,
        "width": 338,
        "height": 479,
        "unit": "pixel",
        "lines": [
          {
            "boundingBox": [
              25,
              14
            ],
            "text": "NOTHING",
            "appearance": {
              "style": {
                "name": "other",
                "confidence": 0.971
              }
            },
            "words": [
              {
                "boundingBox": [
                  27,
                  15
                ],
                "text": "NOTHING",
                "confidence": 0.994
              }
            ]
          }
        ]
      }
    ]
  }
}

Copied from the Computer Vision Read API documentation.

I want to create a custom skill in Azure Cognitive Search that does not use the built-in OcrSkill, but instead my own Azure Function that calls the Computer Vision client in code.

The problem is that, to pass input to the Text.MergeSkill:

{
      "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
      "description": "Create merged_text, which includes all the textual representation of each image inserted at the right location in the content field.",
      "context": "/document",
      "insertPreTag": " ",
      "insertPostTag": " ",
      "inputs": [
        {
          "name":"text",
          "source": "/document/content"
        },
        {
          "name": "itemsToInsert", 
          "source": "/document/normalized_images/*/text"
        },
        {
          "name":"offsets", 
          "source": "/document/normalized_images/*/contentOffset"
        }
      ],
      "outputs": [
        {
          "name": "mergedText", 
          "targetName" : "merged_text"
        }
      ]
    }

I need to convert the Read output into the form that the OcrSkill returns, so my custom skill's response must look like this:

{
  "text": "Hello World. -John",
  "layoutText":
  {
    "language" : "en",
    "text" : "Hello World.",
    "lines" : [
      {
        "boundingBox":
        [ {"x":10, "y":10}, {"x":50, "y":10}, {"x":50, "y":30},{"x":10, "y":30}],
        "text":"Hello World."
      }
    ],
    "words": [
      {
        "boundingBox": [ {"x":110, "y":10}, {"x":150, "y":10}, {"x":150, "y":30},{"x":110, "y":30}],
        "text":"Hello"
      },
      {
        "boundingBox": [ {"x":110, "y":10}, {"x":150, "y":10}, {"x":150, "y":30},{"x":110, "y":30}],
        "text":"World."
      }
    ]
  }
}

I copied it from the OcrSkill documentation.

My question is: how do I convert the boundingBox parameter from the Computer Vision Read endpoint into the form that the Text.MergeSkill accepts? Do we really need to do that, or can we pass the Read response to the Text.MergeSkill differently?

CodePudding user response:

The built-in OcrSkill calls the Cognitive Services Computer Vision Read API for certain languages, and it handles the merging of the text for you via the 'text' output. If at all possible, I would strongly suggest you use this skill instead of writing a custom one.

If you must write a custom skill and merge the output text yourself, note that per the MergeSkill documentation the 'text' and 'offsets' inputs are optional. That means you should be able to pass the text from the individual Read API output objects directly to the MergeSkill via the 'itemsToInsert' input, if all you need is a way to merge those outputs into one large text. Assuming you are still using the built-in Azure Search image extraction and your custom skill outputs the exact payload that the Read API returns (which you shared above), your skillset would look something like this (not tested, so no guarantees):

{
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
            "description": "Custom skill that calls Cognitive Services Computer Vision Read API",
            "uri": "<your custom skill uri>",
            "batchSize": 1,
            "context": "/document/normalized_images/*",
            "inputs": [
                {
                    "name": "image",
                    "source": "/document/normalized_images/*"
                }
            ],
            "outputs": [
                {
                    "name": "readAPIOutput"
                }
            ]
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
            "description": "Create merged_text, which includes all the textual representation of each image inserted at the right location in the content field.",
            "context": "/document",
            "insertPreTag": "",
            "insertPostTag": "\n",
            "inputs": [
                {
                    "name": "itemsToInsert",
                    "source": "/document/normalized_images/*/readAPIOutput/analyzeResult/readResults/*/lines/*/text"
                }
            ],
            "outputs": [
                {
                    "name": "mergedText",
                    "targetName": "merged_text"
                }
            ]
        }
    ]
}
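For reference, here is a minimal, untested sketch of what that custom skill could look like as a Python Azure Function, assuming the v1 Python programming model and the azure-cognitiveservices-vision-computervision SDK. The VISION_ENDPOINT and VISION_KEY setting names are placeholders of my own, not anything Azure requires:

import base64
import io
import json
import os
import time

import azure.functions as func
from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from msrest.authentication import CognitiveServicesCredentials

# Placeholder app-setting names; use whatever configuration scheme you prefer.
client = ComputerVisionClient(
    os.environ["VISION_ENDPOINT"],
    CognitiveServicesCredentials(os.environ["VISION_KEY"]))

def main(req: func.HttpRequest) -> func.HttpResponse:
    body = req.get_json()
    results = []
    for record in body["values"]:
        # "/document/normalized_images/*" arrives as a base64-encoded image.
        image_bytes = base64.b64decode(record["data"]["image"]["data"])
        read_response = client.read_in_stream(io.BytesIO(image_bytes), raw=True)
        operation_id = read_response.headers["Operation-Location"].split("/")[-1]

        # Read is asynchronous, so poll until the operation finishes.
        while True:
            read_result = client.get_read_result(operation_id)
            if read_result.status not in ("notStarted", "running"):
                break
            time.sleep(1)

        # serialize() keeps the wire-format (camelCase) field names, so the
        # skillset paths like analyzeResult/readResults/*/lines/*/text resolve.
        results.append({
            "recordId": record["recordId"],
            "data": {"readAPIOutput": read_result.serialize()}})

    return func.HttpResponse(
        json.dumps({"values": results}), mimetype="application/json")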

However, if you need to guarantee that the text appears in the correct order based on the bounding boxes, you will likely need to write a custom solution that calculates the positions and recombines the text yourself (see the sketch below); hence the suggestion to use the built-in OcrSkill if at all possible.
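To answer the original boundingBox question directly: the Read API returns each boundingBox as a flat array of eight numbers (x1, y1, ..., x4, y4 for the four corners), while the OcrSkill emits a list of {"x": ..., "y": ...} points. A rough, untested sketch of that conversion follows; read_result_to_layout_text and to_points are hypothetical helpers, and the top-to-bottom sort is only a naive approximation of reading order:

def to_points(flat_box):
    # [x1, y1, x2, y2, x3, y3, x4, y4] -> [{"x": x1, "y": y1}, ...]
    return [{"x": flat_box[i], "y": flat_box[i + 1]}
            for i in range(0, len(flat_box), 2)]

def read_result_to_layout_text(read_result, language="en"):
    # Naive reading order: sort lines by the top-left corner of their
    # bounding box, top-to-bottom and then left-to-right.
    lines = sorted(read_result["lines"],
                   key=lambda l: (l["boundingBox"][1], l["boundingBox"][0]))
    text = " ".join(line["text"] for line in lines)
    return {
        "text": text,
        "layoutText": {
            "language": language,
            "text": text,
            "lines": [{"boundingBox": to_points(line["boundingBox"]),
                       "text": line["text"]} for line in lines],
            "words": [{"boundingBox": to_points(word["boundingBox"]),
                       "text": word["text"]}
                      for line in lines for word in line["words"]]}}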
