Textract Form Analysis, Java SDK 1.x

I'm trying to extract form data with Amazon Textract. I tested a PDF in the demo and the results are great. The results from the SDK, however, are far from optimal; in fact, they are completely inaccurate. If I use StartDocumentAnalysisRequest/StartDocumentAnalysisResult (asynchronous), I get only one block back, of type PAGE, and never a KEY_VALUE_SET. If I convert my PDF to an image and use the synchronous methods, I do get KEY_VALUE_SET blocks back, but the results are completely inaccurate.

Does anyone know how I can use the asynchronous analysis functionality to retrieve form values as the documentation indicates?

Sample Code below:


        StartDocumentAnalysisResult startDocumentAnalysisResult = amazonTextract.startDocumentAnalysis(req);
        String startJobId = startDocumentAnalysisResult.getJobId();

        GetDocumentAnalysisResult documentAnalysisResult = null;

        String jobStatus = "IN_PROGRESS";

        while (jobStatus.equals("IN_PROGRESS")) {
            try {
                TimeUnit.SECONDS.sleep(10);
                GetDocumentAnalysisRequest documentAnalysisRequest = new GetDocumentAnalysisRequest()
                        .withJobId(startJobId)
                        .withMaxResults(1);

                documentAnalysisResult = amazonTextract.getDocumentAnalysis(documentAnalysisRequest);
                jobStatus = documentAnalysisResult.getJobStatus();
            } catch (Exception e) {
                logger.error(e);
            }
        }

        if (!jobStatus.equals("IN_PROGRESS")) {
            List<Block> blocks = documentAnalysisResult.getBlocks();
            logger.error("block list size " + blocks.size());

            Map<String, Map<String, Block>> keyValueBlockMap = new HashMap<>();
            Map<String, Block> keyMap = new HashMap<>();
            Map<String, Block> valueMap = new HashMap<>();
            Map<String, Block> blockMap = new HashMap<>();

            for (Block block : blocks) {
                logger.error("Block Type:" + block.getBlockType());
                String blockId = block.getId();
                blockMap.put(blockId, block);
                if (block.getBlockType().equals("KEY_VALUE_SET")) {
                    if (block.getEntityTypes().contains("KEY")) {
                        keyMap.put(blockId, block);
                    } else {
                        valueMap.put(blockId, block);
                    }
                }
            }
            keyValueBlockMap.put("keyMap", keyMap);
            keyValueBlockMap.put("valueMap", valueMap);
            keyValueBlockMap.put("blockMap", blockMap);

            Map<String, String> keyValueRelationShip = getKeyValueRelationShip(keyValueBlockMap);
            for (String key : keyValueRelationShip.keySet()) {
                logger.error("Key: " + key);
                logger.error("Value: " + keyValueRelationShip.get(key));
            }
        }
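
The getKeyValueRelationShip helper called above is not shown. For reference, here is a minimal sketch of what such a helper usually does, assuming the standard Textract block layout: a KEY block carries a VALUE relationship pointing at its value block and a CHILD relationship pointing at the WORD blocks that spell the key, and a VALUE block carries a CHILD relationship pointing at the words of the value. SimpleBlock and KeyValueSketch are hypothetical stand-ins so the example runs without the SDK; they are not the SDK's Block class, and selection elements (checkboxes) are ignored.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the SDK's Block (not the real class).
final class SimpleBlock {
    final String id;
    final String blockType;   // "WORD", "KEY_VALUE_SET", ...
    final String entityType;  // "KEY", "VALUE", or null for other block types
    final String text;        // only set for WORD blocks
    final Map<String, List<String>> relationships; // relationship type -> target block ids

    SimpleBlock(String id, String blockType, String entityType, String text,
                Map<String, List<String>> relationships) {
        this.id = id;
        this.blockType = blockType;
        this.entityType = entityType;
        this.text = text;
        this.relationships = relationships;
    }
}

final class KeyValueSketch {
    // Joins the text of all WORD blocks referenced by the block's CHILD relationship.
    static String textOf(SimpleBlock block, Map<String, SimpleBlock> blockMap) {
        StringBuilder sb = new StringBuilder();
        for (String childId : block.relationships.getOrDefault("CHILD", Collections.emptyList())) {
            SimpleBlock child = blockMap.get(childId);
            if (child != null && "WORD".equals(child.blockType)) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(child.text);
            }
        }
        return sb.toString();
    }

    // For every KEY block, follows its VALUE relationship and pairs key text with value text.
    static Map<String, String> keyValues(List<SimpleBlock> blocks) {
        Map<String, SimpleBlock> blockMap = new HashMap<>();
        for (SimpleBlock b : blocks) blockMap.put(b.id, b);

        Map<String, String> result = new LinkedHashMap<>();
        for (SimpleBlock b : blocks) {
            if (!"KEY_VALUE_SET".equals(b.blockType) || !"KEY".equals(b.entityType)) continue;
            StringBuilder value = new StringBuilder();
            for (String valueId : b.relationships.getOrDefault("VALUE", Collections.emptyList())) {
                SimpleBlock v = blockMap.get(valueId);
                if (v == null) continue;
                if (value.length() > 0) value.append(' ');
                value.append(textOf(v, blockMap));
            }
            result.put(textOf(b, blockMap), value.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> none = Collections.emptyMap();
        Map<String, List<String>> keyRels = new HashMap<>();
        keyRels.put("VALUE", Arrays.asList("v"));
        keyRels.put("CHILD", Arrays.asList("w1"));
        List<SimpleBlock> blocks = Arrays.asList(
                new SimpleBlock("w1", "WORD", null, "Name:", none),
                new SimpleBlock("w2", "WORD", null, "Jane", none),
                new SimpleBlock("w3", "WORD", null, "Doe", none),
                new SimpleBlock("k", "KEY_VALUE_SET", "KEY", null, keyRels),
                new SimpleBlock("v", "KEY_VALUE_SET", "VALUE", null,
                        Collections.singletonMap("CHILD", Arrays.asList("w2", "w3"))));
        System.out.println(keyValues(blocks));
    }
}
```

The lookup only works when the block map contains every block of the document, because keys, values, and words reference each other by id. A partial block list yields keys with empty values or no keys at all.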

Synchronous path, which produces completely inaccurate results:

        AnalyzeDocumentRequest request = new AnalyzeDocumentRequest()
                .withFeatureTypes(FeatureType.FORMS)
                .withDocument(new Document()
                        .withS3Object(new com.amazonaws.services.textract.model.S3Object()
                                .withName(objectName)
                                .withBucket(awsHelper.getS3BucketName())));

        AnalyzeDocumentResult result = amazonTextract.analyzeDocument(request);

CodePudding user response：

You are using version 1.x of the AWS SDK for Java, which is an old version. The recommended version is the AWS SDK for Java 2.x.

I have tested the

You can find Textract V2 examples in the repo linked above.

I am able to get the lines and their corresponding text by using software.amazon.awssdk.services.textract.TextractClient.

For example, when I debug through the code using the same PNG that I used in the console, I get the proper result.
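
Separately from the SDK version, one detail in the asynchronous code from the question is worth checking. The request is built with withMaxResults(1), and MaxResults is the maximum number of blocks returned per paginated call. With a limit of 1, each call returns a single block, which would explain seeing only one PAGE block. GetDocumentAnalysis results are paginated, so the full block list has to be collected by passing the token from getNextToken() back through withNextToken(...) until no token is returned. Below is a minimal sketch of that loop. ResultPage, PaginationSketch, and onePerCall are hypothetical stand-ins that simulate a paginated service so the example runs without the SDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for one page of a GetDocumentAnalysis response.
final class ResultPage {
    final List<String> blocks;  // block types on this page, e.g. "PAGE", "KEY_VALUE_SET"
    final String nextToken;     // null when there are no more pages

    ResultPage(List<String> blocks, String nextToken) {
        this.blocks = blocks;
        this.nextToken = nextToken;
    }
}

final class PaginationSketch {
    // Collects the blocks of every page by following nextToken until it is null.
    // fetchPage receives the token to send (null for the first call); with the real
    // SDK this is where withNextToken(token) and getNextToken() would be used.
    static List<String> collectAllBlocks(Function<String, ResultPage> fetchPage) {
        List<String> all = new ArrayList<>();
        String token = null;
        do {
            ResultPage page = fetchPage.apply(token);
            all.addAll(page.blocks);
            token = page.nextToken;
        } while (token != null);
        return all;
    }

    // Simulated service that returns exactly one block per call, like withMaxResults(1).
    // Assumes a non-empty list; the token is simply the index of the next block.
    static Function<String, ResultPage> onePerCall(List<String> stored) {
        return token -> {
            int i = (token == null) ? 0 : Integer.parseInt(token);
            String next = (i + 1 < stored.size()) ? String.valueOf(i + 1) : null;
            return new ResultPage(Collections.singletonList(stored.get(i)), next);
        };
    }

    public static void main(String[] args) {
        List<String> stored = Arrays.asList("PAGE", "LINE", "KEY_VALUE_SET", "KEY_VALUE_SET");
        Function<String, ResultPage> service = onePerCall(stored);
        System.out.println("first call only: " + service.apply(null).blocks);
        System.out.println("all pages: " + collectAllBlocks(service));
    }
}
```

In the question's code this would mean polling until the job status is no longer IN_PROGRESS, then running a second loop that accumulates getBlocks() from each page, and dropping withMaxResults(1) or raising it so that fewer calls are needed.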