I'm trying to extract form data with Textract. I tested a PDF in the demo and the results were great. The results from the SDK, however, are far from optimal; in fact, they are completely inaccurate. If I use StartDocumentAnalysisRequest/StartDocumentAnalysisResult (asynchronous), I get only one block back, of type PAGE, and never a KEY_VALUE_SET. If I convert my PDF to an image and use the synchronous methods, I do get KEY_VALUE_SET blocks back, but the results are completely inaccurate.
Does anyone know how I can utilize the asynchronous analysis functionality to retrieve form values as the documentation indicates?
Sample Code below:
StartDocumentAnalysisResult startDocumentAnalysisResult = amazonTextract.startDocumentAnalysis(req);
String startJobId = startDocumentAnalysisResult.getJobId();
GetDocumentAnalysisResult documentAnalysisResult = null;
String jobStatus = "IN_PROGRESS";
// Poll every 10 seconds until the job leaves the IN_PROGRESS state.
while (jobStatus.equals("IN_PROGRESS")) {
    try {
        TimeUnit.SECONDS.sleep(10);
        GetDocumentAnalysisRequest documentAnalysisRequest = new GetDocumentAnalysisRequest()
                .withJobId(startJobId)
                .withMaxResults(1);
        documentAnalysisResult = amazonTextract.getDocumentAnalysis(documentAnalysisRequest);
        jobStatus = documentAnalysisResult.getJobStatus();
    } catch (Exception e) {
        logger.error(e);
    }
}
if (!jobStatus.equals("IN_PROGRESS")) {
    List<Block> blocks = documentAnalysisResult.getBlocks();
    logger.error("block list size " + blocks.size());
    Map<String, Map<String, Block>> keyValueBlockMap = new HashMap<>();
    Map<String, Block> keyMap = new HashMap<>();
    Map<String, Block> valueMap = new HashMap<>();
    Map<String, Block> blockMap = new HashMap<>();
    // Index every block by id, and split KEY_VALUE_SET blocks into keys and values.
    for (Block block : blocks) {
        logger.error("Block Type:" + block.getBlockType());
        String blockId = block.getId();
        blockMap.put(blockId, block);
        if (block.getBlockType().equals("KEY_VALUE_SET")) {
            if (block.getEntityTypes().contains("KEY")) {
                keyMap.put(blockId, block);
            } else {
                valueMap.put(blockId, block);
            }
        }
    }
    keyValueBlockMap.put("keyMap", keyMap);
    keyValueBlockMap.put("valueMap", valueMap);
    keyValueBlockMap.put("blockMap", blockMap);
    Map<String, String> keyValueRelationShip = getKeyValueRelationShip(keyValueBlockMap);
    for (String key : keyValueRelationShip.keySet()) {
        logger.error("Key: " + key);
        logger.error("Value: " + keyValueRelationShip.get(key));
    }
}
The synchronous path, which produces completely inaccurate results:
AnalyzeDocumentRequest request = new AnalyzeDocumentRequest()
        .withFeatureTypes(FeatureType.FORMS)
        .withDocument(new Document()
                .withS3Object(new com.amazonaws.services.textract.model.S3Object()
                        .withName(objectName)
                        .withBucket(awsHelper.getS3BucketName())));
AnalyzeDocumentResult result = amazonTextract.analyzeDocument(request);
CodePudding user response:
You are using an old version of the AWS SDK for Java, not the recommended one (V2).
You can find Textract V2 examples in the repo linked above.
I am able to get the lines and their corresponding text by using software.amazon.awssdk.services.textract.TextractClient.
For example, when I debug through the code using the same PNG that I used in the console, I get the proper result.
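Separately from the SDK version, one thing that may be worth checking in the asynchronous code is the withMaxResults(1) call. MaxResults caps how many blocks a single GetDocumentAnalysis call returns, which would be consistent with seeing exactly one PAGE block; the remaining blocks have to be fetched by following the NextToken value from each response. The sketch below shows only that pagination loop. Page, fetchAll, and the fake one-block-per-page service are hypothetical stand-ins so the example runs without the AWS SDK; in the question's code the equivalent calls would presumably be withNextToken(...) on the request and getNextToken() on the result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

public class PagedFetchSketch {

    /** Hypothetical stand-in for one page of a GetDocumentAnalysis response. */
    public static final class Page {
        final List<String> blocks;   // block types on this page
        final String nextToken;      // null when there are no more pages

        public Page(List<String> blocks, String nextToken) {
            this.blocks = blocks;
            this.nextToken = nextToken;
        }
    }

    /**
     * Calls fetchPage repeatedly, passing the token from the previous page
     * (null for the first call), and collects the blocks from every page.
     */
    public static List<String> fetchAll(Function<String, Page> fetchPage) {
        List<String> all = new ArrayList<>();
        String token = null;
        do {
            Page page = fetchPage.apply(token);
            all.addAll(page.blocks);
            token = page.nextToken;
        } while (token != null);
        return all;
    }

    public static void main(String[] args) {
        // Fake service that returns one block per page, like withMaxResults(1).
        List<String> fake = Arrays.asList("PAGE", "LINE", "KEY_VALUE_SET");
        Function<String, Page> oneBlockPerPage = token -> {
            int i = (token == null) ? 0 : Integer.parseInt(token);
            String next = (i + 1 < fake.size()) ? String.valueOf(i + 1) : null;
            return new Page(Collections.singletonList(fake.get(i)), next);
        };

        // Reading only the first page yields a single block.
        System.out.println(oneBlockPerPage.apply(null).blocks);
        // Following the tokens yields all of them.
        System.out.println(fetchAll(oneBlockPerPage));
    }
}
```

With a loop of this shape, the key/value maps would be built from the combined block list rather than from the first response only.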