How can I extract diagonal watermark text from a PDF using PDFBox?

Following the rotationMagic option of the ExtractText example, I can now extract vertical and horizontal watermarks, but not diagonal ones. This is my code so far.

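import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.util.Matrix;

class AngleCollector extends PDFTextStripper {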
    private final Set<Integer> angles = new TreeSet<>();

    AngleCollector() throws IOException {}

    Set<Integer> getAngles() {
        return angles;
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        int angle = ExtractText.getAngle(text);
        // normalize to the range 0..359
        angle = (angle + 360) % 360;
        angles.add(angle);
    }
}

class FilteredTextStripper extends PDFTextStripper {
    FilteredTextStripper() throws IOException {
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        int angle = ExtractText.getAngle(text);
        if (angle == 0) {
            super.processTextPosition(text);
        }
    }
}

final class ExtractText {
    static int getAngle(TextPosition text) {
        //The Matrix containing the starting text position
        Matrix m = text.getTextMatrix().clone();
        m.concatenate(text.getFont().getFontMatrix());
        return (int) Math.round(Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));
    }

    private List<String> getAnnots(PDPage page) throws IOException {
        List<String> returnList = new ArrayList<>();
        for (PDAnnotation pdAnnot : page.getAnnotations()) {
                if(pdAnnot.getContents() != null && !pdAnnot.getContents().isEmpty()) {
                    returnList.add(pdAnnot.getContents());
                }
        }
        return returnList;
    }

    public void extractPages(int startPage, int endPage, PDFTextStripper stripper, PDDocument document, Writer output) {
        for (int p = startPage; p <= endPage; ++p) {
            stripper.setStartPage(p);
            stripper.setEndPage(p);
            try {

                PDPage page = document.getPage(p - 1);
                for (var annot : getAnnots(page)) {
                    output.write(annot);
                }

                int rotation = page.getRotation();
                page.setRotation(0);
                var angleCollector = new AngleCollector();
                angleCollector.setStartPage(p);
                angleCollector.setEndPage(p);
                angleCollector.writeText(document, output);

                for (int angle : angleCollector.getAngles()) {
                    // prepend a transformation

                    try (var cs = new PDPageContentStream(document, page,
                            PDPageContentStream.AppendMode.PREPEND, false)) {
                        cs.transform(Matrix.getRotateInstance(-Math.toRadians(angle), 0, 0));
                    }

                    stripper.writeText(document, output);

                    // remove prepended transformation
                    ((COSArray) page.getCOSObject().getItem(COSName.CONTENTS)).remove(0);
                }
                page.setRotation(rotation);

            } catch (IOException ex) {
                System.err.println("Failed to process page " + p + ": " + ex);
            }
        }
    }
}

public class pdfTest {
    private pdfTest() {
    }

    public static void main(String[] args) throws IOException {
        var pdfFile = "test-resources/pdf/pdf_sample_2.pdf";
        Writer output = new OutputStreamWriter(System.out, StandardCharsets.UTF_8);
        var etObj = new ExtractText();
        var rawDoc = PDDocument.load(new File(pdfFile));
        PDFTextStripper stripper = new FilteredTextStripper();

        if(rawDoc.getDocumentCatalog().getAcroForm() != null) {
            rawDoc.getDocumentCatalog().getAcroForm().flatten();
        }

        etObj.extractPages(1, rawDoc.getNumberOfPages(), stripper, rawDoc, output);
        output.flush();
    }
}

Edit 1: I am also unable to detect form (Acro, XFA) field contents via TextExtractor code with correct alignment. How can I do that?

I am attaching sample PDFs for reference: Sample PDF 1, Sample PDF 2.

I require the following using PDFBox:

Diagonal text detection (including watermarks).
Form fields extraction by maintaining Proper alignment.

CodePudding user response：

In your "question" you actually ask multiple distinct questions. I'll look into each of them. The answers will be less specific than you'd probably wish because your questions are based on assumptions that are not all true.

"How can I extract diagonal watermark text from a PDF using PDFBox?"

First of all, PDF text extraction works by inspecting the instructions in content streams of a page and contained XObjects, finding text drawing instructions therein, taking the coordinates and orientations and the string parameters thereof, mapping the strings to Unicode, and arranging the many individual Unicode strings by their coordinates and orientations in a single content string.

In the case of PDFBox, the PDFTextStripper as-is does this with limited support for orientation processing, but it can be extended to filter the text pieces by orientation for better orientation support, as shown in the ExtractText example with rotation magic activated.
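To make the orientation filtering more concrete, here is a minimal sketch of the angle computation used in the question's getAngle method, written without PDFBox so it can be run on its own. The class name TextAngle and the method angleDegrees are hypothetical; the two parameters stand for the values that Matrix.getShearY() and Matrix.getScaleY() would return for the combined text and font matrix.

```java
final class TextAngle {
    private TextAngle() {
    }

    /**
     * Returns the text direction in whole degrees, normalized to 0..359.
     * For a pure rotation by theta, shearY is sin(theta) and scaleY is cos(theta),
     * so atan2(shearY, scaleY) recovers theta.
     */
    static int angleDegrees(double shearY, double scaleY) {
        int angle = (int) Math.round(Math.toDegrees(Math.atan2(shearY, scaleY)));
        // Java's % can return negative values, so shift into the positive range
        return ((angle % 360) + 360) % 360;
    }
}
```

A diagonal run of text drawn with real text instructions would show up here as an angle such as 45. The stripper then only needs to counter-rotate the page by that angle and keep the pieces whose angle becomes 0.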

double_watermark.pdf

In the case of your double_watermark.pdf example PDF, though, the diagonal text "Top Secret" is not created using text drawing instructions but instead using path construction and painting instructions, as Tilman already remarked. (Actually, the paths here are all sequences of very short lines; no curves are used, which you can see at a high zoom factor.) Thus, PDF text extraction cannot extract this text.
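The difference between the two kinds of instructions can be illustrated with a small sketch that counts text-showing operators (Tj, TJ, ', ") and path construction operators (m, l, c, v, y, h, re) in an already decoded content stream. This is a deliberately simplified illustration: the class OperatorCensus is hypothetical, the tokenizer only splits on whitespace after removing simple string literals (nested parentheses, hex strings, inline images and comments are not handled), and a real tool should use a proper content stream parser instead.

```java
import java.util.Set;
import java.util.regex.Pattern;

final class OperatorCensus {
    // Operators that show text, per the PDF specification
    private static final Set<String> TEXT_SHOWING = Set.of("Tj", "TJ", "'", "\"");
    // Operators that construct paths (painting operators such as f or S are not counted)
    private static final Set<String> PATH_CONSTRUCTION = Set.of("m", "l", "c", "v", "y", "h", "re");
    // Simplification: a literal string without nested parentheses; backslash escapes allowed
    private static final Pattern LITERAL = Pattern.compile("\\((?:\\\\.|[^()\\\\])*\\)");

    private OperatorCensus() {
    }

    /** Returns {textShowingCount, pathConstructionCount} for a decoded content stream. */
    static int[] count(String contentStream) {
        // Remove string operands first so their contents are not mistaken for operators
        String withoutLiterals = LITERAL.matcher(contentStream).replaceAll(" ");
        int text = 0;
        int path = 0;
        for (String token : withoutLiterals.trim().split("\\s+")) {
            if (TEXT_SHOWING.contains(token)) {
                text++;
            } else if (PATH_CONSTRUCTION.contains(token)) {
                path++;
            }
        }
        return new int[] {text, path};
    }
}
```

A watermark drawn as outlines would yield many path operators and no text-showing operators for that part of the stream, which is why a text stripper never sees it.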

To answer your question

How can I extract diagonal watermark text from a PDF using PDFBox?

in this context, therefore: You cannot.

(You can of course use PDFBox as a PDF processing framework on top of which you collect the paths and try to match them to characters, but that would be a sizable project in itself. Or you can use PDFBox to render the pages as bitmaps and apply OCR to those bitmaps.)

"I am also unable to detect form (Acro, XFA) field contents via TextExtractor code with correct alignment. How can I do that?"

Form data in AcroForm or XFA form definitions are not part of the page content streams or of the XObject content streams referenced from them. Thus, they are not immediately subject to text extraction.

AcroForm forms

AcroForm form fields are abstract PDF data objects which may or may not have associated content streams for display. To include them in the content streams that text extraction operates on, you can first flatten the form.

Beware, PDF renderers do have certain freedoms when creating the visualization of a form field. Thus, text extraction order may be slightly different from what you expect.

XFA forms

XFA form definitions are a cuckoo's egg in PDF: They are XML streams which are not related to regular PDF objects; furthermore, XFA in PDFs was deprecated a number of years ago. Thus, most PDF libraries don't support XFA forms.

PDFBox only allows you to extract or replace the XFA XML stream. Thus, there is no immediate support for XFA form contents during text extraction.
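If the raw field values are enough, the extracted XML can be inspected with the JDK's own DOM parser. The following is a minimal sketch under several assumptions: the XFA bytes have already been obtained (for instance via PDAcroForm.getXFA() in PDFBox), the form stores its values under the usual xfa:data element in the namespace http://www.xfa.org/schema/xfa-data/1.0/, and field names are unique. The class XfaDataReader and its method are hypothetical, and the result carries no layout or alignment information.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

final class XfaDataReader {
    private static final String XFA_DATA_NS = "http://www.xfa.org/schema/xfa-data/1.0/";

    private XfaDataReader() {
    }

    /** Maps leaf element names below xfa:data to their text; later duplicates overwrite earlier ones. */
    static Map<String, String> leafValues(byte[] xfaXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations, since the XML comes from an untrusted PDF
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xfaXml));
            Map<String, String> values = new LinkedHashMap<>();
            NodeList dataNodes = doc.getElementsByTagNameNS(XFA_DATA_NS, "data");
            for (int i = 0; i < dataNodes.getLength(); i++) {
                collectLeaves(dataNodes.item(i), values, true);
            }
            return values;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("XFA stream is not parseable XML", e);
        }
    }

    private static void collectLeaves(Node node, Map<String, String> values, boolean isDataRoot) {
        boolean hasChildElement = false;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i) instanceof Element) {
                hasChildElement = true;
                collectLeaves(children.item(i), values, false);
            }
        }
        // Only elements without child elements are treated as field values
        if (!hasChildElement && !isDataRoot) {
            values.put(node.getLocalName(), node.getTextContent().trim());
        }
    }
}
```

This only reads the data packet; it does not interpret the XFA template, so it cannot tell where a value would appear on the page.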

CodePudding user response：

Form fields extraction by maintaining Proper alignment.

This is solved by calling setSortByPosition(true) on the PDFTextStripper.
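To illustrate what sorting by position means, here is a conceptual sketch that orders text fragments top to bottom and then left to right, instead of in the order in which they appear in the content stream. It is not PDFBox's actual implementation; the classes PositionSort and Fragment, the top-left coordinate origin, and the lineTolerance parameter are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class PositionSort {
    /** Hypothetical text fragment; the origin is top-left and y grows downward. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    private PositionSort() {
    }

    /** Joins fragments into lines; fragments within lineTolerance of a line's first y share that line. */
    static String inReadingOrder(List<Fragment> fragments, double lineTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y).thenComparingDouble(f -> f.x));
        StringBuilder out = new StringBuilder();
        List<Fragment> line = new ArrayList<>();
        for (Fragment f : sorted) {
            // Start a new line once the fragment is clearly below the current line
            if (!line.isEmpty() && f.y - line.get(0).y > lineTolerance) {
                appendLine(out, line);
                line.clear();
            }
            line.add(f);
        }
        appendLine(out, line);
        return out.toString();
    }

    private static void appendLine(StringBuilder out, List<Fragment> line) {
        if (line.isEmpty()) {
            return;
        }
        // Within a line, order strictly left to right
        line.sort(Comparator.comparingDouble(f -> f.x));
        if (out.length() > 0) {
            out.append('\n');
        }
        for (int i = 0; i < line.size(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(line.get(i).text);
        }
    }
}
```

This matters for flattened forms because the field values are typically appended to the page content after the labels, so without positional sorting a value can end up far from its label in the extracted text.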