Convert PDF to CSV or EXCEL

I am trying to convert a PDF file to CSV or Excel format.

Here is the code I use to convert to CSV format:

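import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfPage;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;

import java.io.FileWriter;

public void convert() throws Exception {
    PdfReader pdfReader = new PdfReader("example.pdf");
    PdfDocument pdf = new PdfDocument(pdfReader);

    int pages = pdf.getNumberOfPages();

    FileWriter csvWriter = new FileWriter("student.csv");

    // Page numbers are 1-based
    for (int i = 1; i <= pages; i++) {
        PdfPage page = pdf.getPage(i);
        String content = PdfTextExtractor.getTextFromPage(page);

        String[] splitContents = content.split("\n");

        // The first line of every page is the title, so skip it
        boolean isTitle = true;

        for (int j = 0; j < splitContents.length; j++) {
            if (isTitle) {
                isTitle = false;
                continue;
            }

            // Write each extracted text line as one CSV line
            csvWriter.append(splitContents[j].replaceAll(" ", " "));
            csvWriter.append("\n");
        }
    }

    csvWriter.flush();
    csvWriter.close();
    pdf.close();
}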

This code runs, but the CSV output collapses each row without taking the existing columns into account (some of them are empty), so I would like to convert the PDF file to Excel format instead. The PDF file itself is laid out as a table. To show what I mean about the lost columns, take a table row like this in the PDF file:

|   name   |    some data   |            |             |    some data 1    |              |
 ---------- ---------------- ------------ ------------- ------------------- --------------

After converting to a CSV file, the line looks like this:

name some data some data 1

How can I get output that keeps the same columns as the PDF table?

CodePudding user response：

I'd suggest using PDFBox, as in "Parsing PDF files (especially with tables) with PDFBox", or another library that lets you inspect the table data point by point and build a table from the column widths (something like Table table = page.getTable(dividers);).
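To make the "table by column width" idea concrete, here is a minimal sketch in plain Java. It assumes you have already obtained each piece of text together with the x coordinate where it starts (from PDFBox or any other library; that extraction step is not shown). The names ColumnAssigner, Fragment, toCsvRow and the divider values are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ColumnAssigner {

    /** A piece of text and the x coordinate where it starts (hypothetical type). */
    static final class Fragment {
        final double x;
        final String text;

        Fragment(double x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    /**
     * Builds one CSV line. dividers holds the x coordinates of the inner
     * column boundaries in ascending order, so n dividers mean n + 1 columns.
     * Columns that receive no fragment stay empty.
     */
    static String toCsvRow(List<Fragment> fragments, double[] dividers) {
        String[] cells = new String[dividers.length + 1];
        Arrays.fill(cells, "");
        for (Fragment f : fragments) {
            // Move right until the fragment starts before the next divider
            int col = 0;
            while (col < dividers.length && f.x >= dividers[col]) {
                col++;
            }
            cells[col] = cells[col].isEmpty() ? f.text : cells[col] + " " + f.text;
        }
        List<String> quoted = new ArrayList<>();
        for (String cell : cells) {
            quoted.add(quote(cell));
        }
        return String.join(",", quoted);
    }

    /** Quotes a cell only if it contains a comma, a double quote or a line break. */
    static String quote(String cell) {
        if (cell.contains(",") || cell.contains("\"") || cell.contains("\n")) {
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }

    public static void main(String[] args) {
        // Made-up boundaries for a six-column table like the one in the question
        double[] dividers = {100, 200, 300, 400, 500};
        List<Fragment> row = Arrays.asList(
                new Fragment(10, "name"),
                new Fragment(110, "some"),
                new Fragment(140, "data"),
                new Fragment(410, "some data 1"));
        System.out.println(toCsvRow(row, dividers)); // prints: name,some data,,,some data 1,
    }
}
```

The empty third, fourth and sixth cells survive as consecutive commas, which is what the plain text extraction in the question loses.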

If the column widths vary, you'll have to derive them from the headers or the first data column (e.g. [position.x of the last character of the first word] minus [position.x of the first character of the next word]; you'll have to work out the details yourself). That is hard, so you could hardcode the widths to begin with. With the Foxit Reader PDF app you can easily measure the column widths. Then, whenever you don't find any data in a particular column, you can add an empty column to the CSV file. I know from my own experience that this is not easy, so I wish you good luck.
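One possible way to derive the boundaries from the header row, sketched in plain Java: take the start and end x coordinate of every header cell and put a divider halfway into each gap. Using the midpoint is my own assumption, not something a library prescribes, and DividerEstimator, dividersFromHeaders and the coordinates are made up for the example.

```java
import java.util.Arrays;

class DividerEstimator {

    /**
     * headerSpans[i] = {startX, endX} of the i-th header cell, left to right.
     * Returns one divider per gap, placed at the midpoint between the end of
     * one header and the start of the next.
     */
    static double[] dividersFromHeaders(double[][] headerSpans) {
        double[] dividers = new double[Math.max(0, headerSpans.length - 1)];
        for (int i = 0; i < dividers.length; i++) {
            double endOfCurrent = headerSpans[i][1];
            double startOfNext = headerSpans[i + 1][0];
            dividers[i] = (endOfCurrent + startOfNext) / 2.0;
        }
        return dividers;
    }

    public static void main(String[] args) {
        // Three header cells with made-up coordinates
        double[][] headers = {{10, 40}, {110, 170}, {210, 250}};
        System.out.println(Arrays.toString(dividersFromHeaders(headers))); // prints: [75.0, 190.0]
    }
}
```

This only works if every column has a header, empty columns included. If a header is missing, that boundary still has to be measured and hardcoded.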