Ideas for extracting specific invoice PDF data from different formats and converting it to Excel

I am currently working on a digitalisation project that consists of extracting specific information from electricity invoices in PDF format. Once the data is extracted, I would like to store it in an Excel spreadsheet.

The objectives are the following. First, the data to be extracted is shown in this image:

https://i.stack.imgur.com/6RLo2.png

In this case, the data to be extracted is the information outlined in red: the CUPS (the supply point code), the total amount, and the electricity consumed per period (P1-P6).

Once this is extracted, I would like to write it to an Excel spreadsheet.

Could you please give me any ideas or tips for extracting this data? I understand that OCR software would do this best, but I do not know how to extract only this specific information.

Thanks for your help and advice.

CodePudding user response：

I don't believe there is a clean and consistent way to do this yet.
If your invoice templates are always the same format and resolution, then the pixel coordinates of the text positions should be the same.

This means that you can create a cropped image with only the text you're interested in. Then you can use your OCR tool to extract all the text and you have extracted your data field. You would have to do this for all the data fields that you want to extract.
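The crop-then-OCR idea above can be sketched in Python as follows. This is only a rough illustration under several assumptions: the box coordinates and field names are made up and would have to be measured once on a sample invoice, `page` is assumed to be a Pillow image (any object with a `crop(box)` method works), and `pytesseract` is just one possible OCR backend, imported lazily so the rest of the code runs without it.

```python
from typing import Callable, Dict, Optional, Tuple

# (left, upper, right, lower) in pixels, the order Pillow's Image.crop expects.
Box = Tuple[int, int, int, int]

# Hypothetical coordinates for illustration only: measure the real ones
# once on a sample invoice rendered at a fixed resolution.
FIELD_BOXES: Dict[str, Box] = {
    "cups": (120, 310, 620, 350),
    "total": (900, 1480, 1150, 1530),
    "p1_kwh": (300, 980, 420, 1015),
}


def default_ocr(image) -> str:
    """Run Tesseract on one cropped field (assumes pytesseract is installed)."""
    import pytesseract  # imported lazily; only needed when no custom OCR is passed

    # --psm 7 tells Tesseract to treat the image as a single line of text.
    return pytesseract.image_to_string(image, config="--psm 7")


def extract_fields(
    page,
    boxes: Dict[str, Box],
    ocr: Optional[Callable[[object], str]] = None,
) -> Dict[str, str]:
    """Crop each named box out of the page and OCR it separately."""
    ocr = ocr or default_ocr
    return {name: ocr(page.crop(box)).strip() for name, box in boxes.items()}


# Possible usage, assuming pdf2image is installed and every invoice is
# rendered at the same DPI so the pixel coordinates stay valid:
#   from pdf2image import convert_from_path
#   page = convert_from_path("invoice.pdf", dpi=300)[0]
#   print(extract_fields(page, FIELD_BOXES))
```

Passing the OCR function in as a parameter keeps the cropping logic independent of any particular OCR tool, so a different engine can be swapped in without touching the coordinates.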

This only works for invoices that always have the same format and resolution. Scanned invoices are therefore a poor fit, because each scan shifts or skews the page slightly, and tables whose number of rows varies make things considerably more complex as well.
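Once each field has been read, the raw OCR strings usually still need cleaning before they go into a spreadsheet. The sketch below is one possible way to do that, with several assumptions stated up front: the CUPS pattern (`ES`, 16 digits, 2 letters, optional digit plus letter) is my understanding of the format and should be checked against real invoices, amounts are assumed to use Spanish formatting (`1.234,56`), the sample values are invented, and the output is CSV (which Excel opens directly) rather than a native `.xlsx` file.

```python
import csv
import io
import re
from decimal import Decimal
from typing import Dict, List, Optional

# Assumed CUPS layout: "ES" + 16 digits + 2 letters + optional digit and letter.
CUPS_RE = re.compile(r"ES\d{16}[A-Z]{2}(?:\d[A-Z])?")


def parse_cups(text: str) -> Optional[str]:
    """Find a CUPS code in OCR text, ignoring spaces and letter case."""
    match = CUPS_RE.search(text.upper().replace(" ", ""))
    return match.group(0) if match else None


def parse_amount(text: str) -> Optional[Decimal]:
    """Convert the first number in the text, e.g. '1.234,56 €' -> 1234.56.

    Assumes '.' is the thousands separator and ',' the decimal separator,
    and that the text is a single cropped field (the first number wins).
    """
    match = re.search(r"\d[\d.]*(?:,\d+)?", text)
    if not match:
        return None
    return Decimal(match.group(0).replace(".", "").replace(",", "."))


def rows_to_csv(rows: List[Dict[str, object]], fieldnames: List[str]) -> str:
    """Serialise one row per invoice as CSV text that Excel can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample values standing in for OCR output of three cropped fields.
row = {
    "cups": parse_cups("ES 0021 0000 1234 5678 AB"),
    "total": parse_amount("1.234,56 €"),
    "p1_kwh": parse_amount("312 kWh"),
}
print(rows_to_csv([row], ["cups", "total", "p1_kwh"]))
```

If a real `.xlsx` file is required, the same rows could be written with a spreadsheet library instead of the `csv` module. Returning `None` for unreadable fields makes OCR failures visible as empty cells rather than silently wrong numbers.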

CodePudding user response：

In Excel, you may want to use Power Query to read the PDF:

https://docs.microsoft.com/en-us/power-query/connectors/pdf

You can then process the imported tables further within Power Query to extract the data you want.