Importing rotated text from a PDF table such as with tabula-py in python

Time:11-20

Is there a way to import rotated text from a PDF table such as with tabula-py in python?

I realize I can just rename the column headers in this case, but I was wondering whether there is a parameter for importing rotated text. I don't see any mention of rotation in the tabula-py Read the Docs pages, and I haven't found other packages that do this either (I did see a mention of rotating an entire page, which doesn't quite fit this use case, since renaming the columns would be easier).

Example:

import tabula

list_df = tabula.read_pdf(
    'https://sos.oregon.gov/elections/Documents/statistics/G22-Daily-Ballot-Returns.pdf',
    pages=3
)

list_df[0]

Screen Capture of Result and PDF

CodePudding user response:

I just tried camelot, and it correctly reads the rotated text in the column headers: this is the result.

CodePudding user response:

As @Francesco mentioned, camelot is better than tabula-py in one particular way: it finds the rotated text.

Installing camelot was a difficult process, so I thought I would share what I learned here.

  1. Dependency for camelot: Ghostscript https://camelot-py.readthedocs.io/en/master/user/install-deps.html

On a Mac: brew install ghostscript tcl-tk, and then troubleshoot any errors (there were many for me, but after copy-pasting each error, there was gold at the end of the rainbow).
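
To confirm that Python can actually see Ghostscript before moving on, a quick check along these lines may help. It is only a sketch: the helper name find_ghostscript is made up for illustration, and the executable names it looks for (gs on macOS/Linux, gswin64c or gswin32c on Windows) are the usual defaults, which may differ on a given setup.

```python
import shutil
from ctypes.util import find_library


def find_ghostscript():
    """Return a dict describing where Ghostscript was found (values may be None).

    Hypothetical helper for illustration; not part of camelot.
    """
    # Usual executable names: "gs" on macOS/Linux, "gswin64c"/"gswin32c" on Windows.
    candidates = ("gs", "gswin64c", "gswin32c")
    executable = next(
        (path for path in map(shutil.which, candidates) if path), None
    )
    # Shared-library lookup; returns None when no library is found.
    library = find_library("gs")
    return {"executable": executable, "library": library}


if __name__ == "__main__":
    found = find_ghostscript()
    print(found)
    if not any(found.values()):
        print("Ghostscript not found; re-check the brew install step.")
```

If both values come back as None, the brew step most likely did not complete, or the install location is not on the PATH of the Python process.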

  2. Overview of camelot install https://camelot-py.readthedocs.io/en/master/user/install.html

On a Mac: pip install "camelot-py[cv]"

The documentation page currently says [base] rather than [cv], but the comments above it say [cv] (and Stack Overflow answers say [cv] as well).

  3. In Python (if you are using a Jupyter notebook, restart the notebook kernel)

With the following, the rotated column headers are read in just fine.

import camelot
tables = camelot.read_pdf(
    'https://sos.oregon.gov/elections/Documents/statistics/G22-Daily-Ballot-Returns.pdf',
    pages='all')
tables[3].df
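
Because pages='all' returns every table camelot finds in the document, the index 3 above is specific to this PDF. A small helper like the one below can list what was found so the right index is easier to pick. This is a sketch, not part of camelot: summarize_tables is a made-up name, and it assumes each table exposes the shape attribute and the parsing_report dictionary (with a 'page' key) that camelot's table objects provide.

```python
def summarize_tables(tables):
    """Return (index, page, shape) for each extracted table.

    Hypothetical helper; assumes camelot-style objects with
    `.parsing_report` (a dict) and `.shape` (rows, columns).
    """
    summary = []
    for index, table in enumerate(tables):
        page = table.parsing_report.get("page")
        summary.append((index, page, table.shape))
    return summary


# Usage with the `tables` object from the snippet above:
# for index, page, shape in summarize_tables(tables):
#     print(index, page, shape)
```

Alternatively, passing pages='3' (camelot expects the page selection as a string) should limit extraction to the page used in the question.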