Is there a way to import rotated text from a PDF table, such as with tabula-py in Python?
I realize I can just rename the column headers in this case, but I was wondering whether there is a parameter for importing rotated text. I don't see any mention of rotation in the readthedocs for tabula-py, and I haven't found other packages that do this either (I did see a mention of rotating an entire page, but that doesn't fit this use case, since renaming the columns would be easier).
Example:
import tabula
list_df = tabula.read_pdf(
    'https://sos.oregon.gov/elections/Documents/statistics/G22-Daily-Ballot-Returns.pdf',
    pages=3
)
list_df[0]
CodePudding user response:
I just tried camelot, and it correctly reads the rotated text in the column headers.
CodePudding user response:
As @Francesco mentioned, camelot is better than tabula-py in this particular respect: it finds the rotated text.
Installing camelot was a difficult process, so I thought I'd share what I learned here.
- Dependency for camelot: Ghostscript https://camelot-py.readthedocs.io/en/master/user/install-deps.html
On a Mac:
brew install ghostscript tcl-tk
and then troubleshoot any errors (I hit many, but working through them one at a time by copy-pasting each error message paid off in the end).
- Overview of camelot install https://camelot-py.readthedocs.io/en/master/user/install.html
On a mac:
pip install "camelot-py[cv]"
The documentation page currently says [base] rather than [cv], but the comments above say [cv] (as do Stack Overflow answers).
- In Python (if you are using a Jupyter notebook, restart the notebook kernel first)
With the following, the rotated column headers are read in just fine.
import camelot
tables = camelot.read_pdf(
    'https://sos.oregon.gov/elections/Documents/statistics/G22-Daily-Ballot-Returns.pdf',
    pages='all')
tables[3].df
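If the import fails, it may help to check which pieces of the install are actually present before chasing individual errors. Below is a minimal sketch using only the standard library. The helper name `check_camelot_deps` is made up for illustration. It assumes Ghostscript is exposed as a `gs` executable on PATH (typical after a Homebrew install on a Mac; the executable name differs on Windows) and that the [cv] extra provides OpenCV as the `cv2` module.

```python
import importlib.util
import shutil


def check_camelot_deps():
    """Report which pieces of a camelot install appear to be present.

    Hypothetical helper, not part of camelot itself.
    """
    return {
        # Assumption: Ghostscript's command-line binary is named "gs" and is on PATH.
        "ghostscript": shutil.which("gs") is not None,
        # Tkinter is imported as "tkinter"; on a Mac, brew's tcl-tk backs it.
        "tkinter": importlib.util.find_spec("tkinter") is not None,
        # camelot-py is imported as "camelot".
        "camelot": importlib.util.find_spec("camelot") is not None,
        # Assumption: the [cv] extra installs OpenCV, imported as "cv2".
        "opencv": importlib.util.find_spec("cv2") is not None,
    }


if __name__ == "__main__":
    for name, found in check_camelot_deps().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Run it in the same environment (and, for Jupyter, the same kernel) that you use for `import camelot`, since a package installed into a different environment will show up as missing here.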