I am using Camelot to extract tables from PDF files. This works very well, but it extracts only the text, not the hyperlinks that are embedded in the tables.
Is there a way of using Camelot or a similar package to extract table text and hyperlinks embedded within tables?
Thanks!
CodePudding user response:
Most applications, such as tabular text extractors, simply scrape the visible surface as plain text. The hyperlinks are often stored elsewhere in the PDF, which is NOT a WYSIWYG word processor file.
So, if you're lucky, you can extract the link coordinates straight from the file (without their page allocation), like this:
C:\Users\lz02\Downloads>type "7 - 20 November 2022 (003).pdf" |findstr /i "(http"
<</Subtype/Link/Rect[ 69.75 299.75 280.63 313.18] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(http://www.bbc.co.uk/complaints/complaint/) >>/StructParent 5>>
<</Subtype/Link/Rect[ 219.37 120.85 402.47 133.06] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(http://www.bbc.co.uk/complaints/handle-complaint/) >>/StructParent 1>>
<</Subtype/Link/Rect[ 146.23 108.64 329.33 120.85] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(http://www.bbc.co.uk/complaints/handle-complaint/) >>/StructParent 2>>
<</Subtype/Link/Rect[ 412.48 108.64 525.55 120.85] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(https://www.ofcom.org.uk/tv-radio-and-on-demand/broadcast-codes/broadcast-code) >>/StructParent 3>>
<</Subtype/Link/Rect[ 69.75 96.434 95.085 108.64] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(https://www.ofcom.org.uk/tv-radio-and-on-demand/broadcast-codes/broadcast-code) >>/StructParent 4>>
<</Subtype/Link/Rect[ 69.75 683.75 317.08 697.18] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(http://www.bbc.co.uk/complaints/comp-reports/ecu/) >>/StructParent 7>>
<</Subtype/Link/Rect[ 463.35 604.46 500.24 617.89] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(https://www.bbc.co.uk/contact/ecu/reporting-scotland-bbc-one-scotland-20-december-2021) >>/StructParent 8>>
<</Subtype/Link/Rect[ 463.35 577.11 500.24 590.54] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(https://www.bbc.co.uk/contact/ecu/book-of-the-week-preventable-radio-4-19-april-2022) >>/StructParent 9>>
<</Subtype/Link/Rect[ 463.35 522.4 521.41 535.83] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(https://www.bbc.co.uk/contact/ecu/the-one-show-bbc-one-6-october-2022) >>/StructParent 10>>
<</Subtype/Link/Rect[ 463.35 495.04 518.04 508.47] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(https://www.bbc.co.uk/contact/ecu/news-6pm-bbc-one-22-september-2022) >>/StructParent 11>>
<</Subtype/Link/Rect[ 463.35 469.04 518.04 482.47] /BS<</W 0>>/F 4/A<</Type/Action/S/URI/URI(https://www.bbc.co.uk/contact/ecu/news-1030am-bbc-news-channel-20-september-2022) >>/StructParent 12>>
NOTE the random order. To find which page each link belongs to, you need to trace it back through its /StructParent ##.
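If you want those lines as data rather than console output, here is a minimal Python sketch that pulls the /Rect numbers and the URI out of each one with a regular expression. It assumes the annotation dictionaries are stored uncompressed and on a single line, exactly like the findstr output above; the names LINK_RE and parse_links are made up for illustration. It will find nothing if the annotations sit inside compressed object streams, and it does not handle URIs containing escaped parentheses.

```python
import re

# Matches one uncompressed link annotation dictionary on a single line.
# Group 1: the four /Rect numbers, group 2: the URI string.
LINK_RE = re.compile(
    r"/Subtype\s*/Link\s*/Rect\s*\[\s*([^\]]+?)\s*\]"
    r".*?/URI\s*\(([^)]*)\)"
)


def parse_links(lines):
    """Return a list of ((x0, y0, x1, y1), uri) tuples found in the lines."""
    links = []
    for line in lines:
        match = LINK_RE.search(line)
        if match is None:
            continue  # not a link annotation line
        x0, y0, x1, y1 = (float(v) for v in match.group(1).split())
        links.append(((x0, y0, x1, y1), match.group(2)))
    return links


# One line copied from the output above
sample = [
    "<</Subtype/Link/Rect[ 69.75 299.75 280.63 313.18] /BS<</W 0>>/F 4"
    "/A<</Type/Action/S/URI/URI(http://www.bbc.co.uk/complaints/complaint/) >>"
    "/StructParent 5>>",
]

for rect, uri in parse_links(sample):
    print(rect, uri)
```

The rectangles are in PDF page coordinates with the origin at the bottom-left, so they still need a page number before they can be compared with anything else.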
CodePudding user response:
Yes, it's possible, but not with Camelot alone. Camelot extracts only the text of each cell and does not return hyperlink destinations. It does, however, expose the position of every cell: table.cells holds Cell objects whose x1, y1, x2 and y2 attributes give the cell's bounding box in PDF coordinates, and table.page gives the page number. With this information you can work out which cell a link annotation's /Rect falls in, provided you read the annotations themselves with another tool.
Here is an example of how to get the text and bounding box of every cell with Camelot. On its own it can only flag URLs that appear as visible cell text:
import camelot

# Parse the tables; read_pdf returns a TableList directly (there is no separate extract step)
tables = camelot.read_pdf("example.pdf", flavor="lattice", pages="all")

# Iterate over the tables
for table in tables:
    # table.cells holds Cell objects row by row; table.data holds only the strings
    for row in table.cells:
        # Iterate over the cells in the row
        for cell in row:
            text = cell.text.strip()
            # Bounding box in PDF coordinates (origin at the bottom-left of the page)
            bbox = (cell.x1, cell.y1, cell.x2, cell.y2)
            # A URL is only detectable here if it is printed as visible text
            if text.startswith("http"):
                print(table.page, text, bbox)
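To attach a hyperlink to a cell, the link rectangle has to be matched against the cell boxes by position. Below is a rough sketch of that step in plain Python. It assumes you already have, for one page, the cell boxes as (x1, y1, x2, y2) tuples keyed by (row, column) and the link rectangles taken from the /Rect entries of that page's link annotations, both in the same PDF coordinate space. The function name cell_for_link and the cell boxes are invented for illustration; the two link rectangles are copied from the findstr output in the other answer.

```python
def cell_for_link(link_rect, cells):
    """Return the (row, col) key of the cell containing the centre of
    link_rect, or None if no cell contains it."""
    lx0, ly0, lx1, ly1 = link_rect
    cx, cy = (lx0 + lx1) / 2, (ly0 + ly1) / 2
    for key, (x0, y0, x1, y1) in cells.items():
        # min/max so the test works whichever corner order the box uses
        if min(x0, x1) <= cx <= max(x0, x1) and min(y0, y1) <= cy <= max(y0, y1):
            return key
    return None


# Hypothetical cell boxes for one page: (row, col) -> (x1, y1, x2, y2)
cells = {
    (0, 0): (60.0, 600.0, 450.0, 625.0),
    (0, 1): (450.0, 600.0, 540.0, 625.0),
    (1, 0): (60.0, 570.0, 450.0, 600.0),
    (1, 1): (450.0, 570.0, 540.0, 600.0),
}

# Link rectangles and targets as found in the link annotations
links = [
    ((463.35, 604.46, 500.24, 617.89),
     "https://www.bbc.co.uk/contact/ecu/reporting-scotland-bbc-one-scotland-20-december-2021"),
    ((463.35, 577.11, 500.24, 590.54),
     "https://www.bbc.co.uk/contact/ecu/book-of-the-week-preventable-radio-4-19-april-2022"),
]

for rect, uri in links:
    print(cell_for_link(rect, cells), uri)
```

Testing the centre of the link rather than requiring full containment tolerates links that slightly overshoot a cell border. Rotated pages or a page box that does not start at (0, 0) would need the coordinates adjusted first.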