I am working on extracting links and their locations from a PDF document.
Each link has a bounding Rect.
<</A 169 0 R/BS<</S/S/Type/Border/W 0>>/Border[0 0 0]/H/N/Rect[97.0153 116.556 185.543 21.5209]/Subtype/Link/Type/Annot>>
Let's look at Rect[97.0153 116.556 185.543 21.5209]
(I got it by copying from my file).
Rectangle is an array of four numbers in default user space units giving the coordinates of the left, bottom, right and top edges respectively.
So we have left = 97.0153, bottom = 116.556, right = 185.543, top = 21.5209.
As far as I know PDF's user space is a regular Cartesian coordinate system. Therefore,
I expect the rect's top edge to be greater than its bottom edge, so to calculate the rect's height I use the formula height = top - bottom. For this particular PDF document, height = 21.5209 - 116.556 = -95.0351, i.e. the height is negative, so I am doing something wrong.
I must have missed something and made wrong assumptions. Can anybody advise, please?
CodePudding user response:
Concerning your characterization
Rectangle is an array of four numbers in default user space units giving the coordinates of the left, bottom, right and top edges respectively.
you said in a comment
I got it from the BBox entry description, too. BBox is declared to have type rectangle and is provided with a description of a rectangle. PDF Reference, 3rd edition, p. 616.
But the BBox entry description is not the definition of the rectangle type. The definition is on page 101:
3.8.3 Rectangles
Rectangles are used to describe locations on a page and bounding boxes for a variety of objects, such as fonts. A rectangle is written as an array of four numbers giving the coordinates of a pair of diagonally opposite corners. Typically, the array takes the form
[llx lly urx ury]
specifying the lower-left x, lower-left y, upper-right x, and upper-right y coordinates of the rectangle, in that order.
Note: Although rectangles are conventionally specified by their lower-left and upper-right corners, it is acceptable to specify any two diagonally opposite corners. Applications that process PDF should be prepared to normalize such rectangles in situations where specific corners are required.
Thus, the characterization you found in the BBox entry description makes the typical form a requirement for that particular BBox entry only. Other rectangles, such as the Rect of your link annotation, may still use a non-typical form. You therefore need to normalize the array (or, in your case, take the absolute value of that difference as the height).
As an aside, I wouldn't count on BBox entries to always be in that typical form, either. Always be prepared to normalize rectangle arrays.
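To illustrate the normalization, here is a minimal Python sketch. It assumes the rectangle has already been read from the PDF into a list of four numbers; the function name normalize_rect is made up for this example and does not belong to any PDF library.

```python
def normalize_rect(rect):
    """Return (llx, lly, urx, ury) for a rectangle given by any two
    diagonally opposite corners."""
    x0, y0, x1, y1 = rect
    # The smaller x and y form the lower-left corner,
    # the larger x and y form the upper-right corner.
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


# The Rect array from the question: the second corner lies below the first.
rect = [97.0153, 116.556, 185.543, 21.5209]

llx, lly, urx, ury = normalize_rect(rect)
width = urx - llx    # never negative after normalization
height = ury - lly   # never negative after normalization
print(f"width = {width:.4f}, height = {height:.4f}")
```

After normalization the lower-left corner is (97.0153, 21.5209) and the upper-right corner is (185.543, 116.556), so both width and height are non-negative regardless of which pair of opposite corners the PDF producer wrote.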
That being said, you should do yourself a favor and stop using that old PDF Reference from 2001; use an ISO standard specifying PDF instead, i.e. ISO 32000-1 or ISO 32000-2. ISO 32000-1 was published in 2008. ISO 32000-2 was published in 2017 and updated in 2020.
If you don't want to spend money on the specification, Adobe has shared a copy of ISO 32000-1 with the ISO page headers removed on their web site. To find it, simply search for "PDF32000"; currently it is at https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
The only situation in which the PDF 1.4 Reference should still be used is in the context of standards based on that very Reference, e.g. PDF/A-1. Otherwise the PDF References are obsolete, and the Adobe PDF architect Leonard Rosenthol had already described them as not normative in nature.