How to read the color of a line in a pdf with iText?

Time:01-04

I am reading a PDF file with iText in PowerShell, line by line. I need to know the color of each line as I read it, but I have no idea how to get that information.

This is the code I have so far:

Add-Type -Path "C:\Users\Ion\Documents\App\Scripts\itextsharp.dll"
$filePath = "C:\Users\Scripts\Datos\ADMINISTRATIVO-AEPSA-SERV.-CENTRALES-modificado.pdf"  # File to read
$pdf = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList $filePath

$export = ""
foreach ($page in 1..($pdf.NumberOfPages)) {
    # Append each page's text; a plain "=" would keep only the last page
    $export += [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdf, $page) + "`n"
    # $color = Here I should be able to get the color of the line to process it.
}
$pdf.Close()

$export | Out-File C:\Users\Scripts\Datos\datos.txt  # The extracted text

Here is the document I am working with:


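This can be done programmatically with three lines of cmd. The only dependency is mutool.exe (from MuPDF), which converts each page to HTML and keeps the text colour in the span styles. The %%i syntax below is for a batch (.cmd) file: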

md output
REM we could query num pages and set=pages here but this is just a Proof Of Concept so use known 68
for /l %%i in (1,1,68) do mutool convert -o output\text%%i.html test.pdf %%i 
REM from inspection of result we know 
REM red  = font-family:Verdana,serif;font-size:10.0pt;color:#ff0000
REM blue = font-family:Verdana,serif;font-size:10.0pt;color:#3399ff
REM so we can extract those independently
for /l %%c in (1,1,68) do type output\text%%c.html |find /n "#ff0000" >>output\text%%c-red.txt
for /l %%c in (1,1,68) do type output\text%%c.html |find /n "#3399ff" >>output\text%%c-blue.txt

Result:

[21]<p style="top:192.4pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">d) Las respuestas b) y c) son correctas.</span></p>
[34]<p style="top:350.4pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">b) El pluralismo pol&#xed;tico.</span></p>
[45]<p style="top:508.3pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">b) En el T&#xed;tulo Preliminar.</span><span style="font-family:Verdana,serif;font-size:10.0pt;color:#201c1d"> </span></p>
[65]<p style="top:714.9pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">a) Que la dignidad de la persona es fundamento del orden pol&#xed;tico y de la paz social.</span></p>

NOTE: there is a slight wrinkle with the third result line ([45]): it also contains a span in another colour (a rogue single space in #201c1d) that will need to be split off.

You can get your desired output with similar simple text replacement in PowerShell, modify the commands to export only the parts you need, add other colours, and so on.
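As a minimal PowerShell sketch of that post-processing, the function below pulls each span's colour and text out of the HTML lines. It assumes the spans have exactly the shape shown in the results above (a style attribute containing color:#rrggbb); Get-ColoredSpan is a hypothetical helper name, not part of mutool or iText. Whitespace-only spans, such as the rogue #201c1d space, are skipped.

```powershell
# Hypothetical helper: emits one object per <span> with its colour and decoded text.
function Get-ColoredSpan {
    param(
        [Parameter(Mandatory)] [string[]] $HtmlLine
    )
    # Assumes the span shape seen in the sample output:
    # <span style="...;color:#rrggbb">text</span>
    $pattern = '<span style="[^"]*color:(#[0-9a-fA-F]{6})[^"]*">(.*?)</span>'
    foreach ($line in $HtmlLine) {
        foreach ($m in [regex]::Matches($line, $pattern)) {
            # HtmlDecode turns entities such as &#xed; back into characters
            $text = [System.Net.WebUtility]::HtmlDecode($m.Groups[2].Value)
            if ($text.Trim()) {   # skip whitespace-only spans
                [pscustomobject]@{
                    Color = $m.Groups[1].Value.ToLower()
                    Text  = $text.Trim()
                }
            }
        }
    }
}

# Sample line copied from the results above (red text plus a rogue #201c1d space)
$sample = '<p style="top:508.3pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">b) En el T&#xed;tulo Preliminar.</span><span style="font-family:Verdana,serif;font-size:10.0pt;color:#201c1d"> </span></p>'

$red = Get-ColoredSpan -HtmlLine $sample | Where-Object Color -eq '#ff0000'
$red.Text   # b) En el Título Preliminar.
```

For a whole page, something like Get-ColoredSpan -HtmlLine (Get-Content output\text3.html) should work the same way, since mutool writes the accented characters as entities.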

Bold and italic PDF fonts are reflected in the HTML as <b> (bold) and <i> (italic) tags.
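If the font style matters as well as the colour, a small check for those tags may be enough. This is only a sketch: Get-FontStyle is a hypothetical helper name, and it assumes mutool writes the tags literally as <b> and <i>, as in the sample output.

```powershell
# Hypothetical helper: reports whether an HTML line from mutool contains <b> and/or <i>.
function Get-FontStyle {
    param(
        [Parameter(Mandatory)] [string] $HtmlLine
    )
    [pscustomobject]@{
        Bold   = $HtmlLine -match '<b>'
        Italic = $HtmlLine -match '<i>'
    }
}

# Sample line in the shape mutool produces for Verdana-Bold text
$line = '<p style="top:83.1pt;left:92.3pt;line-height:10.0pt"><b><span style="font-family:Verdana,serif;font-size:10.0pt;color:#3399ff">Normativa:</span></b></p>'

$style = Get-FontStyle -HtmlLine $line
$style.Bold     # True
$style.Italic   # False
```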

File: ADMINISTRATIVO-AEPSA-SERV.-CENTRALES-modificado.pdf
Created: 9/20/2022 1:38:39 PM
Application: Writer
PDF Producer: OpenOffice 4.1.5
Fonts: 
ArialUnicodeMS (TrueType; embedded)
Verdana (TrueType; embedded)
Verdana-Bold (TrueType; embedded)
Verdana-BoldItalic (TrueType; embedded)
Verdana-Italic (TrueType; embedded)

P.S.

To get red and blue combined, replace the last two lines with this one (the output file name contains a space, so it must be quoted):

for /l %%c in (1,1,68) do type output\text%%c.html |findstr /n "#3399ff #ff0000" >>"output\text%%c-red blue.txt"

Sample of the first 4 red and blue lines on page 3; note that the second line is Verdana-Bold, so it is wrapped in <b>:

12:<p style="top:58.8pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">d) Por ley org&#xe1;nica.</span></p>
13:<p style="top:83.1pt;left:92.3pt;line-height:10.0pt"><b><span style="font-family:Verdana,serif;font-size:10.0pt;color:#3399ff">Normativa:</span></b></p>
14:<p style="top:95.2pt;left:92.3pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#3399ff">La fundamentaci&#xf3;n legal de esta pregunta la encontramos en el art&#xed;culo 57.5 de la  </span></p>
15:<p style="top:107.4pt;left:92.3pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#3399ff">Constituci&#xf3;n Espa&#xf1;ola, conforme al cual: </span></p>