I am reading a PDF file with iTextSharp in PowerShell, line by line. I need to know the color of each line I read, and I have no idea how to get that information.
This is the code I have so far:
Add-Type -Path "C:\Users\Ion\Documents\App\Scripts\itextsharp.dll"
$filePath = "C:\Users\Scripts\Datos\ADMINISTRATIVO-AEPSA-SERV.-CENTRALES-modificado.pdf" # File to modify
$pdf = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList $filePath
$export = ""
foreach ($page in 1..($pdf.NumberOfPages)) {
    # Append each page's text; a plain "=" would keep only the last page
    $export += [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdf, $page)
    # $color = Here I should be able to get the color of the line to process it.
}
$pdf.Close()
$export | Out-File C:\Users\Scripts\Datos\datos.txt # The modified File
Here is the document I am working with:
This can be done programmatically with a few lines of cmd in a batch file; the only dependency is mutool.exe (part of MuPDF).
md output
REM We could query the page count and store it in a variable here, but this is just a proof of concept, so use the known 68
for /l %%i in (1,1,68) do mutool convert -o output\text%%i.html test.pdf %%i
REM from inspection of result we know
REM red = font-family:Verdana,serif;font-size:10.0pt;color:#ff0000
REM blue = font-family:Verdana,serif;font-size:10.0pt;color:#3399ff
REM so we can extract those independently
for /l %%c in (1,1,68) do type output\text%%c.html |find /n "#ff0000" >>output\text%%c-red.txt
for /l %%c in (1,1,68) do type output\text%%c.html |find /n "#3399ff" >>output\text%%c-blue.txt
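The page count of 68 is hard-coded in the loops above. One possible way to look it up, sketched here in PowerShell, is to read the "Pages:" line that mutool info prints. The helper name Get-PdfPageCount is made up for this example, and the exact output layout is an assumption worth checking against your mutool version.

```powershell
# Hypothetical helper: extracts the page count from the text printed by
# "mutool info". Assumes that text contains a line like "Pages: 68".
function Get-PdfPageCount {
    param([string[]] $InfoText)
    foreach ($line in $InfoText) {
        if ($line -match '^\s*Pages:\s*(\d+)') {
            return [int]$Matches[1]
        }
    }
    throw 'No "Pages:" line found in the mutool info output.'
}

# Usage (requires mutool.exe on the PATH):
# $pages = Get-PdfPageCount -InfoText (mutool info test.pdf)
# foreach ($i in 1..$pages) { mutool convert -o "output\text$i.html" test.pdf $i }
```

Keeping the parsing separate from the mutool call means the function can be tried on saved output without the tool installed.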
Sample of the red lines this finds on one page:
[21]<p style="top:192.4pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">d) Las respuestas b) y c) son correctas.</span></p>
[34]<p style="top:350.4pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">b) El pluralismo político.</span></p>
[45]<p style="top:508.3pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">b) En el Título Preliminar.</span><span style="font-family:Verdana,serif;font-size:10.0pt;color:#201c1d"> </span></p>
[65]<p style="top:714.9pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">a) Que la dignidad de la persona es fundamento del orden político y de la paz social.</span></p>
NOTE: there is a slight wrinkle with the third line ([45]): it also contains a span in another colour (a stray single space in #201c1d), which will need to be split off.
You can do something similar with simple text replacement in PowerShell to get your desired output, or modify the commands to export only the parts you need, add other colours, and so on.
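As a rough illustration of the PowerShell route, the sketch below pulls the coloured spans out of the HTML that mutool produced and keeps only the requested colour. The function name Get-ColoredText and the regex are assumptions made for this example, based only on the span format visible in the sample output above. Whitespace-only spans, such as the stray #201c1d space, are skipped.

```powershell
# Hypothetical helper (not part of mutool or iTextSharp): returns the text of
# every <span> whose inline style ends in the given colour.
function Get-ColoredText {
    param(
        [Parameter(Mandatory)] [string] $Html,
        [Parameter(Mandatory)] [string] $Color   # e.g. '#ff0000'
    )
    # Assumes spans look like <span style="...;color:#rrggbb">text</span>,
    # as in the mutool output shown above.
    $pattern = '<span style="[^"]*color:(#[0-9a-fA-F]{6})">([^<]*)</span>'
    foreach ($m in [regex]::Matches($Html, $pattern)) {
        $text = [System.Net.WebUtility]::HtmlDecode($m.Groups[2].Value)
        # Keep the span only if the colour matches and it is not just whitespace
        if ($m.Groups[1].Value -eq $Color -and $text.Trim()) {
            $text   # emit the span text to the pipeline
        }
    }
}

# Example with one line taken from the output above
$sample = '<p style="top:508.3pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">b) En el Título Preliminar.</span><span style="font-family:Verdana,serif;font-size:10.0pt;color:#201c1d"> </span></p>'
$red = @(Get-ColoredText -Html $sample -Color '#ff0000')
```

For a whole page, the HTML could be read with Get-Content output\text1.html -Raw -Encoding UTF8 and passed as -Html. A regex is enough here because the generated markup is very regular; for arbitrary HTML a real parser would be the safer choice.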
Bold and italic PDF fonts are reflected in the HTML as tags: <b> = bold, <i> = italic.
File: ADMINISTRATIVO-AEPSA-SERV.-CENTRALES-modificado.pdf
Created: 9/20/2022 1:38:39 PM
Application: Writer
PDF Producer: OpenOffice 4.1.5
Fonts:
ArialUnicodeMS (TrueType; embedded)
Verdana (TrueType; embedded)
Verdana-Bold (TrueType; embedded)
Verdana-BoldItalic (TrueType; embedded)
Verdana-Italic (TrueType; embedded)
P.S.
For red and blue combined, replace the last two lines with this one (the output file name contains a space, so it must be quoted):
for /l %%c in (1,1,68) do type output\text%%c.html |findstr /n "#3399ff #ff0000" >>"output\text%%c-red blue.txt"
Sample of the first four red and blue lines on page 3; note that the second line is Verdana-Bold, so it is wrapped in <b>:
12:<p style="top:58.8pt;left:56.8pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#ff0000">d) Por ley orgánica.</span></p>
13:<p style="top:83.1pt;left:92.3pt;line-height:10.0pt"><b><span style="font-family:Verdana,serif;font-size:10.0pt;color:#3399ff">Normativa:</span></b></p>
14:<p style="top:95.2pt;left:92.3pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#3399ff">La fundamentación legal de esta pregunta la encontramos en el artículo 57.5 de la </span></p>
15:<p style="top:107.4pt;left:92.3pt;line-height:10.0pt"><span style="font-family:Verdana,serif;font-size:10.0pt;color:#3399ff">Constitución Española, conforme al cual: </span></p>