String split to detect a page change in text extracted from a PDF

I'm trying to analyse a PDF document with the iTextSharp library. The goal is to read all the text and split it into individual lines.

To do this, I call Split on the extracted text, which I hold in a single String variable, like this:

 Dim RigheTesto As String()
 RigheTesto = testoEstrapolato.Split({vbCrLf, vbCr, vbLf}, StringSplitOptions.RemoveEmptyEntries)

Split works fine, and I get a String array with entries like "Data type: value", one element for every line of the original file.

However, when the text crosses a page change in the original PDF, Split does not see a line break there, so the first line of the new page is joined to the last line of the previous page.

Do you know how to solve this problem?

Thanks for your time!

CodePudding user response:

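The following shows how to extract text from a PDF file page by page using the NuGet package iTextSharp (tested with v5.5.13.2). Because each page is extracted and split on its own, the last line of one page cannot be merged with the first line of the next.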

Download/install NuGet package iTextSharp

Create a class (name: PdfPageInfo.vb)

Public Class PdfPageInfo
    Public Property PageNumber As Integer
    Public Property Lines As List(Of String) = New List(Of String)
End Class

Create a module (name: HelperiTextSharp.vb)

Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Module HelperiTextSharp
    Public Function ExtractText(filename As String) As List(Of PdfPageInfo)
        Dim pageInfoList As List(Of PdfPageInfo) = New List(Of PdfPageInfo)

        Using reader As PdfReader = New PdfReader(filename)
            For i As Integer = 1 To reader.NumberOfPages Step 1

                'create new instance
                Dim pageInfo As PdfPageInfo = New PdfPageInfo()

                'set value
                pageInfo.PageNumber = i

                'get text from PDF page
                Dim pageText As String = PdfTextExtractor.GetTextFromPage(reader, i)

                'split on newline and set value
                pageInfo.Lines = pageText.Split(New String() {vbCrLf, vbCr, vbLf}, StringSplitOptions.RemoveEmptyEntries).ToList()

                'add the page to the result list
                pageInfoList.Add(pageInfo)
            Next
        End Using

        Return pageInfoList
    End Function
End Module
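
As a rough illustration of why splitting each page separately helps, the sketch below uses two made-up strings in place of real page text. It assumes the original code built one long string by appending the pages without a separator, which the question does not actually show; PageJoinDemo and SplitLines are hypothetical names used only for this example.

```vb
Imports System
Imports Microsoft.VisualBasic

Module PageJoinDemo
    ' Hypothetical helper: splits on any newline style and drops empty entries.
    Public Function SplitLines(text As String) As String()
        Return text.Split(New String() {vbCrLf, vbCr, vbLf}, StringSplitOptions.RemoveEmptyEntries)
    End Function

    Sub Main()
        ' Made-up sample text standing in for two PDF pages.
        Dim page1 As String = "Name: Alice" & vbLf & "Age: 30"
        Dim page2 As String = "City: Rome" & vbLf & "Zip: 00100"

        ' No separator between pages: "Age: 30" and "City: Rome" end up in one entry.
        Dim fused As String() = SplitLines(page1 & page2)

        ' A newline between pages keeps the two lines apart.
        Dim separated As String() = SplitLines(page1 & vbLf & page2)

        Console.WriteLine(fused.Length)      ' prints 3
        Console.WriteLine(separated.Length)  ' prints 4
    End Sub
End Module
```

Splitting each page on its own, as ExtractText does, avoids the problem without having to insert a separator at all.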

Usage:

Dim ofd As OpenFileDialog = New OpenFileDialog()
ofd.Filter = "PDF files(*.pdf)|*.pdf"

If ofd.ShowDialog = DialogResult.OK Then
    Dim pdfPageInfoList As List(Of PdfPageInfo) = HelperiTextSharp.ExtractText(ofd.FileName)

    For Each pInfo As PdfPageInfo In pdfPageInfoList
        Debug.WriteLine("Page Number: " & pInfo.PageNumber.ToString())

        For i As Integer = 0 To pInfo.Lines.Count - 1 Step 1
            Debug.WriteLine("[" & i & "]: " & pInfo.Lines(i))
        Next

        Debug.WriteLine("---------------------------------" & vbCrLf)
    Next
End If
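
If a single array of lines for the whole document is still wanted, as in the question, the per-page lists can be concatenated afterwards. This is a minimal sketch: FlattenPagesDemo and FlattenLines are hypothetical names, and PdfPageInfo is repeated here only so the block compiles on its own (omit it when adding this to a project that already has the class).

```vb
Imports System.Collections.Generic
Imports System.Linq

' Same shape as the PdfPageInfo class above; repeated only to keep the sketch self-contained.
Public Class PdfPageInfo
    Public Property PageNumber As Integer
    Public Property Lines As List(Of String) = New List(Of String)
End Class

Module FlattenPagesDemo
    ' Hypothetical helper: returns every line of every page, in page order.
    Public Function FlattenLines(pages As IEnumerable(Of PdfPageInfo)) As String()
        Dim allLines As New List(Of String)

        For Each page As PdfPageInfo In pages.OrderBy(Function(p) p.PageNumber)
            ' Lines were already split per page, so nothing can be joined across pages.
            allLines.AddRange(page.Lines)
        Next

        Return allLines.ToArray()
    End Function
End Module
```

Assuming the ExtractText function above, it could be called as Dim RigheTesto As String() = FlattenPagesDemo.FlattenLines(HelperiTextSharp.ExtractText(ofd.FileName)).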

Resource:

How to read pdf file in C#? (Working example using iTextSharp)