Concatenating Output from Folder

Time:11-01

I have thousands of PDF documents that I am trying to comb through to pull out only certain data. I have successfully created a script that goes through each PDF, puts its content into a .txt file, and then searches the final .txt for the requested information. The only part I am stuck on is combining the data from every PDF into this one .txt file. Currently, each successive PDF simply overwrites the previous data, so the search is only performed on the final PDF in the folder. How can I alter this code so that each PDF's text is concatenated into the .txt instead of overwriting it?

$all = Get-Childitem -Path $file1 -Recurse -Filter *.pdf
foreach ($f in $all){
    $outfile = $f -join ', '
    $text = convert-PDFtoText $outfile
}

Here is my entire script for reference:

Start-Process powershell.exe -Verb RunAs {

function convert-PDFtoText {
    param(
        [Parameter(Mandatory=$true)][string]$file
    )
    Add-Type -Path "C:\ps\itextsharp.dll"
    $pdf = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $file
    for ($page = 1; $page -le $pdf.NumberOfPages; $page++){
        $text=[iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdf,$page)
        Write-Output $text
    }
    $pdf.Close()
}


$content = Read-Host "What are we looking for?: "
$file1 = Read-Host "Path to search: "

$all = Get-Childitem -Path $file1 -Recurse -Filter *.pdf
foreach ($f in $all){
    $outfile = $f -join ', '
    $text = convert-PDFtoText $outfile
}





$text | Out-File "C:\ps\bulk.txt"
Select-String -Path C:\ps\bulk.txt -Pattern $content | Out-File "C:\ps\select.txt"


Start-Sleep -Seconds 60

}

Any help would be greatly appreciated!

Answer:

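In your loop, $text is reassigned on every iteration, so only the last PDF's text ever reaches Out-File. To capture the output of all convert-PDFtoText calls in a single output file, use a single pipeline with the ForEach-Object cmdlet: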

Get-ChildItem -Path $file1 -Recurse -Filter *.pdf |
  ForEach-Object { convert-PDFtoText $_.FullName } |
    Out-File "C:\ps\bulk.txt"
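
If you would rather keep a foreach loop, the same idea can be sketched without a pipeline: assign the output of the whole foreach statement rather than assigning inside its body. The snippet below is only an illustration. Get-FakeText is a hypothetical stand-in for convert-PDFtoText (so it runs without iTextSharp), and the file names are made up.

```powershell
# Hypothetical stand-in for convert-PDFtoText: emits two "pages" of text per file.
function Get-FakeText {
    param([Parameter(Mandatory)][string] $file)
    "page 1 of $file"
    "page 2 of $file"
}

$names = 'a.pdf', 'b.pdf', 'c.pdf'   # made-up file names, for illustration only

# Overwriting: $last is reassigned on every pass, so only c.pdf's text survives.
foreach ($n in $names) {
    $last = Get-FakeText $n
}

# Collecting: assign the output of the entire foreach statement instead.
$allText = foreach ($n in $names) {
    Get-FakeText $n
}

"Overwritten result: $($last.Count) lines"
"Collected result:   $($allText.Count) lines"
```

The first count is 2 (only the last file's pages) and the second is 6 (all three files). In the real script, $allText | Out-File "C:\ps\bulk.txt" would then write everything at once.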

A tweak to your convert-PDFtoText function would allow for a more concise and efficient solution:

Make convert-PDFtoText accept Get-ChildItem input directly from the pipeline:

function convert-PDFtoText {
    param(
        [Alias('FullName')]
        [Parameter(Mandatory, ValueFromPipelineByPropertyName)]
        [string] $file
    )

    begin {
      Add-Type -Path "C:\ps\itextsharp.dll"
    }

    process {
      $pdf = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $file
      for ($page = 1; $page -le $pdf.NumberOfPages; $page++) {
        [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdf,$page)
      }
      $pdf.Close()
    }

}

This then allows you to simplify the command at the top to:

Get-ChildItem -Path $file1 -Recurse -Filter *.pdf |
  convert-PDFtoText |
    Out-File "C:\ps\bulk.txt"
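
To see the property binding in isolation, here is a small sketch that runs without iTextSharp. Show-BoundPath is a hypothetical stand-in that declares its parameter the same way as the function above but only echoes the path it receives, and the temporary folder and file names are made up. The point is just that the FullName alias lets each Get-ChildItem result bind to $file by property name.

```powershell
# Hypothetical stand-in: same parameter declaration as convert-PDFtoText above,
# but it only echoes the path it received instead of reading a PDF.
function Show-BoundPath {
    param(
        [Alias('FullName')]
        [Parameter(Mandatory, ValueFromPipelineByPropertyName)]
        [string] $file
    )
    process {
        "received: $file"
    }
}

# Create a throwaway folder with two empty .pdf files (illustration only).
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ([System.Guid]::NewGuid().ToString())
$null = New-Item -ItemType Directory -Path $dir
$null = New-Item -ItemType File -Path (Join-Path $dir 'one.pdf')
$null = New-Item -ItemType File -Path (Join-Path $dir 'two.pdf')

# Each file object's FullName property binds to -file through the alias,
# and the process block runs once per file.
$result = Get-ChildItem -Path $dir -Recurse -Filter *.pdf | Show-BoundPath
$result

Remove-Item -Path $dir -Recurse -Force   # clean up the throwaway folder
```

Because the process block runs once per input object, the output of every file flows down the same pipeline, which is why a single Out-File at the end receives all of it.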