I'm having difficulty understanding how to obtain the content from a PdfDocument. I've learned from previous questions that PdfDocument flushes the content to optimize working with large documents. If my function returns a new PdfDocument, how do I get the byte[] to pass into my other functions?
Even with PdfDocument.GetReader(), I can't find what I'm looking for.
My use-case is as follows:
- Get pdf content from an email attachment
- Pass the pdf to a helper function, which extracts specific pages from the initial attachment
- Pass the new PdfDocument into a function which calls Azure's Forms Recognizer API to read the fields into an object
To summarize: given a PdfDocument only, how can I get/create a byte[] from it?
Here is my code:
public async Task<BaseResponse> Handle(ReceiveEmailCommand command, CancellationToken cancellationToken) {
var ms = new MemoryStream(command.attachments.First().Content);
var extractedDocument = pdfService.PreparePdfDocument(ms);
var analyzedDocument = await formsRecognizerService.AnalyzeDocument(extractedDocument);
// Do stuff with the analyzed document...
var response = await FileWebService.AddAnalyzedDocumentToFileSystem(analyzedDocument);
}
The function AnalyzeDocument expects a Stream parameter. I want to pass something like
new Stream(extractedDocument.GetReader().Stream)
Helper function implementations are below:
public PdfDocument PreparePdfDocument(MemoryStream ms)
{
PdfDocument extractedDoc;
var pdfReader = new PdfReader(ms);
var pdf = new PdfDocument(pdfReader);
var doc = new Document(pdf);
var matches = GetNumberWithPages(pdf);
if (matches.Count > 0)
{
var pageRange = matches
.Where(x => x.Number == "125")
.Select(x => Convert.ToInt32(x.PageIndex))
.ToList();
extractedDoc = SplitPages(pdf, pageRange.First(), pageRange.Last());
}
else
{
// If we couldn't parse the PDF then just take first 4, 3 or 2 pages
try
{
extractedDoc = SplitPages(pdf, 1, 4);
}
catch (ITextException)
{
try
{
extractedDoc = SplitPages(pdf, 1, 3);
}
catch (ITextException)
{
try
{
extractedDoc = SplitPages(pdf, 1, 2);
}
catch (Exception)
{
throw;
}
}
}
}
return extractedDoc;
}
private static List<Match> GetNumberWithPages(PdfDocument doc)
{
var regex = new Regex(@"\s+([0-9]+)\s+(\([0-9]+\/[0-9]+\))\s+Page\s+([0-9])\s+of\s+([0-9]+)");
var matches = new List<Match>();
for (int i = 1; i <= doc.GetNumberOfPages(); i++)
{
var page = doc.GetPage(i);
var text = PdfTextExtractor.GetTextFromPage(page);
if (!string.IsNullOrEmpty(text))
{
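var regexMatch = regex.Match(text);
if (regexMatch.Success)
{
// Convert the regex result into the custom Match type; null means the match was unusable.
var match = EvaluateMatch(regexMatch, i, doc.GetNumberOfPages());
if (match != null)
{
matches.Add(match);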
}
}
}
}
return matches;
}
private static Match? EvaluateMatch(System.Text.RegularExpressions.Match regexMatch, int pageIndex, int totalPages)
{
if (regexMatch.Captures.Count == 1 && regexMatch.Groups.Count == 5)
{
// Map the regex groups onto the custom Match result type.
var match = new Match
{
Number = regexMatch.Groups[1].Value,
Version = regexMatch.Groups[2].Value,
PageIndex = pageIndex.ToString(),
TotalPages = totalPages.ToString()
};
return match;
}
else
{
return null;
}
}
public PdfDocument SplitPages(PdfDocument doc, int startIndex, int endIndex)
{
var outputDocument = CreatePdfDocument();
doc.CopyPagesTo(startIndex, endIndex, outputDocument);
return outputDocument;
}
public PdfDocument CreatePdfDocument()
{
var baos = new ByteArrayOutputStream();
var writer = new PdfWriter(baos);
var pdf = new PdfDocument(writer);
return pdf;
}
Answer:
"I'm having difficulty understanding how to obtain the content from a PdfDocument."
You don't!
When you create a PdfDocument to write to, you initialize it with a PdfWriter. That PdfWriter in turn has been initialized to write somewhere. If you want to access the final PDF, you have to close the PdfDocument and look at that somewhere. It is not easy to retrieve that somewhere from the PdfWriter, as it is wrapped in a number of layers therein, so you should keep a reference to it close by.
Thus, your ByteArrayOutputStream usually wouldn't be created hidden in some method CreatePdfDocument but instead in the base method and forwarded to other methods as a parameter. Then you can eventually retrieve its data. If you need to create your ByteArrayOutputStream hidden like that, you can return a Pair of PdfDocument and ByteArrayOutputStream instead of the plain PdfDocument.
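As a rough illustration of keeping the output stream in the calling method, here is a minimal sketch. The class and method names (PdfSplitSketch, ExtractPages) are hypothetical and not part of the question's code; it assumes iText 7 for .NET and uses a plain MemoryStream where the question used ByteArrayOutputStream.

```csharp
using System.IO;
using iText.Kernel.Pdf;

public static class PdfSplitSketch
{
    // Hypothetical helper: copies pages startIndex..endIndex of the source
    // PDF into a new PDF and returns the bytes of that new PDF.
    public static byte[] ExtractPages(Stream source, int startIndex, int endIndex)
    {
        // The "somewhere" the writer writes to; the reference stays in this method.
        var output = new MemoryStream();

        using (var sourceDoc = new PdfDocument(new PdfReader(source)))
        using (var targetDoc = new PdfDocument(new PdfWriter(output)))
        {
            sourceDoc.CopyPagesTo(startIndex, endIndex, targetDoc);
        } // targetDoc is disposed first, which closes it and finishes writing to output.

        // MemoryStream.ToArray can still be called after the stream has been closed.
        return output.ToArray();
    }
}
```

With something like this, the caller could wrap the returned bytes in a new MemoryStream and hand that to a function expecting a Stream, such as AnalyzeDocument in the question.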
By the way, the idea behind this architecture is that iText tries to write as much PDF content as possible to that somewhere output as early as possible and free the memory. This allows it to create large documents without requiring a similarly large amount of memory.
"when I return the stream I cannot access a closed stream"
The ByteArrayOutputStream essentially is a MemoryStream, so in particular you can call ToArray to retrieve the finished PDF even if it's closed.
If you need the ByteArrayOutputStream as a regular stream, simply call PdfWriter.SetCloseStream(false) for your writer, so that closing the PdfDocument does not also close the stream.
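A minimal sketch of that second option, again assuming iText 7 for .NET; the names OpenStreamSketch and ExtractPagesAsStream are hypothetical. Rewinding the stream before returning it is an addition here, made on the assumption that the consumer reads from the current position.

```csharp
using System.IO;
using iText.Kernel.Pdf;

public static class OpenStreamSketch
{
    // Hypothetical helper: returns the extracted pages as a readable stream.
    public static Stream ExtractPagesAsStream(Stream source, int startIndex, int endIndex)
    {
        var output = new MemoryStream();
        var writer = new PdfWriter(output);
        // Keep the underlying stream open when the document is closed.
        writer.SetCloseStream(false);

        using (var sourceDoc = new PdfDocument(new PdfReader(source)))
        using (var targetDoc = new PdfDocument(writer))
        {
            sourceDoc.CopyPagesTo(startIndex, endIndex, targetDoc);
        } // The finished PDF is now in output, which is still open.

        // Rewind so a consumer reading from the current position starts at the beginning.
        output.Position = 0;
        return output;
    }
}
```

The caller then owns the returned stream and is responsible for disposing it once the downstream call has finished.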