How to obtain pdf byte[] from PdfDocument using itext7?

I'm having difficulty understanding how to obtain the content from a PdfDocument. I've learned from previous questions that PdfDocument flushes the content to optimize working with large documents. If my function returns a new PdfDocument, how do I get the byte[] to pass into my other functions?

Even with PdfDocument.GetReader(), I can't find what I'm looking for.

My use-case is as follows:

1. Get pdf content from an email attachment
2. Pass the pdf to a helper function, which extracts specific pages from the initial attachment
3. Pass the new PdfDocument into a function which calls Azure's Forms Recognizer API to read the fields into an object

To summarize: given a PdfDocument only, how can I get/create a byte[] from it?

Here is my code:

public async Task<BaseResponse> Handle(ReceiveEmailCommand command, CancellationToken cancellationToken) {
  var ms = new MemoryStream(command.attachments.First().Content);
  var extractedDocument = pdfService.PreparePdfDocument(ms);
  var analyzedDocument = await formsRecognizerService.AnalyzeDocument(extractedDocument);
  // Do stuff with the analyzed document...
  var response = await FileWebService.AddAnalyzedDocumentToFileSystem(analyzedDocument);
}

The function AnalyzeDocument expects a Stream parameter. I want to pass something like

new Stream(extractedDocument.GetReader().Stream)

Helper function implementations are below:

        public PdfDocument PreparePdfDocument(MemoryStream ms)
        {
            PdfDocument extractedDoc;
            var pdfReader = new PdfReader(ms);
            var pdf = new PdfDocument(pdfReader);
            var doc = new Document(pdf);

            var matches = GetNumberWithPages(pdf);
            if (matches.Count > 0)
            {
                var pageRange = matches
                    .Where(x => x.Number == "125")
                    .Select(x => Convert.ToInt32(x.PageIndex))
                    .ToList();
                extractedDoc = SplitPages(pdf, pageRange.First(), pageRange.Last());
            }
            else
            {
                // If we couldn't parse the PDF then just take first 4, 3 or 2 pages
                try
                {
                    extractedDoc = SplitPages(pdf, 1, 4);
                }
                catch (ITextException)
                {
                    try
                    {
                        extractedDoc = SplitPages(pdf, 1, 3);
                    }
                    catch (ITextException)
                    {
                        try
                        {
                            extractedDoc = SplitPages(pdf, 1, 2);
                        }
                        catch (Exception)
                        {
                            throw;
                        }
                    }
                }
            }

            return extractedDoc;
        }

        private static List<Match> GetNumberWithPages(PdfDocument doc)
        {
            var regex = new Regex(@"\s+([0-9]+)\s+(\([0-9]+\/[0-9]+\))\s+Page\s+([0-9])\s+of\s+([0-9]+)");
            var matches = new List<Match>();

            for (int i = 1; i <= doc.GetNumberOfPages(); i++)
            {
                var page = doc.GetPage(i);
                var text = PdfTextExtractor.GetTextFromPage(page);

                if (!string.IsNullOrEmpty(text))
                {
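                    var regexMatch = regex.Match(text);
                    if (regexMatch.Success)
                    {
                        // Distinct names: C# does not allow redeclaring 'match' in a nested scope
                        var evaluated = EvaluateMatch(regexMatch, i, doc.GetNumberOfPages());
                        if (evaluated != null)
                        {
                            matches.Add(evaluated);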
                        }
                    }
                }
            }

            return matches;
        }

        // Maps the regex result to the custom Match type (its definition is not shown in the question)
        private static Match? EvaluateMatch(System.Text.RegularExpressions.Match regexMatch, int pageIndex, int totalPages)
        {
            if (regexMatch.Captures.Count == 1 && regexMatch.Groups.Count == 5)
            {
                var result = new Match
                {
                    Number = regexMatch.Groups[1].Value,
                    Version = regexMatch.Groups[2].Value,
                    PageIndex = pageIndex.ToString(),
                    TotalPages = totalPages.ToString()
                };

                return result;
            }
            else
            {
                return null;
            }
        }

        public PdfDocument SplitPages(PdfDocument doc, int startIndex, int endIndex)
        {
            var outputDocument = CreatePdfDocument();
            doc.CopyPagesTo(startIndex, endIndex, outputDocument);

            return outputDocument;
        }

        public PdfDocument CreatePdfDocument()
        {
            var baos = new ByteArrayOutputStream();
            var writer = new PdfWriter(baos);
            var pdf = new PdfDocument(writer);
            
            return pdf;
        }

Answer:

"I'm having difficulty understanding how to obtain the content from a PdfDocument."

You don't!

When you create a PdfDocument to write to, you initialize it with a PdfWriter, and that PdfWriter has in turn been initialized to write somewhere (a stream or a file). If you want to access the final PDF, you have to close the PdfDocument and look at that somewhere. Retrieving it from the PdfWriter is not easy, because it is wrapped in a number of layers there, so you should keep your own reference to it close by.

Your ByteArrayOutputStream therefore usually shouldn't be created out of sight inside a method like CreatePdfDocument. Create it in the calling method instead and pass it to the other methods as a parameter, so that you can retrieve its data at the end. If you do need to create the ByteArrayOutputStream inside such a method, you can return a pair of PdfDocument and ByteArrayOutputStream instead of the plain PdfDocument.
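As a minimal sketch of that idea (not the asker's actual service code), the page extraction could own its output stream and hand back the finished bytes. The class name PdfPageExtractor and the method ExtractPages are hypothetical names chosen for illustration; a plain MemoryStream is used in place of ByteArrayOutputStream, and the sketch assumes the requested page range exists in the source document.

```csharp
using System.IO;
using iText.Kernel.Pdf;

public static class PdfPageExtractor
{
    // Hypothetical helper: copies pages startPage..endPage (1-based, inclusive)
    // from the source PDF bytes into a new PDF and returns the new PDF's bytes.
    public static byte[] ExtractPages(byte[] sourcePdf, int startPage, int endPage)
    {
        // Keep our own reference to the output stream so we can read it later.
        var output = new MemoryStream();

        var source = new PdfDocument(new PdfReader(new MemoryStream(sourcePdf)));
        var target = new PdfDocument(new PdfWriter(output));

        source.CopyPagesTo(startPage, endPage, target);

        // Closing the target finishes the PDF; only now is the output complete.
        target.Close();
        source.Close();

        // ToArray still works although closing the document also closed the stream.
        return output.ToArray();
    }
}
```

The returned byte[] can then be wrapped in a new MemoryStream and passed to a method that expects a Stream, such as AnalyzeDocument in the question.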

By the way, the idea behind this architecture is that iText tries to write as much PDF content as possible to its output as early as possible and then free the memory. This allows it to create large documents without requiring a similarly large amount of memory.

"When I return the stream I cannot access a closed stream"

The ByteArrayOutputStream is essentially a MemoryStream, so you can call ToArray to retrieve the finished PDF even after it has been closed.
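The behaviour relied on here can be seen with a plain .NET MemoryStream and no iText at all. In this small sketch the class ClosedStreamDemo and its method are hypothetical names, and Dispose stands in for the close that happens when the PdfDocument is closed.

```csharp
using System.IO;

public static class ClosedStreamDemo
{
    public static byte[] WriteAndClose()
    {
        var stream = new MemoryStream();
        stream.Write(new byte[] { 1, 2, 3 }, 0, 3);

        // Stands in for the stream being closed when the PdfDocument is closed.
        stream.Dispose();

        // Reading or seeking the stream would now throw ObjectDisposedException,
        // but ToArray is documented to work on a closed MemoryStream.
        return stream.ToArray();
    }
}
```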

If you need the ByteArrayOutputStream as a regular stream that is still open, call SetCloseStream(false) on your PdfWriter so that closing the PdfDocument does not also close the stream.
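A rough sketch of that variant follows. OpenStreamExample and CreatePdfStream are hypothetical names, a MemoryStream is used in place of ByteArrayOutputStream, and a single blank page stands in for the real copied pages. Note that the stream position sits at the end after writing, so it is rewound before being handed to a reader.

```csharp
using System.IO;
using iText.Kernel.Pdf;

public static class OpenStreamExample
{
    // Hypothetical helper: writes a one-page PDF and returns a readable stream over it.
    public static Stream CreatePdfStream()
    {
        var output = new MemoryStream();
        var writer = new PdfWriter(output);
        writer.SetCloseStream(false);   // keep 'output' open when the document closes

        var pdf = new PdfDocument(writer);
        pdf.AddNewPage();               // placeholder content for this sketch
        pdf.Close();                    // finishes the PDF; 'output' remains usable

        output.Position = 0;            // rewind so the consumer reads from the start
        return output;
    }
}
```

The caller is then responsible for disposing the returned stream once the consumer (for example the AnalyzeDocument call from the question) is done with it.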