With libraries like iTextSharp or iText you can extract metadata from PDF documents via a PdfReader:
using (var reader = new PdfReader(pdfBytes))
{
return reader.Metadata == null ? null : Encoding.UTF8.GetString(reader.Metadata);
}
These kinds of libraries parse the entire PDF document before they can extract the metadata. In my case this leads to high usage of system resources, since we get many requests per second involving large PDFs.
Is there a way to extract the metadata from the PDF without completely loading it in memory first?
CodePudding user response:
iText 5.x also allows partial reading of PDFs; the call merely looks a bit more complicated.
Instead of
using (var reader = new PdfReader(pdfBytes))
use
using (var reader = new PdfReader(new RandomAccessFileOrArray(pdfBytes), null, true))
where the final true requests partial reading.
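Combining this with the extraction code from the question gives something like the following. This is only a sketch: the class and method names (PdfMetadataReader, ReadXmpMetadata) are made up for illustration, and it assumes an iTextSharp 5.x version that has the three-argument constructor shown above. Depending on the version, the byte-array constructor of RandomAccessFileOrArray may be flagged as obsolete by the compiler.

```csharp
using System.Text;
using iTextSharp.text.pdf;

// Hypothetical helper class; the name is illustrative and not part of iTextSharp.
public static class PdfMetadataReader
{
    // Returns the XMP metadata packet as a string, or null if the PDF has none.
    public static string ReadXmpMetadata(byte[] pdfBytes)
    {
        // The final 'true' requests partial reading, as described above.
        using (var reader = new PdfReader(new RandomAccessFileOrArray(pdfBytes), null, true))
        {
            // Read the property once and reuse the result.
            byte[] xmp = reader.Metadata;
            return xmp == null ? null : Encoding.UTF8.GetString(xmp);
        }
    }
}
```

Note that the byte array itself is still held in memory here; partial reading only limits how much of the document the reader parses up front.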
CodePudding user response:
With PDF4NET you can extract the XMP metadata without loading the entire document in memory:
// This does a minimal parsing of the PDF file and loads
// only a few objects from the file
PDFFile pdfFile = new PDFFile(new MemoryStream(pdfBytes));
string xmpMetadata = pdfFile.ExtractXmpMetadata();
Update 1: code changed to load the file from a byte array.
Disclaimer: I work for the company that develops the PDF4NET library.