iTextSharp extraction cyrillic characters

Time:06-08

In my project I need to read a PDF document that contains Ukrainian and Russian characters. PdfReader reads all the other characters in the PDF, but the Cyrillic characters are missing from the output. I tried converting the encoding, but it did not help. What can I do about these characters?

    public static string GetText(string filePath)
    {
        ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
        StringBuilder text = new StringBuilder();
        if (File.Exists(filePath))
        {
            PdfReader pdfReader = new PdfReader(filePath);
            // Page numbers are 1-based, so the loop must include the last page
            for (int i = 1; i <= pdfReader.NumberOfPages; i++)
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string thePage = PdfTextExtractor.GetTextFromPage(pdfReader, i, strategy);
                text.Append(System.Environment.NewLine);
                // Attempted encoding fix (did not help)
                thePage = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(thePage)));
                text.Append(thePage);
            }
            pdfReader.Close();
        }
        return text.ToString();
    }

Answer:

iTextSharp is an outdated product that is no longer supported, so it may well have problems with text extraction. Here is a simple example of how text extraction works in iText 7 (the code is in Java, but the C# API is essentially the same).

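    import com.itextpdf.kernel.pdf.PdfDocument;
    import com.itextpdf.kernel.pdf.PdfPage;
    import com.itextpdf.kernel.pdf.PdfReader;
    import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;
    import com.itextpdf.kernel.pdf.canvas.parser.listener.ITextExtractionStrategy;
    import com.itextpdf.kernel.pdf.canvas.parser.listener.SimpleTextExtractionStrategy;

    String filePath = "test.pdf";
    StringBuilder text = new StringBuilder();
    PdfReader pdfReader = new PdfReader(filePath);
    PdfDocument pdfDocument = new PdfDocument(pdfReader);
    // Page numbers are 1-based
    for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++) {
        PdfPage page = pdfDocument.getPage(i);
        // Use a fresh strategy per page so text does not accumulate across pages
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        String thePage = PdfTextExtractor.getTextFromPage(page, strategy);
        text.append(thePage);
    }
    // Closing the document also closes the underlying reader
    pdfDocument.close();
    System.out.print(text);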

The code is about the same as in your example, but the text extracts correctly.
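
A side note on the encoding conversion in the question: a string that has already been extracted cannot be repaired by pushing it through a byte encoding and back. The following is only a rough Java sketch of that round trip, using the standard library alone. The class name EncodingRoundTrip and the method roundTrip are hypothetical, and it is an assumption that the .NET call in the question behaves analogously. With a Cyrillic-capable "default" charset the step changes nothing, and with a Western one it turns every Cyrillic letter into '?'.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingRoundTrip {

    // Mimics the question's step: string -> bytes in a "default" charset
    // -> transcoded to UTF-8 bytes -> decoded back into a string.
    static String roundTrip(String text, Charset defaultCharset) {
        // Characters the charset cannot represent are replaced (with '?' here)
        byte[] legacyBytes = text.getBytes(defaultCharset);
        // Rough equivalent of Encoding.Convert(default, UTF8, bytes)
        byte[] utf8Bytes = new String(legacyBytes, defaultCharset)
                .getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Привіт", written with Unicode escapes to avoid source-encoding issues
        String cyrillic = "\u041f\u0440\u0438\u0432\u0456\u0442";

        // Cyrillic code page: the round trip is a no-op
        String kept = roundTrip(cyrillic, Charset.forName("windows-1251"));
        System.out.println(kept.equals(cyrillic));

        // Western code page: the Cyrillic letters are lost
        String lost = roundTrip(cyrillic, StandardCharsets.ISO_8859_1);
        System.out.println(lost);
    }
}
```

In other words, the conversion line can at best leave the page text untouched, so it is probably safe to drop it and look at the extraction itself.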
