How to convert Tesseract output (hOCR) into a plain text file with FOP (produces empty output)

Time:05-28

Given: My tutor gave me a task that involves output generated by Tesseract in an hOCR file. The div elements with certain values of the class attribute and the span elements with certain values of the class attribute are always nested in each other in a fixed order (see below). At the lowest level of nesting, i.e. inside the <span class='ocrx_word' ...> tags, are the elementary parts of the recognized text (words).

Structure (nesting level, element, value of the class attribute):

Level 1: div element, class ocr_page

Level 2: div element, class ocr_carea

Level 3: span element, class ocr_par (a p element in the sample below)

Level 4: span element, class ocr_line

Level 5: span element, class ocrx_word

Task: Using XSLT, convert such a structure into plain text, preserving word and paragraph boundaries but not line boundaries. That is, line feed characters (code 10) in the resulting text should appear only at the ends of paragraphs. All other breaks should be removed and must not be reintroduced. Paragraphs of poetry (where line breaks still matter) can be treated as ordinary paragraphs.

My example input, text.xml:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
  <meta name='ocr-system' content='tesseract 5.0.0-beta-20210815-8-g7cfcf' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
 </head>
 <body>
<div class='ocr_page' id='page_1' title='image "20210818.220522.part.tiff"; bbox 0 0 1422 1437; ppageno 0'>
   <div class='ocr_carea' id='block_1_4' title="bbox 68 924 1362 1432">
    <p class='ocr_par' id='par_1_8' lang='rus' title="bbox 68 924 1361 1076">
     <span class='ocr_line' id='line_1_17' title="bbox 68 924 1361 972; baseline -0.001 -7; x_size 48; x_descenders 7; x_ascenders 12">
      <span class='ocrx_word' id='word_1_96' title='bbox 68 938 123 965; x_wconf 95'>Привет</span>
      <span class='ocrx_word' id='word_1_97' title='bbox 153 926 353 966; x_wconf 96'>Мир!</span>
      <span class='ocrx_word' id='word_1_98' title='bbox 389 924 565 965; x_wconf 94'>Это</span>
      <span class='ocrx_word' id='word_1_99' title='bbox 599 938 625 965; x_wconf 95'>я -</span>
      <span class='ocrx_word' id='word_1_100' title='bbox 659 936 710 965; x_wconf 94'>обычный</span>
      <span class='ocrx_word' id='word_1_101' title='bbox 745 926 992 972; x_wconf 96'>неработающий</span>
      <span class='ocrx_word' id='word_1_102' title='bbox 1025 925 1141 965; x_wconf 96'>текст</span>
      <span class='ocrx_word' id='word_1_103' title='bbox 1178 935 1361 971; x_wconf 96'>или рыба</span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>

My attempt (an XSL-FO stylesheet, text2fo.xsl):

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <fo:root>
      <fo:layout-master-set>
        <fo:simple-page-master master-name="A4-portrait"
              page-height="29.7cm" page-width="21.0cm" margin="2cm">
          <fo:region-body/>
        </fo:simple-page-master>
      </fo:layout-master-set>
      <fo:page-sequence master-reference="A4-portrait">
        <fo:flow flow-name="xsl-region-body">
          <fo:block linefeed-treatment="preserve">
                <xsl:for-each select="//div [@class='ocr_page'] /div [@class='ocr_carea'] / p [@class='ocr_par'] / span [@class='ocr_line'] / span [@class='ocrx_word']">
                  <xsl:value-of select="normalize-space(span [@class='ocrx_word'])" disable-output-escaping="yes"/>
                </xsl:for-each>
          </fo:block>
        </fo:flow>
      </fo:page-sequence>
    </fo:root>
  </xsl:template>
</xsl:stylesheet>

I run it using cmd: fop -xml text.xml -xsl text2fo.xsl -txt text.txt

The resulting output: a txt file with empty lines.

The expected output: a txt file containing the text "Привет Мир! Это я, обычный неработающий текст или рыба".

What am I doing wrong? I also tried nested xsl:for-each instructions, with the same result.

Answer:

I see two problems in your attempt:

  1. Your instruction:

    <xsl:for-each select="//div [@class='ocr_page'] /div [@class='ocr_carea'] / p [@class='ocr_par'] / span[@class='ocr_line'] / span [@class='ocrx_word']">
    

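    selects nothing, because your input XML puts all its elements in a namespace (http://www.w3.org/1999/xhtml), while unprefixed names in an XPath 1.0 expression match only elements in no namespace. To solve this, declare a prefix bound to that namespace in the stylesheet and use it in every step of the path.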

  2. Once you have it working, this instruction will put you in the context of span. From this context, your next instruction:

     <xsl:value-of select="normalize-space(span [@class='ocrx_word'])" disable-output-escaping="yes"/>
    

    also selects nothing, because span is not a child of itself. It should be:

    <xsl:value-of select="normalize-space(.)"/>
    

    and I doubt you want to disable output escaping in a stylesheet producing an XML result.
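
Putting both fixes together, a stylesheet along the following lines might work. Treat it as a sketch rather than a tested solution: the prefix x is an arbitrary name chosen here, and matching x:*[@class='ocr_par'] is an assumption meant to cover both the p element in your sample and the span element named in your task description. Each paragraph becomes its own fo:block, and the words inside it are joined with single spaces, so the hOCR line structure is ignored.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:fo="http://www.w3.org/1999/XSL/Format"
      xmlns:x="http://www.w3.org/1999/xhtml"
      exclude-result-prefixes="x">
  <!-- The prefix "x" (an arbitrary choice) is bound to the XHTML namespace
       used by the input document. -->
  <xsl:output method="xml"/>

  <xsl:template match="/">
    <fo:root>
      <fo:layout-master-set>
        <fo:simple-page-master master-name="A4-portrait"
              page-height="29.7cm" page-width="21.0cm" margin="2cm">
          <fo:region-body/>
        </fo:simple-page-master>
      </fo:layout-master-set>
      <fo:page-sequence master-reference="A4-portrait">
        <fo:flow flow-name="xsl-region-body">
          <!-- Assumes the input contains at least one ocr_par element;
               otherwise fo:flow would be left empty. -->
          <xsl:apply-templates select="//x:*[@class='ocr_par']"/>
        </fo:flow>
      </fo:page-sequence>
    </fo:root>
  </xsl:template>

  <!-- One fo:block per recognized paragraph. -->
  <xsl:template match="x:*[@class='ocr_par']">
    <fo:block>
      <!-- The context node is the paragraph; visit its words in document
           order, regardless of which ocr_line they belong to. -->
      <xsl:for-each select=".//x:span[@class='ocrx_word']">
        <!-- Single space between words, none before the first one. -->
        <xsl:if test="position() &gt; 1">
          <xsl:text> </xsl:text>
        </xsl:if>
        <!-- Inside for-each the context node is the word itself. -->
        <xsl:value-of select="normalize-space(.)"/>
      </xsl:for-each>
    </fo:block>
  </xsl:template>
</xsl:stylesheet>
```

You would run it with the same command as before. One caveat: since fop lays the text out on a page before writing the txt file, long paragraphs may still be wrapped at the page width, so check the result against the requirement that line feeds appear only at paragraph ends.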
