Extracting records from text using Invisible XML

Time:10-18

I have the OCR'd text of a bibliography of periodicals that contains structured entries. I would like to use the Invisible XML standard to extract and parse the entries.

Example input:


1  2  Hype.  1990?- 1993.  Frequency:  Bimonthly.  River  Edge, 

NJ.  Published  by  Word  Up!  Video,  Inc.  Last  issue  66  pages. 
Height  28  cm.  Line  drawings;  Photographs  (some  in  color); 
Commercial  advertising;  Table  of  contents.  Previous  editor(s): 
Marica  A.  Cole.  ISSN  1056-4632.  LC  card  no.  sn91-1965. 
OCLC  no.  23715422.  Subject  focus  and/or  Features:  Hip  hop 
culture,  Music,  Rap  music. 

WHi  v.l,  n.6;  v.2,  n.5  Pam  01-5450  Aug,  1992;  Aug,  1993 

6561  The  Zora  Neale  Hurston  Forum.  1986-.  Frequency: 
Semiannual.  Ruth  T.  Sheffey,  Editor,  The  Zora  Neale  Hurston 
Forum,  P.O.  Box  550,  Morgan  State  University,  Baltimore, 

MD  21239.  $15  for  individuals  and  institutions.  Telephone: 
(301)  444-3435.  Published  by  Zora  Neale  Hurston  Society. 

Last  issue  69  pages.  Last  volume  142  pages.  Height  23  cm. 
Photographs;  Table  of  contents.  ISSN  1051-6867.  LC  card  no. 
90-649339.  OCLC  no.  15610848.  Subject  focus  and/or  Features:  Hurston,  Zora  Neale,  Literature,  Literary  criticism. 
MdBMC  v.l,  n.l-v.8,  n.2  Special  Collections  Fall,  1986-Spring, 

1994 

TxDw  v.l,  n.l;  v.2,  n.l  Woman’s  Collection  Fall,  1986;  Fall,  1987 
WU  v.l,  n.l-  AP/Z893/N345  Fall,  1986
6562  Zwanna:  Son  of  Zulu.  1993-.  Frequency:  Unknown. 
Nabile  P.  Hage,  Editor,  Zwanna,  P.O.  Box  38261,  Atlanta,  GA 
30334.  Published  by  Dark  Zulu  Lies  Comics,  Inc.  Last  issue  32 
pages.  Height  28  cm.  Line  drawings  (some  in  color);  Commercial  advertising.  OCLC  no.  28389961.  Subject  focus  and/or 
Features:  Comic  books,  strips,  etc. 

WHi  v.l,  n.l  Pam  00-305  Apr/May,  1993 

Each entry begins with an entry number, followed by one or more whitespace characters, followed by descriptive text split over newlines.

iXML grammar

data: entry+ .
entry: -#a, entrynum, " "+, content .
entrynum: -digit+ .
digit: ["1"-"9"] .
content: ~[]+; -#a+ .

This initial attempt at an iXML grammar produces an ambiguous parse (using the CoffeePot iXML processor).

Output

<data xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
  <entry>
    <entrynum>1</entrynum>
    <content>2 Hype. 1990?- 1993. Frequency: Bimonthly. River Edge, NJ. Published by Word Up! Video,
      Inc. Last issue 66 pages. Height 28 cm. Line drawings; Photographs (some in color); Commercial
      advertising; Table of contents. Previous editor(s): Marica A. Cole. ISSN 1056-4632. LC card
      no. sn91-1965. OCLC no. 23715422. Subject focus and/or Features: Hip hop culture, Music, Rap
      music. WHi v.l, n.6; v.2, n.5 Pam 01-5450 Aug, 1992; Aug, 1993 6561 The Zora Neale Hurston
      Forum. 1986-. Frequency: Semiannual. Ruth T. Sheffey, Editor, The Zora Neale Hurston Forum,
      P.O. Box 550, Morgan State University, Baltimore, MD 21239. $15 for individuals and
      institutions. Telephone: (301) 444-3435. Published by Zora Neale Hurston Society. Last issue
      69 pages. Last volume 142 pages. Height 23 cm. Photographs; Table of contents. ISSN 1051-6867.
      LC card no. 90-649339. OCLC no. 15610848. Subject focus and/or Features: Hurston, Zora Neale,
      Literature, Literary criticism. MdBMC v.l, n.l-v.8, n.2 Special Collections Fall, 1986-Spring,
      1994 TxDw v.l, n.l; v.2, n.l Woman’s Collection Fall, 1986; Fall, 1987 WU v.l, n.l-
      AP/Z893/N345 Fall, 1986</content>
  </entry>
  <entry>
    <entrynum>6562</entrynum>
    <content>Zwanna: Son of Zulu. 1993-. Frequency: Unknown. Nabile P. Hage, Editor, Zwanna, P.O.
      Box 38261, Atlanta, GA 30334. Published by Dark Zulu Lies Comics, Inc. Last issue 32 pages.
      Height 28 cm. Line drawings (some in color); Commercial advertising. OCLC no. 28389961.
      Subject focus and/or Features: Comic books, strips, etc. WHi v.l, n.l Pam 00-305 Apr/May, 1993
    </content>
  </entry>
</data>

As a start, I would like to understand how to chunk the entries, and then begin to parse the content: e.g., each entry number is followed by one or more spaces, then an alphanumeric title, which is followed by a period, etc.

CodePudding user response:

"Maybe." One of iXML's great strengths is that it can handle ambiguity. That makes grammars much, much easier to write. And if the ambiguous choices are equally valid or if you don't care which ambiguous choice is selected, then it works really well.

For bibliographic data, I suspect that some choices are more valid than others and you do care which choice is selected, which makes it harder. I'm also betting that there's a lot of ambiguity because OCR is imperfect.

I don't think a single iXML grammar is going to parse the input and produce exactly the output you want, but it might form a useful part of some broader strategy. I'd start by trying to divide the bibliography up into separate entries, limiting the grammar to just a single entry. Then I might see if I could work out different classes of entry (book, magazine, journal, etc.) and maybe have different grammars for each.
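A minimal sketch of that pre-splitting step, written in Python rather than iXML. The names `ENTRY_START` and `split_entries` are hypothetical, and the heuristic is an assumption rather than anything from the question: a line starts a new entry only when it opens with a number followed by whitespace and a capital letter. That keeps continuation lines from the sample such as "1994" or "30334.  Published" inside the current entry, and it rejoins an entry number that OCR split with spaces ("1  2  Hype" becomes 12). It will still misfire on some inputs, so treat it as a starting point.

```python
import re

# Assumed heuristic (not from the original post): an entry begins on a line
# that opens with a number, then whitespace, then a capital letter.
# The digit group tolerates OCR-inserted spaces inside the number.
ENTRY_START = re.compile(r"^(\d+(?:\s+\d+)*)\s+(?=[A-Z])")


def split_entries(text):
    """Group OCR lines into (entry_number, content) pairs."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information here
        match = ENTRY_START.match(line)
        if match:
            # Remove spaces that OCR put inside the entry number.
            number = re.sub(r"\s+", "", match.group(1))
            entries.append([number, [line[match.end():]]])
        elif entries:
            # Any other line continues the most recent entry.
            entries[-1][1].append(line)
    # Join the lines of each entry and collapse runs of whitespace.
    return [(num, " ".join(" ".join(parts).split())) for num, parts in entries]
```

Each resulting content string could then be handed to a grammar that describes a single entry, which avoids having to express the entry boundary in iXML at all.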

Good luck!

CodePudding user response:

Your grammar is very, very ambiguous because "~[]" includes #a, so there are dozens of ways to parse the input. You have to determine how to unambiguously identify the start of an entry. If the rule is 'it starts with a number', then you also have to prevent lines that begin with a number from being recognised as 'content', for example:

content: line+.

line: ~["0"-"9"], ~[#a]*, #a.

If you want to track down ambiguity, you can try my implementation (https://homepages.cwi.nl/~steven/ixml/tutorial/run.html), which is much slower than Norm's but gives potentially useful information about the source of the ambiguity.
