I have the OCR'd text of a bibliography of periodicals that contains structured entries. I would like to use the Invisible XML standard to extract and parse the entries.
Example input:
1 2 Hype. 1990?- 1993. Frequency: Bimonthly. River Edge,
NJ. Published by Word Up! Video, Inc. Last issue 66 pages.
Height 28 cm. Line drawings; Photographs (some in color);
Commercial advertising; Table of contents. Previous editor(s):
Marica A. Cole. ISSN 1056-4632. LC card no. sn91-1965.
OCLC no. 23715422. Subject focus and/or Features: Hip hop
culture, Music, Rap music.
WHi v.l, n.6; v.2, n.5 Pam 01-5450 Aug, 1992; Aug, 1993
6561 The Zora Neale Hurston Forum. 1986-. Frequency:
Semiannual. Ruth T. Sheffey, Editor, The Zora Neale Hurston
Forum, P.O. Box 550, Morgan State University, Baltimore,
MD 21239. $15 for individuals and institutions. Telephone:
(301) 444-3435. Published by Zora Neale Hurston Society.
Last issue 69 pages. Last volume 142 pages. Height 23 cm.
Photographs; Table of contents. ISSN 1051-6867. LC card no.
90-649339. OCLC no. 15610848. Subject focus and/or Features: Hurston, Zora Neale, Literature, Literary criticism.
MdBMC v.l, n.l-v.8, n.2 Special Collections Fall, 1986-Spring,
1994
TxDw v.l, n.l; v.2, n.l Woman’s Collection Fall, 1986; Fall, 1987
WU v.l, n.l- AP/Z893/N345 Fall, 1986
6562 Zwanna: Son of Zulu. 1993-. Frequency: Unknown.
Nabile P. Hage, Editor, Zwanna, P.O. Box 38261, Atlanta, GA
30334. Published by Dark Zulu Lies Comics, Inc. Last issue 32
pages. Height 28 cm. Line drawings (some in color); Commercial advertising. OCLC no. 28389961. Subject focus and/or
Features: Comic books, strips, etc.
WHi v.l, n.l Pam 00-305 Apr/May, 1993
Each entry begins with an entry number, followed by one or more whitespace characters, followed by descriptive text split over newlines.
iXML grammar
data: entry .
entry: -#a, entrynum, " " , content .
entrynum: -digit .
digit: ["1"-"9"] .
content: ~[] ; -#a .
This initial attempt at an iXML grammar produces an ambiguous parse (using the CoffeePot iXML processor).
Output
<data xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
<entry>
<entrynum>1</entrynum>
<content>2 Hype. 1990?- 1993. Frequency: Bimonthly. River Edge, NJ. Published by Word Up! Video,
Inc. Last issue 66 pages. Height 28 cm. Line drawings; Photographs (some in color); Commercial
advertising; Table of contents. Previous editor(s): Marica A. Cole. ISSN 1056-4632. LC card
no. sn91-1965. OCLC no. 23715422. Subject focus and/or Features: Hip hop culture, Music, Rap
music. WHi v.l, n.6; v.2, n.5 Pam 01-5450 Aug, 1992; Aug, 1993 6561 The Zora Neale Hurston
Forum. 1986-. Frequency: Semiannual. Ruth T. Sheffey, Editor, The Zora Neale Hurston Forum,
P.O. Box 550, Morgan State University, Baltimore, MD 21239. $15 for individuals and
institutions. Telephone: (301) 444-3435. Published by Zora Neale Hurston Society. Last issue
69 pages. Last volume 142 pages. Height 23 cm. Photographs; Table of contents. ISSN 1051-6867.
LC card no. 90-649339. OCLC no. 15610848. Subject focus and/or Features: Hurston, Zora Neale,
Literature, Literary criticism. MdBMC v.l, n.l-v.8, n.2 Special Collections Fall, 1986-Spring,
1994 TxDw v.l, n.l; v.2, n.l Woman’s Collection Fall, 1986; Fall, 1987 WU v.l, n.l-
AP/Z893/N345 Fall, 1986</content>
</entry>
<entry>
<entrynum>6562</entrynum>
<content>Zwanna: Son of Zulu. 1993-. Frequency: Unknown. Nabile P. Hage, Editor, Zwanna, P.O.
Box 38261, Atlanta, GA 30334. Published by Dark Zulu Lies Comics, Inc. Last issue 32 pages.
Height 28 cm. Line drawings (some in color); Commercial advertising. OCLC no. 28389961.
Subject focus and/or Features: Comic books, strips, etc. WHi v.l, n.l Pam 00-305 Apr/May, 1993
</content>
</entry>
</data>
As a start, I would like to understand how to chunk the entries, and then begin to parse the content: e.g., each entry number is followed by one or more spaces, then an alphanumeric title, which is followed by period, etc.
Answer:
"Maybe." One of iXML's great strengths is that it can handle ambiguity. That makes grammars much, much easier to write. And if the ambiguous choices are equally valid or if you don't care which ambiguous choice is selected, then it works really well.
For bibliographic data, I suspect that some choices are more valid than others and you do care which choice is selected, which makes it harder. I'm also betting that there's a lot of ambiguity because OCR is imperfect.
I don't think a single iXML grammar is going to parse the input and produce exactly the output you want, but it might form a useful part of some broader strategy. I'd start by trying to divide the bibliography up into separate entries, limiting the grammar to just a single entry. Then I might see if I could work out different classes of entry (book, magazine, journal, etc.) and maybe have different grammars for each.
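As a rough illustration of that first step, here is a minimal Python sketch that splits the text into entries before any grammar is applied. It assumes an entry starts on a line that begins with digits followed by at least one space; the names `ENTRY_START` and `split_entries` are made up for this example, and the sample is an abridged copy of the input in the question. The heuristic is fragile: a wrapped line such as "32 pages." would be mistaken for a new entry.

```python
import re

# Assumption: an entry starts on a line that begins with digits, then at
# least one space, then more text. Continuation lines such as "1994" or
# "30334. Published by ..." do not fit that shape and stay with the
# current entry.
ENTRY_START = re.compile(r"^([0-9]+) +(\S.*)$")


def split_entries(text):
    """Return (entry_number, content) pairs, with wrapped lines rejoined."""
    entries = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        match = ENTRY_START.match(line)
        if match:
            # New entry: keep the number and start collecting its text.
            entries.append((match.group(1), [match.group(2)]))
        elif entries:
            # Continuation line: append to the most recent entry.
            entries[-1][1].append(line)
    return [(number, " ".join(parts)) for number, parts in entries]


# Abridged sample taken from the example input.
sample = """\
6561 The Zora Neale Hurston Forum. 1986-. Frequency:
Semiannual. Ruth T. Sheffey, Editor, The Zora Neale Hurston
MdBMC v.l, n.l-v.8, n.2 Special Collections Fall, 1986-Spring,
1994
6562 Zwanna: Son of Zulu. 1993-. Frequency: Unknown.
Nabile P. Hage, Editor, Zwanna, P.O. Box 38261, Atlanta, GA
30334. Published by Dark Zulu Lies Comics, Inc. Last issue 32
pages. Height 28 cm.
"""

for number, content in split_entries(sample):
    print(number, "|", content[:40])
```

Each resulting content string could then be handed to a smaller grammar that only has to describe a single entry.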
Good luck!
Answer:
Your grammar is highly ambiguous because "~[]" matches any character, including #a, so every newline can be parsed in more than one way and there are dozens of ways to parse the input. You have to determine how to identify the start of an entry unambiguously. If the rule is 'it starts with a number', then you also have to prevent lines that begin with a number from being recognised as 'content', for example:
content: line .
line: ~["0"-"9"], ~[#a]*, #a.
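To see how that `line` rule interacts with the sample, here is a small Python check (not iXML; `fits_line_rule` and `sample_lines` are names invented for this illustration). It mirrors only the first term of the rule, that a content line must not begin with a digit, and lists the sample lines that fail it.

```python
import re


# Mirrors the first term of `line: ~["0"-"9"], ~[#a]*, #a.`:
# a line counts as content only if its first character is not 0-9.
def fits_line_rule(line):
    return re.match(r"[0-9]", line) is None


# Lines copied from the example input in the question.
sample_lines = [
    "6562 Zwanna: Son of Zulu. 1993-. Frequency: Unknown.",
    "Nabile P. Hage, Editor, Zwanna, P.O. Box 38261, Atlanta, GA",
    "30334. Published by Dark Zulu Lies Comics, Inc. Last issue 32",
    "MdBMC v.l, n.l-v.8, n.2 Special Collections Fall, 1986-Spring,",
    "1994",
]

rejected = [line for line in sample_lines if not fits_line_rule(line)]
for line in rejected:
    print("not matched by line rule:", line)
```

Besides the real entry line, the check also flags two wrapped continuation lines ("30334. Published ..." and "1994"), so with this data the entry-start test probably needs to be stricter than 'begins with a digit', for instance digits followed by a space.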
If you want to track down ambiguity, you can try my implementation (https://homepages.cwi.nl/~steven/ixml/tutorial/run.html), which is much slower than Norm's CoffeePot but gives potentially useful information about the source of the ambiguity.