I have huge corpora that I am parsing with lxml, so I am using iterparse, which makes it easy to read XML on the fly. By using iterparse(fh, tag="your_tag") we can efficiently iterate over nodes in large files.
I wish to do some XPath matching for each major tag in the file, in my case alpino_ds. For each alpino_ds node I want to check whether some given XPath matches. I found, however, that an XPath can appear to match on an element when in reality it is matching something elsewhere in the document: not the alpino_ds element currently being iterated over, but a later one.
I am puzzled as to why this happens: in the example below, I would expect only one match (in the last alpino_ds node), but as you can see it matches three times, and the matched XPath result is the same item in all three cases (part of the last node)!
from io import BytesIO
import lxml.etree as ET
xml = """<treebank>
<alpino_ds version="1.3" id="WR-P-P-D-0000000006.p.34.s.1">
<node begin="0" cat="top" end="4" id="0" rel="top">
<node begin="0" cat="du" end="3" id="1" rel="--">
<node begin="0" conjtype="neven" end="1" frame="complementizer(root)" id="2" lcat="du" lemma="en" pos="comp" postag="VG(neven)" pt="vg" rel="dlink" root="en" sc="root" sense="en" word="en"/>
<node begin="1" cat="np" end="3" id="3" rel="nucl">
<node begin="1" end="2" frame="number(hoofd(sg_num))" id="4" infl="sg_num" lcat="detp" lemma="een" numtype="hoofd" pos="num" positie="vrij" postag="TW(hoofd,vrij)" pt="tw" rel="det" root="één" sense="één" special="hoofd" word="één"/>
<node begin="2" end="3" frame="noun(de,count,sg)" gen="de" genus="zijd" getal="ev" graad="basis" id="5" lcat="np" lemma="printer" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,basis,zijd,stan)" pt="n" rel="hd" root="printer" sense="printer" word="printer"/>
</node>
</node>
<node begin="3" end="4" frame="punct(punt)" id="6" lcat="punct" lemma="." pos="punct" postag="LET()" pt="let" rel="--" root="." sense="." special="punt" word="."/>
</node>
<sentence>en één printer .</sentence>
<comments>
<comment>Q#WR-P-P-D-0000000006.p.34.s.1|en één printer .|1|1|1.2960516563900006</comment>
</comments>
</alpino_ds>
<alpino_ds version="1.3" id="WR-P-P-D-0000000006.p.34.s.2">
<node begin="0" cat="top" end="20" id="0" rel="top">
<node begin="0" cat="smain" end="19" id="1" rel="--">
<node begin="0" cat="np" end="2" id="2" index="1" rel="su">
<node begin="0" end="1" frame="determiner(de,nwh,nmod,pro,nparg)" getal="getal" id="3" infl="de" lcat="detp" lemma="die" naamval="stan" pdtype="pron" persoon="3" pos="det" postag="VNW(aanw,pron,stan,vol,3,getal)" pt="vnw" rel="det" root="die" sense="die" status="vol" vwtype="aanw" wh="nwh" word="Die"/>
<node begin="1" end="2" frame="noun(de,count,sg)" gen="de" genus="zijd" getal="ev" graad="basis" id="4" lcat="np" lemma="printer" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,basis,zijd,stan)" pt="n" rel="hd" root="printer" sense="printer" word="printer"/>
</node>
<node begin="2" end="3" frame="verb(unacc,sg3,passive)" id="5" infl="sg3" lcat="smain" lemma="worden" pos="verb" postag="WW(pv,tgw,met-t)" pt="ww" pvagr="met-t" pvtijd="tgw" rel="hd" root="word" sc="passive" sense="word" tense="present" word="wordt" wvorm="pv"/>
<node begin="0" cat="ppart" end="19" id="6" rel="vc">
<node begin="0" end="2" id="7" index="1" rel="obj1"/>
<node begin="3" buiging="zonder" end="4" frame="verb(hebben,psp,np_pc_pp(voor))" id="8" infl="psp" lcat="ppart" lemma="gebruiken" pos="verb" positie="vrij" postag="WW(vd,vrij,zonder)" pt="ww" rel="hd" root="gebruik" sc="np_pc_pp(voor)" sense="gebruik-voor" word="gebruikt" wvorm="vd"/>
<node begin="4" cat="pp" end="19" id="9" rel="pc">
<node begin="4" end="5" frame="preposition(voor,[aan,door,uit,[in,de,plaats]])" id="10" lcat="pp" lemma="voor" pos="prep" postag="VZ(init)" pt="vz" rel="hd" root="voor" sense="voor" vztype="init" word="voor"/>
<node begin="5" cat="np" end="19" id="11" rel="obj1">
<node begin="5" end="6" frame="determiner(het,nwh,nmod,pro,nparg,wkpro)" id="12" infl="het" lcat="detp" lemma="het" lwtype="bep" naamval="stan" npagr="evon" pos="det" postag="LID(bep,stan,evon)" pt="lid" rel="det" root="het" sense="het" wh="nwh" word="het"/>
<node begin="6" end="7" frame="v_noun(intransitive)" getal="mv" graad="basis" id="13" lcat="np" lemma="druk" ntype="soort" pos="verb" postag="N(soort,mv,basis)" pt="n" rel="hd" root="druk" sc="intransitive" sense="druk" special="v_noun" word="drukken"/>
<node begin="7" cat="pp" end="19" id="14" rel="mod">
<node begin="7" end="8" frame="preposition(van,[af,uit,vandaan,[af,aan]])" id="15" lcat="pp" lemma="van" pos="prep" postag="VZ(init)" pt="vz" rel="hd" root="van" sense="van" vztype="init" word="van"/>
<node begin="8" cat="np" end="19" id="16" rel="obj1">
<node begin="8" end="9" frame="determiner(de)" id="17" infl="de" lcat="detp" lemma="de" lwtype="bep" naamval="stan" npagr="rest" pos="det" postag="LID(bep,stan,rest)" pt="lid" rel="det" root="de" sense="de" word="de"/>
<node begin="9" end="10" frame="noun(de,count,sg)" gen="de" genus="zijd" getal="ev" graad="basis" id="18" lcat="np" lemma="tekst" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,basis,zijd,stan)" pt="n" rel="hd" root="tekst" sense="tekst" word="tekst"/>
<node begin="10" cat="pp" end="19" id="19" rel="mod">
<node begin="10" end="11" frame="preposition(van,[af,uit,vandaan,[af,aan]])" id="20" lcat="pp" lemma="van" pos="prep" postag="VZ(init)" pt="vz" rel="hd" root="van" sense="van" vztype="init" word="van"/>
<node begin="11" cat="conj" end="19" id="21" rel="obj1">
<node begin="14" conjtype="neven" end="15" frame="conj(en)" id="22" lcat="vg" lemma="en" pos="vg" postag="VG(neven)" pt="vg" rel="crd" root="en" sense="en" word="en"/>
<node begin="11" cat="np" end="19" id="23" rel="cnj">
<node begin="11" end="12" frame="modal_adverb" id="24" index="2" lcat="advp" lemma="bijvoorbeeld" pos="adv" postag="BW()" pt="bw" rel="mod" root="bijvoorbeeld" sc="modal" sense="bijvoorbeeld" word="bijvoorbeeld"/>
<node begin="12" end="13" frame="determiner(de)" id="25" index="3" infl="de" lcat="detp" lemma="de" lwtype="bep" naamval="stan" npagr="rest" pos="det" postag="LID(bep,stan,rest)" pt="lid" rel="det" root="de" sense="de" word="de"/>
<node begin="13" end="14" frame="noun(de,count,sg)" gen="de" genus="zijd" getal="ev" graad="basis" id="26" lcat="np" lemma="naam" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,basis,zijd,stan)" pt="n" rel="hd" root="naam" sense="naam" word="naam"/>
<node begin="16" cat="pp" end="19" id="27" index="4" rel="mod">
<node begin="16" end="17" frame="preposition(op,[af,na])" id="28" lcat="pp" lemma="op" pos="prep" postag="VZ(init)" pt="vz" rel="hd" root="op" sense="op" vztype="init" word="op"/>
<node begin="17" cat="np" end="19" id="29" rel="obj1">
<node begin="17" end="18" frame="determiner(de)" id="30" infl="de" lcat="detp" lemma="de" lwtype="bep" naamval="stan" npagr="rest" pos="det" postag="LID(bep,stan,rest)" pt="lid" rel="det" root="de" sense="de" word="de"/>
<node begin="18" end="19" frame="noun(de,count,sg)" gen="de" genus="zijd" getal="ev" graad="basis" id="31" lcat="np" lemma="cd" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,basis,zijd,stan)" pt="n" rel="hd" root="cd" sense="cd" word="cd"/>
</node>
</node>
</node>
<node begin="11" cat="np" end="19" id="32" rel="cnj">
<node begin="11" end="12" id="33" index="2" rel="mod"/>
<node begin="12" end="13" id="34" index="3" rel="det"/>
<node begin="15" end="16" frame="noun(het,count,pl)" gen="het" getal="mv" graad="basis" id="35" lcat="np" lemma="adresgegevens" ntype="soort" num="pl" pos="noun" postag="N(soort,mv,basis)" pt="n" rel="hd" root="adres_gegeven" sense="adres_gegeven" word="adresgegevens"/>
<node begin="16" end="19" id="36" index="4" rel="mod"/>
</node>
</node>
</node>
</node>
</node>
</node>
</node>
</node>
</node>
<node begin="19" end="20" frame="punct(punt)" id="37" lcat="punct" lemma="." pos="punct" postag="LET()" pt="let" rel="--" root="." sense="." special="punt" word="."/>
</node>
<sentence>Die printer wordt gebruikt voor het drukken van de tekst van bijvoorbeeld de naam en adresgegevens op de cd .</sentence>
<comments>
<comment>Q#WR-P-P-D-0000000006.p.34.s.2|Die printer wordt gebruikt voor het drukken van de tekst van bijvoorbeeld de naam en adresgegevens op de cd .|1|1|0.11022457209000547</comment>
</comments>
</alpino_ds>
<alpino_ds version="1.3" id="WR-P-P-D-0000000006.p.34.s.3">
<node begin="0" cat="top" end="25" id="0" rel="top">
<node begin="15" end="16" frame="punct(komma)" id="1" lcat="punct" lemma="," pos="punct" postag="LET()" pt="let" rel="--" root="," sense="," special="komma" word=","/>
<node begin="22" end="23" frame="punct(komma)" id="2" lcat="punct" lemma="," pos="punct" postag="LET()" pt="let" rel="--" root="," sense="," special="komma" word=","/>
<node begin="0" cat="smain" end="25" id="3" rel="--">
<node begin="0" cat="np" end="2" id="4" rel="su">
<node begin="0" end="1" frame="determiner(een)" id="5" infl="een" lcat="detp" lemma="een" lwtype="onbep" naamval="stan" npagr="agr" pos="det" postag="LID(onbep,stan,agr)" pt="lid" rel="det" root="een" sense="een" word="Een"/>
<node begin="1" end="2" frame="noun(het,count,sg)" gen="het" genus="onz" getal="ev" graad="dim" id="6" lcat="np" lemma="robot-arm" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,dim,onz,stan)" pt="n" rel="hd" root="robot_arm_DIM" sense="robot_arm_DIM" word="robot-armpje"/>
</node>
<node begin="2" end="3" frame="verb(hebben,sg3,er_pp_sbar(voor))" id="7" infl="sg3" lcat="smain" lemma="zorgen" pos="verb" postag="WW(pv,tgw,met-t)" pt="ww" pvagr="met-t" pvtijd="tgw" rel="hd" root="zorg" sc="er_pp_sbar(voor)" sense="zorg-voor" tense="present" word="zorgt" wvorm="pv"/>
<node begin="3" cat="pp" end="25" id="8" rel="pc">
<node begin="3" end="4" frame="er_adverb(voor)" id="9" lcat="pp" lemma="ervoor" pos="pp" postag="BW()" pt="bw" rel="hd" root="ervoor" sense="ervoor" special="er" word="ervoor"/>
<node begin="4" cat="cp" end="25" id="10" rel="vc">
<node begin="4" conjtype="onder" end="5" frame="complementizer(dat)" id="11" lcat="cp" lemma="dat" pos="comp" postag="VG(onder)" pt="vg" rel="cmp" root="dat" sc="dat" sense="dat" word="dat"/>
<node begin="5" cat="conj" end="25" id="12" rel="body">
<node begin="5" cat="ssub" end="13" id="13" rel="cnj">
<node begin="5" cat="np" end="7" id="14" index="1" rel="su">
<node begin="5" end="6" frame="determiner(de)" id="15" infl="de" lcat="detp" lemma="de" lwtype="bep" naamval="stan" npagr="rest" pos="det" postag="LID(bep,stan,rest)" pt="lid" rel="det" root="de" sense="de" word="de"/>
<node begin="6" end="7" frame="noun(de,count,pl)" gen="de" getal="mv" graad="basis" id="16" lcat="np" lemma="brander" ntype="soort" num="pl" pos="noun" postag="N(soort,mv,basis)" pt="n" rel="hd" root="brander" sense="brander" word="branders"/>
</node>
<node begin="9" end="10" frame="verb(unacc,pl,passive)" id="17" infl="pl" lcat="ssub" lemma="worden" pos="verb" postag="WW(pv,tgw,mv)" pt="ww" pvagr="mv" pvtijd="tgw" rel="hd" root="word" sc="passive" sense="word" tense="present" word="worden" wvorm="pv"/>
<node begin="5" cat="ppart" end="13" id="18" rel="vc">
<node begin="5" end="7" id="19" index="1" rel="obj1"/>
<node begin="7" end="8" frame="adverb" id="20" lcat="advp" lemma="steeds" pos="adv" postag="BW()" pt="bw" rel="mod" root="steeds" sense="steeds" word="steeds"/>
<node begin="8" buiging="zonder" end="9" frame="verb(hebben,psp,np_pc_pp(met))" id="21" infl="psp" lcat="ppart" lemma="laden" pos="verb" positie="vrij" postag="WW(vd,vrij,zonder)" pt="ww" rel="hd" root="laad" sc="np_pc_pp(met)" sense="laad-met" word="geladen" wvorm="vd"/>
<node begin="10" cat="pp" end="13" id="22" rel="pc">
<node begin="10" end="11" frame="preposition(met,[mee,[en,al]])" id="23" lcat="pp" lemma="met" pos="prep" postag="VZ(init)" pt="vz" rel="hd" root="met" sense="met" vztype="init" word="met"/>
<node begin="11" cat="np" end="13" id="24" rel="obj1">
<node aform="base" begin="11" buiging="met-e" end="12" frame="adjective(e)" graad="basis" id="25" infl="e" lcat="ap" lemma="leeg" naamval="stan" pos="adj" positie="prenom" postag="ADJ(prenom,basis,met-e,stan)" pt="adj" rel="mod" root="leeg" sense="leeg" vform="adj" word="lege"/>
<node begin="12" end="13" frame="noun(de,count,pl)" gen="de" getal="mv" graad="basis" id="26" lcat="np" lemma="cd" ntype="soort" num="pl" pos="noun" postag="N(soort,mv,basis)" pt="n" rel="hd" root="cd" sense="cd" word="cd's"/>
</node>
</node>
</node>
</node>
<node begin="13" conjtype="neven" end="14" frame="conj(en)" id="27" lcat="vg" lemma="en" pos="vg" postag="VG(neven)" pt="vg" rel="crd" root="en" sense="en" word="en"/>
<node begin="14" cat="ssub" end="25" id="28" rel="cnj">
<node begin="14" end="15" frame="determiner(het,nwh,nmod,pro,nparg)" getal="ev" id="29" infl="het" lcat="np" lemma="dat" naamval="stan" pdtype="pron" persoon="3o" pos="det" postag="VNW(aanw,pron,stan,vol,3o,ev)" pt="vnw" rel="su" root="dat" sense="dat" status="vol" vwtype="aanw" wh="nwh" word="dat"/>
<node begin="16" cat="cp" end="22" id="30" rel="mod">
<node begin="16" conjtype="onder" end="17" frame="complementizer(als)" id="31" lcat="cp" lemma="als" pos="comp" postag="VG(onder)" pt="vg" rel="cmp" root="als" sc="als" sense="als" word="als"/>
<node begin="17" cat="ssub" end="22" id="32" rel="body">
<node begin="17" case="both" def="def" end="18" frame="pronoun(nwh,thi,both,de,both,def,wkpro)" gen="de" getal="mv" id="33" index="2" lcat="np" lemma="ze" naamval="stan" num="both" pdtype="pron" per="thi" persoon="3" pos="pron" postag="VNW(pers,pron,stan,red,3,mv)" pt="vnw" rel="su" root="ze" sense="ze" special="wkpro" status="red" vwtype="pers" wh="nwh" word="ze"/>
<node begin="19" end="20" frame="verb(unacc,pl,passive)" id="34" infl="pl" lcat="ssub" lemma="zijn" pos="verb" postag="WW(pv,tgw,mv)" pt="ww" pvagr="mv" pvtijd="tgw" rel="hd" root="ben" sc="passive" sense="ben" tense="present" word="zijn" wvorm="pv"/>
<node begin="17" cat="ppart" end="22" id="35" rel="vc">
<node begin="17" end="18" id="36" index="2" rel="obj1"/>
<node begin="18" end="19" frame="verb(hebben,psp,np_pc_pp(van))" id="37" infl="psp" lcat="ppart" lemma="voorzien" pos="verb" postag="WW(pv,tgw,mv)" pt="ww" pvagr="mv" pvtijd="tgw" rel="hd" root="voorzie" sc="np_pc_pp(van)" sense="voorzie-van" word="voorzien" wvorm="pv"/>
<node begin="20" cat="pp" end="22" id="38" rel="pc">
<node begin="20" end="21" frame="preposition(van,[af,uit,vandaan,[af,aan]])" id="39" lcat="pp" lemma="van" pos="prep" postag="VZ(init)" pt="vz" rel="hd" root="van" sense="van" vztype="init" word="van"/>
<node begin="21" end="22" frame="noun(de,mass,sg)" gen="de" genus="zijd" getal="ev" graad="basis" id="40" lcat="np" lemma="audio" naamval="stan" ntype="soort" num="sg" pos="noun" postag="N(soort,ev,basis,zijd,stan)" pt="n" rel="obj1" root="audio" sense="audio" word="audio"/>
</node>
</node>
</node>
</node>
<node begin="23" case="both" def="def" end="24" frame="pronoun(nwh,thi,both,de,both,def,wkpro)" gen="de" getal="mv" id="41" lcat="np" lemma="ze" naamval="stan" num="both" pdtype="pron" per="thi" persoon="3" pos="pron" postag="VNW(pers,pron,stan,red,3,mv)" pt="vnw" rel="obj1" root="ze" sense="ze" special="wkpro" status="red" vwtype="pers" wh="nwh" word="ze"/>
<node begin="24" buiging="zonder" end="25" frame="verb(hebben,sg3,transitive)" id="42" infl="sg3" lcat="ssub" lemma="verplaatsen" pos="verb" positie="vrij" postag="WW(vd,vrij,zonder)" pt="ww" rel="hd" root="verplaats" sc="transitive" sense="verplaats" tense="present" word="verplaatst" wvorm="vd"/>
</node>
</node>
</node>
</node>
</node>
</node>
<sentence>Een robot-armpje zorgt ervoor dat de branders steeds geladen worden met lege cd's en dat , als ze voorzien zijn van audio , ze verplaatst</sentence>
<comments>
<comment>Q#WR-P-P-D-0000000006.p.34.s.3|Een robot-armpje zorgt ervoor dat de branders steeds geladen worden met lege cd's en dat , als ze voorzien zijn van audio , ze verplaatst|1|1|-0.4347218970399951</comment>
</comments>
</alpino_ds>
</treebank>
"""
xpath = '//node[@cat="cp" and node[@rel="cmp" and @pt="vg" and number(@begin) < number(../node[@rel="body" and @cat="ssub"]/node[@rel="vc" and @cat="ppart"]/node[@rel="hd" and @pt="ww"]/@begin)] and node[@rel="body" and @cat="ssub" and node[@rel="vc" and @cat="ppart" and node[@rel="hd" and @pt="ww" and number(@begin) < number(../../node[@rel="hd" and @pt="ww"]/@begin)]] and node[@rel="hd" and @pt="ww"]]]'
for _, element in ET.iterparse(BytesIO(str.encode(xml)), tag="alpino_ds", events=("end", )):
    result = element.xpath(xpath)
    if result:
        print("match", ET.tostring(result[0]))
What am I missing here?
CodePudding user response:
With XPath, an absolute path starting with / searches down from the document node (sometimes also called the root node). If you start with e.g. //node, you select node elements anywhere in the document that contains the context node you call the xpath function on, not just those inside that context node.
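To make the difference concrete, here is a minimal sketch on a tiny made-up document (the element names root, a and b are only placeholders, not part of your data). It calls xpath on the first a element with an absolute and with a relative path:

```python
import lxml.etree as ET

# Hypothetical toy document: the first <a> holds one <b>, the second holds two.
root = ET.fromstring("<root><a><b/></a><a><b/><b/></a></root>")
first_a = root[0]

# Absolute path: evaluated from the document node, so it ignores the fact
# that the call was made on first_a and finds every <b> in the document.
absolute = first_a.xpath("//b")

# Relative path: the leading dot anchors the search at first_a itself.
relative = first_a.xpath(".//b")

print(len(absolute), len(relative))  # prints "3 1"
```

The context node only determines which document is searched when the path is absolute; it restricts the search only when the path starts from the dot.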
So to select relative to (inside of) your selected alpino_ds elements, use a path starting with .//node instead of //node.
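Applied to an iterparse loop like yours, the change could look like the sketch below. The sample data is a hypothetical, heavily reduced treebank (only the ids s1 to s3 and a cat attribute), and matching_ids is a helper name introduced just for this illustration; your real XPath would simply get the same leading dot.

```python
from io import BytesIO

import lxml.etree as ET

# Hypothetical miniature treebank: only the last record contains a "cp" node.
SAMPLE = b"""<treebank>
<alpino_ds id="s1"><node cat="np"/></alpino_ds>
<alpino_ds id="s2"><node cat="smain"/></alpino_ds>
<alpino_ds id="s3"><node cat="top"><node cat="cp"/></node></alpino_ds>
</treebank>"""


def matching_ids(data, xpath):
    """Return the ids of the alpino_ds records for which xpath yields a result."""
    ids = []
    for _, element in ET.iterparse(BytesIO(data), tag="alpino_ds", events=("end",)):
        if element.xpath(xpath):
            ids.append(element.get("id"))
    return ids


# Relative path: each record is checked only against its own subtree.
relative_ids = matching_ids(SAMPLE, './/node[@cat="cp"]')

# Absolute path: each record is checked against whatever has been parsed
# into the document so far, which may include later records.
absolute_ids = matching_ids(SAMPLE, '//node[@cat="cp"]')

print("relative:", relative_ids)
print("absolute:", absolute_ids)
```

With the relative path only s3 is reported. What the absolute path reports for the earlier records depends on how much of the input the parser has already read when each event is handled, which is why it is not a reliable per-record test.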