I'm sorry if this is a really basic question, but I've been sitting in front of this problem for hours and just can't make it work.
I'm working with the British National Corpus (whose files are in XML format) and I want to extract the attributes of the different persons in those files. The part I'm working with is structured like this:
<bncDoc>
  <teiHeader>
    <profileDesc>
      <particDesc n="C196">
        <person ageGroup="X" xml:id="PS21Y" role="unspecified" sex="f" soc="UU" dialect="NONE" firstLang="EN-GBR" educ="X">
          <persName>j. hammond</persName>
          <occupation>interviewer</occupation>
        </person>
        <person ageGroup="X" xml:id="PS220" role="unspecified" sex="m" soc="UU" dialect="XIS" firstLang="EN-GBR" educ="X">
          <persName>Bhagan</persName>
        </person>
      </particDesc>
    </profileDesc>
  </teiHeader>
</bncDoc>
I'm trying to extract the "id", "sex", "soc", and "ageGroup" attributes of the "person" elements, but I don't know how to handle the "xml:id" attribute. The approach shown below works for "sex", "soc", and "ageGroup", but not for "xml:id". Does anyone know how to make it work? That would help me a lot! :)
for i in root.findall('teiHeader/profileDesc/particDesc/person'):
    tmp = []
    tmp.append(i.get('id'))
    tmp.append(i.get('sex'))
    tmp.append(i.get('soc'))
    tmp.append(i.get('ageGroup'))
CodePudding user response:
It works if you use
i.get('{http://www.w3.org/XML/1998/namespace}id')
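This looks a bit ugly, but it is needed because xml: is a special namespace prefix that is bound to the http://www.w3.org/XML/1998/namespace URI, and ElementTree-style APIs store a namespaced attribute name with that URI in curly braces in front of the local name. See https://www.w3.org/XML/1998/namespace.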
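To put the pieces together, here is a minimal sketch that assumes you are parsing with Python's standard xml.etree.ElementTree and that the file looks like the snippet in the question. The names XML_NS, XML_ID, SAMPLE and extract_persons are made up for this illustration and are not part of any library.

```python
import xml.etree.ElementTree as ET

# The URI that the reserved "xml:" prefix is always bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"
# Attribute key in the {URI}localname form used by ElementTree.
XML_ID = f"{{{XML_NS}}}id"

# Shortened copy of the structure shown in the question.
SAMPLE = """<bncDoc>
  <teiHeader>
    <profileDesc>
      <particDesc n="C196">
        <person ageGroup="X" xml:id="PS21Y" role="unspecified" sex="f" soc="UU" dialect="NONE" firstLang="EN-GBR" educ="X">
          <persName>j. hammond</persName>
          <occupation>interviewer</occupation>
        </person>
        <person ageGroup="X" xml:id="PS220" role="unspecified" sex="m" soc="UU" dialect="XIS" firstLang="EN-GBR" educ="X">
          <persName>Bhagan</persName>
        </person>
      </particDesc>
    </profileDesc>
  </teiHeader>
</bncDoc>"""


def extract_persons(root):
    """Return one [id, sex, soc, ageGroup] list per <person> element."""
    rows = []
    for person in root.findall("teiHeader/profileDesc/particDesc/person"):
        rows.append([
            person.get(XML_ID),      # xml:id needs the namespace-qualified key
            person.get("sex"),       # plain attributes work with their bare name
            person.get("soc"),
            person.get("ageGroup"),
        ])
    return rows


root = ET.fromstring(SAMPLE)
persons = extract_persons(root)
print(persons)
# [['PS21Y', 'f', 'UU', 'X'], ['PS220', 'm', 'UU', 'X']]
```

If you are unsure which key a parsed attribute ended up under, printing person.attrib for one element shows the full names, including the {http://www.w3.org/XML/1998/namespace}id entry.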