Problem with accessing XML-attributes via xpath

I have some XML that consists of many repetitions of the following structure:

<record>
<header>
<identifier>oai:dnb.de/dnb:reiheO/1254645608</identifier><datestamp>2022-04-01T23:49:32Z</datestamp>
<setspec>dnb:reiheO</setspec>
</header>
<metadata>
<dc xmlns="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dnb="http://d-nb.de/standards/dnbterms" xmlns:tel="http://krait.kb.nl/coop/tel/handbook/telterms.html">
<dc:title>Advantages of Simultaneous In Situ Multispecies Detection for Portable Emission Measurement Applications / Luigi Biondo, Henrik Gerken, Lars Illmann, Tim Steinhaus, Christian Beidl, Andreas Dreizler, Steven Wagner</dc:title>
<dc:creator>Biondo, Luigi [Verfasser]</dc:creator>
<dc:creator>Gerken, Henrik [Verfasser]</dc:creator>
<dc:creator>Illmann, Lars [Verfasser]</dc:creator>
<dc:creator>Steinhaus, Tim [Verfasser]</dc:creator>
<dc:creator>Beidl, Christian [Verfasser]</dc:creator>
<dc:creator>Dreizler, Andreas [Verfasser]</dc:creator>
<dc:creator>Wagner, Steven [Verfasser]</dc:creator>
<dc:publisher>Darmstadt : Universitäts- und Landesbibliothek</dc:publisher>
<dc:date>2022</dc:date>
<dc:language>eng</dc:language>
<dc:identifier xsi:type="tel:URN">urn:nbn:de:tuda-tuprints-210508</dc:identifier>
<dc:identifier xsi:type="tel:URL">http://nbn-resolving.de/urn:nbn:de:tuda-tuprints-210508</dc:identifier>
<dc:identifier xsi:type="tel:URL">http://d-nb.info/1254645608/34</dc:identifier>
<dc:identifier xsi:type="tel:URL">http://tuprints.ulb.tu-darmstadt.de/21050/</dc:identifier>
<dc:identifier xsi:type="dnb:IDN">1254645608</dc:identifier>
<dc:subject>670 Industrielle und handwerkliche Fertigung</dc:subject>
<dc:rights>lizenzfrei</dc:rights>
<dc:type>Online-Ressource</dc:type>
</dc>
</metadata>
</record>

I am able to address most of the elements and extract the information within them, but I fail to get to the specific ones where I have to specify the attribute as well. I think I am struggling with the xpath, but I can't quite figure out why.

If I try this code, I do get a list of elements, but it is empty:

urn = xml.find_all('.//dc:identifier[@xsi:type="tel:URN"]', namespaces=ns)

The same happens for the less specific version:

urn = xml.find_all('.//dc:identifier', namespaces=ns)

However, this code: test1 = xml.find_all("dc:identifier") works and returns a lovely list of elements, but obviously not just of the identifiers specified as urn.

But this: urn = xml.find_all('dc:identifier[@xsi:type="tel:URN"]', namespaces=ns) returns an empty list again. And whatever combination I try, I either get an empty list or it's not working at all.

Does anyone have an idea why this happens or what else I could try? It's so frustrating to get that list of all ids but not manage to select the one I need by its xsi:type...

EDIT:

I am getting the data via OAI and am using requests and BeautifulSoup. I've also tried ElementTree and lxml.

I literally just store the response from the API in a variable called "xml" and then try the following code, of which some works, and some doesn't:

ids = xml.find_all("identifier")[0].text
print(ids)

urn1 = xml.find_all("dc:identifier")
urn1 = urn1[0].text
print(urn1)

test1 = xml.find_all("dc:identifier")
print(test1)

urn2 = xml.find_all(".//dc:identifier")
print(urn2)

urn3 = xml.find_all("dc:identifier[@xsi:type='tel:URN']")
print(urn3)

The first two return the text of the element as expected (I know that the first one is the isolated element in the header, not the first dc:identifier object; this just served testing purposes). The third returns the list of all elements. The last two, on the other hand, return an empty list, and that is the problem, as I need the specific xsi:type element specified in the last attempt.

CodePudding user response：

First, your xml is still not well formed since the xsi prefix hasn't been declared. I made up a declaration below just to make the answer work.

Second, you need to use an xml parser like lxml to use xpath.

So all together:

rec = """[your xml above, but with the first dc element now reading:
<dc xmlns="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="whatever" ...]"""

from lxml import etree
doc = etree.XML(rec)
ns = {"dc":"http://purl.org/dc/elements/1.1/",
      "xsi":"whatever"}
urn2 = doc.xpath("//dc:identifier/text()",namespaces=ns)
urn3 = doc.xpath("//dc:identifier[@xsi:type='tel:URN']/text()",namespaces=ns)

and that should do it
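For anyone without lxml, here is a rough sketch of the same two points using the standard library's ElementTree, which the question mentions having tried. It uses a shortened copy of the record, and it assumes the xsi prefix is bound to the usual XML Schema instance URI. That URI is my assumption; the snippet in the question does not declare it anywhere. The names RECORD and is_well_formed are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: xsi stands for the standard XML Schema instance namespace.
# The record in the question never declares it.
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Shortened copy of the record; {xsi_decl} is where the declaration goes.
RECORD = """<record><metadata>
<dc xmlns="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/"{xsi_decl}>
<dc:identifier xsi:type="tel:URN">urn:nbn:de:tuda-tuprints-210508</dc:identifier>
<dc:identifier xsi:type="tel:URL">http://d-nb.info/1254645608/34</dc:identifier>
<dc:identifier xsi:type="dnb:IDN">1254645608</dc:identifier>
</dc>
</metadata></record>"""

undeclared = RECORD.format(xsi_decl="")
declared = RECORD.format(xsi_decl=' xmlns:xsi="%s"' % XSI)


def is_well_formed(text):
    """Return True if a namespace-aware XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Without the declaration the parser rejects the document (unbound prefix).
print(is_well_formed(undeclared))

# With the declaration, the prefixes used in the path are resolved through
# the mapping passed to findall(), for element and attribute names alike.
ns = {"dc": "http://purl.org/dc/elements/1.1/", "xsi": XSI}
root = ET.fromstring(declared)
all_ids = [el.text for el in root.findall(".//dc:identifier", ns)]
urns = [el.text for el in root.findall(".//dc:identifier[@xsi:type='tel:URN']", ns)]
print(urns)
```

ElementTree only supports a subset of XPath, so there is no /text() step here; the text is read from each matched element instead.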

CodePudding user response：

If you are using BeautifulSoup, the find_all() method accepts the element name, not an XPath, for the first parameter.

Method signature: find_all(name, attrs, recursive, string, limit, **kwargs)

If you want to use XPath, then you may need to look at other libraries.

See: can we use XPath with BeautifulSoup?

CodePudding user response：

As mentioned, working with valid XML would make things much easier, but there is also a way to do this with BeautifulSoup and the lxml parser (which BeautifulSoup picks by default when it is installed). It is not xpath, but it is very close to your attempts.

Going with find() / find_all():

soup.find_all('dc:identifier' , {'xsi:type':'tel:URN'})

To get a list of texts:

[e.text for e in soup.find_all('dc:identifier' , {'xsi:type':'tel:URN'})]

When using css selectors with select() / select_one() instead of find() / find_all(), you have to escape the colon in the tag and attribute names (a raw string keeps the backslash intact):

soup.select(r'dc\:identifier[xsi\:type="tel:URN"]')

To get a list of the texts, combine it with a list comprehension:

[e.text for e in soup.select(r'dc\:identifier[xsi\:type="tel:URN"]')]

Example

from bs4 import BeautifulSoup

xml='''
<record>
<header>
<identifier>oai:dnb.de/dnb:reiheO/1254645608</identifier><datestamp>2022-04-01T23:49:32Z</datestamp>
<setspec>dnb:reiheO</setspec>
</header>
<metadata>
<dc xmlns="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dnb="http://d-nb.de/standards/dnbterms" xmlns:tel="http://krait.kb.nl/coop/tel/handbook/telterms.html">
<dc:title>Advantages of Simultaneous In Situ Multispecies Detection for Portable Emission Measurement Applications / Luigi Biondo, Henrik Gerken, Lars Illmann, Tim Steinhaus, Christian Beidl, Andreas Dreizler, Steven Wagner</dc:title>
<dc:identifier xsi:type="tel:URN">urn:nbn:de:tuda-tuprints-210508#fromFirstRecord</dc:identifier>
<dc:identifier xsi:type="tel:URL">http://nbn-resolving.de/urn:nbn:de:tuda-tuprints-210508</dc:identifier>
<dc:identifier xsi:type="tel:URL">http://d-nb.info/1254645608/34</dc:identifier>
<dc:identifier xsi:type="tel:URL">http://tuprints.ulb.tu-darmstadt.de/21050/</dc:identifier>
<dc:identifier xsi:type="dnb:IDN">1254645608</dc:identifier>
<dc:subject>670 Industrielle und handwerkliche Fertigung</dc:subject>
<dc:rights>lizenzfrei</dc:rights>
<dc:type>Online-Ressource</dc:type>
</dc>
</metadata>
</record>
<record>
<header>
<identifier>oai:dnb.de/dnb:reiheO/1254645608</identifier><datestamp>2022-04-01T23:49:32Z</datestamp>
<setspec>dnb:reiheO</setspec>
</header>
<metadata>
<dc xmlns="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dnb="http://d-nb.de/standards/dnbterms" xmlns:tel="http://krait.kb.nl/coop/tel/handbook/telterms.html">
<dc:title>Advantages of Simultaneous In Situ Multispecies Detection for Portable Emission Measurement Applications / Luigi Biondo, Henrik Gerken, Lars Illmann, Tim Steinhaus, Christian Beidl, Andreas Dreizler, Steven Wagner</dc:title>
<dc:identifier xsi:type="tel:URN">urn:nbn:de:tuda-tuprints-210508#fromSecondRecord</dc:identifier>
<dc:identifier xsi:type="tel:URL">http://nbn-resolving.de/urn:nbn:de:tuda-tuprints-210508</dc:identifier>
<dc:identifier xsi:type="tel:URL">http://d-nb.info/1254645608/34</dc:identifier>
<dc:identifier xsi:type="tel:URL">http://tuprints.ulb.tu-darmstadt.de/21050/</dc:identifier>
<dc:identifier xsi:type="dnb:IDN">1254645608</dc:identifier>
<dc:subject>670 Industrielle und handwerkliche Fertigung</dc:subject>
<dc:rights>lizenzfrei</dc:rights>
<dc:type>Online-Ressource</dc:type>
</dc>
</metadata>
</record>
'''


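# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(xml, 'lxml')

# Raw string: the backslashes escaping the colons must reach the selector unchanged.
[e.text for e in soup.select(r'dc\:identifier[xsi\:type="tel:URN"]')]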

Output

['urn:nbn:de:tuda-tuprints-210508#fromFirstRecord',
 'urn:nbn:de:tuda-tuprints-210508#fromSecondRecord']