How to parse XML namespaces in Python 3 and Beautiful Soup 4?

Time:09-19

I am trying to parse XML with BS4 in Python 3.

For some reason, I am not able to parse namespaces. I tried to look for answers in this question, but it doesn't work for me and I don't get any error message either.

Why does the first part work, but the second does not?

import requests
from bs4 import BeautifulSoup

input = """
<?xml version="1.0" encoding="utf-8"?>
<wb:countries page="1" pages="6" per_page="50" total="299" xmlns:wb="http://www.worldbank.org">
  <wb:country id="ABW">
    <wb:iso2Code>AW</wb:iso2Code>
    <wb:name>Aruba</wb:name>
    <wb:region id="LCN" iso2code="ZJ">Latin America &amp; Caribbean </wb:region>
    <wb:adminregion id="" iso2code="" />
    <wb:incomeLevel id="HIC" iso2code="XD">High income</wb:incomeLevel>
    <wb:lendingType id="LNX" iso2code="XX">Not classified</wb:lendingType>
    <wb:capitalCity>Oranjestad</wb:capitalCity>
    <wb:longitude>-70.0167</wb:longitude>
    <wb:latitude>12.5167</wb:latitude>
  </wb:country>
  <wb:country id="AFE">
    <wb:iso2Code>ZH</wb:iso2Code>
    <wb:name>Africa Eastern and Southern</wb:name>
    <wb:region id="NA" iso2code="NA">Aggregates</wb:region>
    <wb:adminregion id="" iso2code="" />
    <wb:incomeLevel id="NA" iso2code="NA">Aggregates</wb:incomeLevel>
    <wb:lendingType id="" iso2code="">Aggregates</wb:lendingType>
    <wb:capitalCity />
    <wb:longitude />
    <wb:latitude />
  </wb:country>
</wb:countries>

<item>
  <title>Some string</title>
  <pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
  <guid isPermaLink="false">4574785</guid>
  <link>https://somesite.com</link>
  <itunes:subtitle>A subtitle</itunes:subtitle>
  <enclosure length="0" type="audio/mpeg" url="https://assets.somesite.com/123.mp3"/>
  <itunes:image href="https://somesite.com/img.jpg"/>
  <itunes:duration>7845</itunes:duration>
  <itunes:explicit>no</itunes:explicit>
  <itunes:episodeType>Full</itunes:episodeType>
</item>
"""

soup = BeautifulSoup(input, 'xml')

# Working
for x in soup.find_all('wb:country'):
    print(x.find('wb:name').text)

# Not working
for x in soup.find_all('item'):
    print(x.find('itunes:subtitle').text)

Answer:

This looks like non-conforming XML in which two documents are mixed together, and the itunes prefix is used without being declared. The strict XML parser expects a namespace declaration for every prefix that is used. Use lxml instead to get your expected result from this wild mix:

soup = BeautifulSoup(xml_string, 'lxml')

# Working
for x in soup.find_all('wb:country'):
    print(x.find('wb:name').text)

# also working
for x in soup.find_all('item'):
    print(x.find('itunes:subtitle').text)
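
To illustrate why the mixed input trips up a strict parser, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than Beautiful Soup (my substitution, chosen only because it raises an error on malformed input instead of recovering silently). The helper name strict_parse_error and the shortened sample strings are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-ins for the two problems in the question's input.
two_roots = "<a><b/></a><item><title>Some string</title></item>"
undeclared_prefix = "<item><itunes:subtitle>A subtitle</itunes:subtitle></item>"
well_formed = (
    '<item xmlns:itunes="http://www.w3.org/TR/html4/">'
    "<itunes:subtitle>A subtitle</itunes:subtitle></item>"
)


def strict_parse_error(text):
    """Return the strict parser's error message, or None if text is well-formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        return str(exc)
    return None


# Both malformed samples are rejected; the declared-prefix sample parses cleanly.
print(strict_parse_error(two_roots))
print(strict_parse_error(undeclared_prefix))
print(strict_parse_error(well_formed))
```

A recovering parser swallows these problems, which would explain why the question's code produced no error message.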

Note: Avoid using Python built-in names such as input as variable names. Shadowing them can have unwanted effects on the rest of your code.


If the second document is separate, use:

for x in soup.find_all('item'):
    print(x.find('subtitle').text)

Example

from bs4 import BeautifulSoup

xml_string = """
<?xml version="1.0" encoding="utf-8"?>
<item>
  <title>Some string</title>
  <pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
  <guid isPermaLink="false">4574785</guid>
  <link>https://somesite.com</link>
  <itunes:subtitle>A subtitle</itunes:subtitle>
  <enclosure length="0" type="audio/mpeg" url="https://assets.somesite.com/123.mp3"/>
  <itunes:image href="https://somesite.com/img.jpg"/>
  <itunes:duration>7845</itunes:duration>
  <itunes:explicit>no</itunes:explicit>
  <itunes:episodeType>Full</itunes:episodeType>
</item>
"""

soup = BeautifulSoup(xml_string, 'xml')

# working
for x in soup.find_all('item'):
    print(x.find('subtitle').text)

Otherwise, you have to declare a namespace for the itunes prefix on your item element; you can then still use the XML parser:

<?xml version="1.0" encoding="utf-8"?>
<item xmlns:itunes="http://www.w3.org/TR/html4/">
  <title>Some string</title>
  <pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
  <guid isPermaLink="false">4574785</guid>
  <link>https://somesite.com</link>
  <itunes:subtitle>A subtitle</itunes:subtitle>
  ...

When a namespace is defined for an element, all child elements with the same prefix are associated with the same namespace.
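
As a rough illustration of that last point, the sketch below parses a shortened copy of the item with the standard library's xml.etree.ElementTree (again a substitution for Beautiful Soup, not part of the original code). The namespace URI is the placeholder from the snippet above, not the one a real feed would declare, and the variable names are assumptions:

```python
import xml.etree.ElementTree as ET

# Placeholder URI copied from the snippet above; a real feed declares its own.
ITUNES_NS = "http://www.w3.org/TR/html4/"

xml_string = """<item xmlns:itunes="http://www.w3.org/TR/html4/">
  <title>Some string</title>
  <itunes:subtitle>A subtitle</itunes:subtitle>
  <itunes:duration>7845</itunes:duration>
</item>"""

root = ET.fromstring(xml_string)

# Map the prefix used in the search path to the declared namespace URI.
ns = {"itunes": ITUNES_NS}

subtitle = root.find("itunes:subtitle", ns)
duration = root.find("itunes:duration", ns)
title = root.find("title")

# Prefixed children resolve to the same namespace; unprefixed ones have none.
print(subtitle.tag)   # {http://www.w3.org/TR/html4/}subtitle
print(duration.tag)   # {http://www.w3.org/TR/html4/}duration
print(title.tag)      # title
print(subtitle.text)  # A subtitle
```

Both itunes: children end up in the one namespace declared on item, while title stays outside it.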
