How to parse XML with namespaces in tags using BeautifulSoup?

I have an XML feed (http://api.worldbank.org/v2/countries) with the following data:

<wb:countries xmlns:wb="http://www.worldbank.org" page="1" pages="6" per_page="50" total="299">
<wb:country id="ABW">
<wb:iso2Code>AW</wb:iso2Code>
<wb:name>Aruba</wb:name>
<wb:region id="LCN" iso2code="ZJ">Latin America &amp; Caribbean </wb:region>
<wb:adminregion id="" iso2code=""/>
<wb:incomeLevel id="HIC" iso2code="XD">High income</wb:incomeLevel>
<wb:lendingType id="LNX" iso2code="XX">Not classified</wb:lendingType>
<wb:capitalCity>Oranjestad</wb:capitalCity>
<wb:longitude>-70.0167</wb:longitude>
<wb:latitude>12.5167</wb:latitude>
</wb:country>
<wb:country id="AFE">
<wb:iso2Code>ZH</wb:iso2Code>
<wb:name>Africa Eastern and Southern</wb:name>
<wb:region id="NA" iso2code="NA">Aggregates</wb:region>
<wb:adminregion id="" iso2code=""/>
<wb:incomeLevel id="NA" iso2code="NA">Aggregates</wb:incomeLevel>
<wb:lendingType id="" iso2code="">Aggregates</wb:lendingType>
<wb:capitalCity/>
<wb:longitude/>
<wb:latitude/>
</wb:country>
</wb:countries>

I tried to parse the incomeLevel, but it returns None. How can I reach the text (e.g. High income) in the XML using BeautifulSoup? I tried this code, but it does not work as it should:

import requests
#import re
from bs4 import BeautifulSoup
url = 'http://api.worldbank.org/v2/countries'

response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
countries= soup.findAll('wb:country')
for country in countries:
    name = country.find("wb:name").text
    code = country.find('wb:iso2code').text
    incomeLevel = country.find('wb:incomeLevel', {"iso2code":"XD"})
    print(f"{name}, {code}, {incomeLevel}")

CodePudding user response：

Here's how to use the ElementTree module from the standard library's xml package, taking the namespace into account:

from xml.etree import ElementTree as ET

ns = {'wb': 'http://www.worldbank.org'}

countries = ET.parse('input.xml').getroot()

for country in countries.findall('wb:country', namespaces=ns):
    name = country.find("wb:name", namespaces=ns).text
    code = country.find('wb:iso2Code', namespaces=ns).text
    
    incomeLevel = None
    for x in country.findall('wb:incomeLevel', namespaces=ns):
        if x.get('iso2code') == 'XD':
            incomeLevel = x.text
            break
    
    print(f"{name}, {code}, {incomeLevel}")

When I run that on the sample you provided, saved as input.xml, I get:

Aruba, AW, High income
Africa Eastern and Southern, ZH, None
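
As a possible shortening of the loop above, ElementTree's limited XPath support accepts an attribute predicate, so the inner loop can be folded into a single lookup. This is only a minimal sketch: it assumes the XML is already held in a string, the sample is trimmed from the question, and SAMPLE and income_levels are hypothetical names used for illustration.

```python
from xml.etree import ElementTree as ET

# Trimmed copy of the sample from the question (assumption: the live feed has the same shape).
SAMPLE = """<wb:countries xmlns:wb="http://www.worldbank.org" page="1" pages="6" per_page="50" total="299">
<wb:country id="ABW">
<wb:iso2Code>AW</wb:iso2Code>
<wb:name>Aruba</wb:name>
<wb:incomeLevel id="HIC" iso2code="XD">High income</wb:incomeLevel>
</wb:country>
<wb:country id="AFE">
<wb:iso2Code>ZH</wb:iso2Code>
<wb:name>Africa Eastern and Southern</wb:name>
<wb:incomeLevel id="NA" iso2code="NA">Aggregates</wb:incomeLevel>
</wb:country>
</wb:countries>"""

ns = {'wb': 'http://www.worldbank.org'}


def income_levels(xml_text, iso2code='XD'):
    """Return (name, code, income level) tuples; the level is None when no
    wb:incomeLevel child carries the requested iso2code attribute."""
    root = ET.fromstring(xml_text)
    rows = []
    for country in root.findall('wb:country', ns):
        name = country.findtext('wb:name', namespaces=ns)
        code = country.findtext('wb:iso2Code', namespaces=ns)
        # The [@iso2code='...'] predicate replaces the explicit inner loop.
        level = country.findtext(f"wb:incomeLevel[@iso2code='{iso2code}']", namespaces=ns)
        rows.append((name, code, level))
    return rows


rows = income_levels(SAMPLE)
for name, code, level in rows:
    print(f"{name}, {code}, {level}")
```

findtext() returns None when nothing matches, so the second country still prints None, as in the loop version.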

CodePudding user response：

Thanks for posting the question. I think you have a few things wrong in the code. This will help you rectify your errors.

Instead of findAll, use the current method name, find_all, from the bs4 API. Please refer to this link: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names
As the second argument to BeautifulSoup, specify "lxml-xml" or simply "xml". This instructs BeautifulSoup to parse the input as an XML document; otherwise it is parsed as HTML, and your question is about extracting data from an XML document. Please refer to the following link: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

Please refer to the following code snippet :)

import requests
from bs4 import BeautifulSoup

url = 'http://api.worldbank.org/v2/countries'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'xml')
countries = soup.find_all('wb:country')
for country in countries:
    name = country.find('wb:name').text
    code = country.find('wb:iso2Code').text
    income_level = country.find('wb:incomeLevel').text
    print(f'Name: {name} Code: {code} Income Level: {income_level}')

CodePudding user response：

What happens?

BeautifulSoup has trouble with the namespaced XML because the document is parsed as HTML, which is why find() returns None. Even when a tag is found, you get the tag itself rather than its text unless you add .text. So it is a combination of both.
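
One way to see why the HTML route fails: HTML tag names are case-insensitive, so HTML parsers fold them to lower case, and a search for 'wb:incomeLevel' then has nothing to match (while the lowercase 'wb:iso2code' lookup in the question does). The sketch below illustrates the folding with the standard library's html.parser rather than BeautifulSoup's lxml backend, so treat it as an analogy and not as BeautifulSoup's internals; TagCollector is a hypothetical helper.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start tag name exactly as the HTML parser reports it."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands over tag names already converted to lower case.
        self.tags.append(tag)


collector = TagCollector()
collector.feed('<wb:incomeLevel id="HIC" iso2code="XD">High income</wb:incomeLevel>')
collector.close()

print(collector.tags)  # ['wb:incomelevel']
```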

How to fix it?

Pass 'xml' as the parser so that BeautifulSoup treats the document as XML rather than HTML and handles the namespaces properly:

soup = BeautifulSoup(response.content, 'xml')

Add .text to your result:

incomeLevel = country.find('incomeLevel').text

If you want only the countries whose incomeLevel has iso2code="XD", use a CSS selector with select() instead of find_all():

countries = soup.select('country:has(incomeLevel[iso2code="XD"])')
for country in countries:
    name = country.find("name").text
    code = country.find('iso2Code').text
    incomeLevel = country.find('incomeLevel').text
    print(f"{name}, {code}, {incomeLevel}")
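
For comparison, a rough standard-library counterpart of the :has() selector is sketched below. It assumes ElementTree's limited XPath, where a trailing '..' step moves from each matching incomeLevel back up to its parent country. The trimmed SAMPLE string and the names matches and names are hypothetical.

```python
from xml.etree import ElementTree as ET

# Trimmed sample based on the question's data (assumption: same structure as the live feed).
SAMPLE = """<wb:countries xmlns:wb="http://www.worldbank.org">
<wb:country id="ABW">
<wb:name>Aruba</wb:name>
<wb:incomeLevel id="HIC" iso2code="XD">High income</wb:incomeLevel>
</wb:country>
<wb:country id="AFE">
<wb:name>Africa Eastern and Southern</wb:name>
<wb:incomeLevel id="NA" iso2code="NA">Aggregates</wb:incomeLevel>
</wb:country>
</wb:countries>"""

ns = {'wb': 'http://www.worldbank.org'}
root = ET.fromstring(SAMPLE)

# Select incomeLevel elements with iso2code="XD", then step up to their parent country.
matches = root.findall("wb:country/wb:incomeLevel[@iso2code='XD']/..", ns)
names = [country.findtext('wb:name', namespaces=ns) for country in matches]
print(names)
```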

NOTE: The xml parser is case-sensitive, so find('iso2code') won't work; use find('iso2Code') instead.

Example

NOTE: In new code, prefer the current find_all() syntax over the outdated findAll().

import requests
from bs4 import BeautifulSoup

url = 'http://api.worldbank.org/v2/countries'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'xml')
countries = soup.find_all('country')
for country in countries:
    name = country.find("name").text
    code = country.find('iso2Code').text
    incomeLevel = country.find('incomeLevel').text
    print(f"{name}, {code}, {incomeLevel}")

Output

Aruba, AW, High income
Africa Eastern and Southern, ZH, Aggregates
Afghanistan, AF, Low income
Africa, A9, Aggregates
Africa Western and Central, ZI, Aggregates
Angola, AO, Lower middle income
Albania, AL, Upper middle income
Andorra, AD, High income
Arab World, 1A, Aggregates
United Arab Emirates, AE, High income
...