I have an XML endpoint (http://api.worldbank.org/v2/countries) that returns the following data:
<wb:countries xmlns:wb="http://www.worldbank.org" page="1" pages="6" per_page="50" total="299">
  <wb:country id="ABW">
    <wb:iso2Code>AW</wb:iso2Code>
    <wb:name>Aruba</wb:name>
    <wb:region id="LCN" iso2code="ZJ">Latin America &amp; Caribbean </wb:region>
    <wb:adminregion id="" iso2code=""/>
    <wb:incomeLevel id="HIC" iso2code="XD">High income</wb:incomeLevel>
    <wb:lendingType id="LNX" iso2code="XX">Not classified</wb:lendingType>
    <wb:capitalCity>Oranjestad</wb:capitalCity>
    <wb:longitude>-70.0167</wb:longitude>
    <wb:latitude>12.5167</wb:latitude>
  </wb:country>
  <wb:country id="AFE">
    <wb:iso2Code>ZH</wb:iso2Code>
    <wb:name>Africa Eastern and Southern</wb:name>
    <wb:region id="NA" iso2code="NA">Aggregates</wb:region>
    <wb:adminregion id="" iso2code=""/>
    <wb:incomeLevel id="NA" iso2code="NA">Aggregates</wb:incomeLevel>
    <wb:lendingType id="" iso2code="">Aggregates</wb:lendingType>
    <wb:capitalCity/>
    <wb:longitude/>
    <wb:latitude/>
  </wb:country>
</wb:countries>
I tried to parse incomeLevel, but it returns None. How can I get the element's text (e.g. High income) from the XML using BeautifulSoup? I tried this code, but it does not work as expected:
import requests
#import re
from bs4 import BeautifulSoup

url = 'http://api.worldbank.org/v2/countries'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
countries = soup.findAll('wb:country')
for country in countries:
    name = country.find("wb:name").text
    code = country.find('wb:iso2code').text
    incomeLevel = country.find('wb:incomeLevel', {"iso2code":"XD"})
    print(f"{name}, {code}, {incomeLevel}")
CodePudding user response:
Here's how to do it with the ElementTree module from the standard library's xml package, taking the namespace into account:
from xml.etree import ElementTree as ET

# Map the wb prefix to the namespace URI declared on the root element
ns = {'wb': 'http://www.worldbank.org'}
countries = ET.parse('input.xml').getroot()
for country in countries.findall('wb:country', namespaces=ns):
    name = country.find("wb:name", namespaces=ns).text
    code = country.find('wb:iso2Code', namespaces=ns).text
    # Keep the text only when the element carries iso2code="XD"
    incomeLevel = None
    for x in country.findall('wb:incomeLevel', namespaces=ns):
        if x.get('iso2code') == 'XD':
            incomeLevel = x.text
            break
    print(f"{name}, {code}, {incomeLevel}")
When I run that on the sample you provided, saved as input.xml, I get:
Aruba, AW, High income
Africa Eastern and Southern, ZH, None
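The inner loop can probably be shortened, because ElementTree's limited XPath support accepts an attribute predicate. Below is a minimal sketch under a few assumptions: SAMPLE is a trimmed copy of the XML from the question, and income_rows is a hypothetical helper name, not something from the original answer.

```python
import xml.etree.ElementTree as ET

NS = {"wb": "http://www.worldbank.org"}

# Trimmed copy of the sample from the question (assumption: only the
# elements used below are kept).
SAMPLE = """<wb:countries xmlns:wb="http://www.worldbank.org" page="1" pages="6" per_page="50" total="299">
  <wb:country id="ABW">
    <wb:iso2Code>AW</wb:iso2Code>
    <wb:name>Aruba</wb:name>
    <wb:incomeLevel id="HIC" iso2code="XD">High income</wb:incomeLevel>
  </wb:country>
  <wb:country id="AFE">
    <wb:iso2Code>ZH</wb:iso2Code>
    <wb:name>Africa Eastern and Southern</wb:name>
    <wb:incomeLevel id="NA" iso2code="NA">Aggregates</wb:incomeLevel>
  </wb:country>
</wb:countries>"""


def income_rows(xml_text):
    """Return (name, code, income) tuples; income is None unless iso2code is XD."""
    root = ET.fromstring(xml_text)
    rows = []
    for country in root.findall("wb:country", NS):
        name = country.findtext("wb:name", namespaces=NS)
        code = country.findtext("wb:iso2Code", namespaces=NS)
        # The [@iso2code='XD'] predicate replaces the explicit inner loop;
        # findtext returns None when no element matches.
        income = country.findtext("wb:incomeLevel[@iso2code='XD']", namespaces=NS)
        rows.append((name, code, income))
    return rows


for name, code, income in income_rows(SAMPLE):
    print(f"{name}, {code}, {income}")
```

findtext returns its default (None) when nothing matches, so the second country still prints None, as in the output above.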
CodePudding user response:
Thanks for posting the question. A couple of things in the code need changing:
- Instead of findAll, use the current method name find_all from the bs4 API (findAll is a legacy alias). See https://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names
- As the second argument of BeautifulSoup, specify "lxml-xml" or simply "xml". This tells BeautifulSoup to parse the input as an XML document; with "lxml" it is parsed as HTML, which is not what you want when extracting data from an XML document. See https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
Here is the corrected code:
import requests
from bs4 import BeautifulSoup

url = 'http://api.worldbank.org/v2/countries'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'xml')
countries = soup.find_all('wb:country')
for country in countries:
    name = country.find('wb:name').text
    code = country.find('wb:iso2Code').text
    income_level = country.find('wb:incomeLevel').text
    print(f'Name: {name} Code: {code} Income Level: {income_level}')
CodePudding user response:
What happens?
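The BeautifulSoup object is created with an HTML parser, which does not handle the XML namespaces and mixed-case tag names well. HTML parsers convert tag names to lowercase, so find('wb:incomeLevel') never matches, and that is why you get None. In addition, you will not get the text unless you add .text to the result. So it is a combination of both.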
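The lowercasing can be seen without BeautifulSoup at all. The sketch below uses the standard library's html.parser as a stand-in (an assumption: the question used the lxml HTML parser, which is not exercised here), and TagNameCollector and html_tag_names are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class TagNameCollector(HTMLParser):
    """Record every start-tag name exactly as the HTML parser reports it."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def html_tag_names(markup):
    collector = TagNameCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags


names = html_tag_names(
    '<wb:incomeLevel id="HIC" iso2code="XD">High income</wb:incomeLevel>'
)
print(names)  # ['wb:incomelevel']
```

The mixed-case name 'wb:incomeLevel' is gone after HTML parsing, which would explain why the lowercase 'wb:iso2code' lookup in the question worked while the 'wb:incomeLevel' lookup did not.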
How to fix it?
Pass 'xml' as the parser to BeautifulSoup so the document is handled as valid XML, namespaces included, rather than as HTML:
soup = BeautifulSoup(response.content, 'xml')
Add .text to your result:
incomeLevel = country.find('incomeLevel').text
If you want only the countries whose incomeLevel has iso2code="XD", change your selector and use CSS selectors instead of find_all():
countries = soup.select('country:has(incomeLevel[iso2code="XD"])')
for country in countries:
    name = country.find("name").text
    code = country.find('iso2Code').text
    incomeLevel = country.find('incomeLevel').text
    print(f"{name}, {code}, {incomeLevel}")
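For comparison, roughly the same filter can be expressed with the standard library's ElementTree instead of BeautifulSoup, by matching the incomeLevel element and stepping up to its parent with "..". This is only a sketch: SAMPLE is a trimmed copy of the question's XML, and countries_with_income_code is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NS = {"wb": "http://www.worldbank.org"}

# Trimmed copy of the sample from the question (assumption: only the
# elements used below are kept).
SAMPLE = """<wb:countries xmlns:wb="http://www.worldbank.org">
  <wb:country id="ABW">
    <wb:iso2Code>AW</wb:iso2Code>
    <wb:name>Aruba</wb:name>
    <wb:incomeLevel id="HIC" iso2code="XD">High income</wb:incomeLevel>
  </wb:country>
  <wb:country id="AFE">
    <wb:iso2Code>ZH</wb:iso2Code>
    <wb:name>Africa Eastern and Southern</wb:name>
    <wb:incomeLevel id="NA" iso2code="NA">Aggregates</wb:incomeLevel>
  </wb:country>
</wb:countries>"""


def countries_with_income_code(xml_text, iso2code):
    """Return (name, code) for countries whose incomeLevel has the given iso2code."""
    root = ET.fromstring(xml_text)
    # Find matching incomeLevel elements anywhere, then select their parents.
    path = f".//wb:incomeLevel[@iso2code='{iso2code}']/.."
    return [
        (
            country.findtext("wb:name", namespaces=NS),
            country.findtext("wb:iso2Code", namespaces=NS),
        )
        for country in root.findall(path, NS)
    ]


print(countries_with_income_code(SAMPLE, "XD"))  # [('Aruba', 'AW')]
```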
NOTE: The xml parser is case sensitive, so find('iso2code') won't work; you have to use find('iso2Code').
Example
NOTE: In new code, prefer the current syntax find_all() over the outdated findAll().
import requests
#import re
from bs4 import BeautifulSoup

url = 'http://api.worldbank.org/v2/countries'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'xml')
countries = soup.find_all('country')
for country in countries:
    name = country.find("name").text
    code = country.find('iso2Code').text
    incomeLevel = country.find('incomeLevel').text
    print(f"{name}, {code}, {incomeLevel}")
Output
Aruba, AW, High income
Africa Eastern and Southern, ZH, Aggregates
Afghanistan, AF, Low income
Africa, A9, Aggregates
Africa Western and Central, ZI, Aggregates
Angola, AO, Lower middle income
Albania, AL, Upper middle income
Andorra, AD, High income
Arab World, 1A, Aggregates
United Arab Emirates, AE, High income
...