I'm trying to get the text content of the 'event-id' tag in the XML, but the hyphenated name is not recognized as an element. I know the script itself works, because if I replace the hyphen with an underscore in the XML and run the script, it works. Does anybody know what the problem could be?
<?xml version="1.0" encoding="UTF-8"?>
<eventsUpdate xmlns="http://nateng.com/xsd/NETworks">
<fullEventsUpdate xmlns="">
<fullEventUpdate xmlns="">
<event-reference xmlns="">
<event-id xmlns="">24425412</event-id>
<event-update xmlns="">34</event-update>
</event-reference>
</fullEventUpdate>
<fullEventUpdate xmlns="">
<event-reference xmlns="">
<event-id xmlns="">24342548</event-id>
<event-update xmlns="">34</event-update>
</event-reference>
</fullEventUpdate>
</fullEventsUpdate>
</eventsUpdate>
from bs4 import BeautifulSoup
dir_path = '20211006085201.xml'
file = open(dir_path, encoding='UTF-8')
contents = file.read()
soup = BeautifulSoup(contents, 'xml')
events = soup.find_all('fullEventUpdate')
print(' \n-------', len(events), 'events calculated on ', dir_path, '--------\n')
idi = soup.find_all('event-reference')
for x in range(0, len(events)):
    idText = (idi[x].event-id.get_text())
    print(idText)
CodePudding user response:
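The problem is that you are dealing with namespaced XML whose tag names contain a hyphen. Dotted access cannot reach such a tag, because Python parses idi[x].event-id as idi[x].event minus id. For that type of document, you should use CSS selectors instead: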
events = soup.select('fullEventUpdate')
for event in events:
    print(event.select_one('event-id').text)
Output:
24425412
24342548
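As a side note on why the dotted access in the question fails: a hyphen is not allowed in a Python attribute name, so the interpreter reads `event-id` as a subtraction. The minimal sketch below uses only the standard library's `ast` module to show how the expression is parsed. The name `tag` and the helper `describe` are placeholders made up for this illustration, and nothing is evaluated.

```python
import ast


def describe(source):
    """Return how Python parses `source`: node type, operator, left attr, right name."""
    # mode="eval" parses a single expression without executing it.
    expr = ast.parse(source, mode="eval").body
    return (
        type(expr).__name__,     # kind of expression node
        type(expr.op).__name__,  # operator between the two operands
        expr.left.attr,          # attribute looked up on `tag`
        expr.right.id,           # bare name on the right-hand side
    )


print(describe("tag.event-id"))
# ('BinOp', 'Sub', 'event', 'id')
```

So `tag.event-id` means "look up `tag.event`, then subtract `id`", which is why a lookup by string, such as `select_one('event-id')` or `find('event-id')`, is needed for hyphenated tag names.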
More generally, when dealing with XML documents, you are probably better off using something that supports XPath, such as lxml (full XPath 1.0) or ElementTree (a limited subset).
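As a rough illustration of the ElementTree route, here is a minimal sketch that uses only the standard library. The sample string is the document from the question with the XML declaration dropped, and the helper name `event_ids` is made up for this example. It relies on the fact that the inner elements carry `xmlns=""` and are therefore in no namespace, so no namespace prefix is needed in the path.

```python
import xml.etree.ElementTree as ET

# Sample data copied from the question (XML declaration omitted).
XML = """<eventsUpdate xmlns="http://nateng.com/xsd/NETworks">
<fullEventsUpdate xmlns="">
<fullEventUpdate xmlns="">
<event-reference xmlns="">
<event-id xmlns="">24425412</event-id>
<event-update xmlns="">34</event-update>
</event-reference>
</fullEventUpdate>
<fullEventUpdate xmlns="">
<event-reference xmlns="">
<event-id xmlns="">24342548</event-id>
<event-update xmlns="">34</event-update>
</event-reference>
</fullEventUpdate>
</fullEventsUpdate>
</eventsUpdate>"""


def event_ids(xml_text):
    """Collect the text of every event-id nested under an event-reference."""
    root = ET.fromstring(xml_text)
    # The tag name is passed as a string, so the hyphen is not a problem.
    return [el.text for el in root.findall(".//event-reference/event-id")]


for event_id in event_ids(XML):
    print(event_id)
# 24425412
# 24342548
```

If the inner elements were in the root's namespace instead, the path would need the namespace, for example through the `namespaces` argument of `findall`.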
CodePudding user response:
For XML parsing, the idiomatic approach is to use XPath selectors. In Python this can easily be done with the parsel package, which is similar to BeautifulSoup but built on top of lxml for full XPath support:
body = ...
from parsel import Selector
sel = Selector(body)
for event in sel.xpath("//event-reference"):
    print(event.xpath('event-id/text()').get())
results in:
24425412
24342548