Home > Net >  In Python, using beautiful soup- how do I get the text of a XML tag that contains a hyphen
In Python, using beautiful soup- how do I get the text of a XML tag that contains a hyphen

Time:10-07

I'm trying to get the text content on tag 'Event-id' in the XML, but hyphen is not recognizing as an element on the file, I know script is working well because if a replace the hyphen for a underscore in the XML and run the script it works, anybody knows which could be the problem?


<?xml version="1.0" encoding="UTF-8"?>
<eventsUpdate xmlns="http://nateng.com/xsd/NETworks">
    <fullEventsUpdate xmlns="">
        <fullEventUpdate xmlns="">
            <event-reference xmlns="">
                <event-id xmlns="">24425412</event-id>
                <event-update xmlns="">34</event-update>
            </event-reference>
        </fullEventUpdate>
        <fullEventUpdate xmlns="">
            <event-reference xmlns="">
                <event-id xmlns="">24342548</event-id>
                <event-update xmlns="">34</event-update>
            </event-reference>
        </fullEventUpdate>
    </fullEventsUpdate>
</eventsUpdate> 



from bs4 import BeautifulSoup

dir_path = '20211006085201.xml'

file = open(dir_path, encoding='UTF-8')
contents = file.read()
soup = BeautifulSoup(contents, 'xml')

events = soup.find_all('fullEventUpdate')


print(' \n-------', len(events), 'events calculated on ', dir_path, '--------\n')

idi = soup.find_all('event-reference')

for x in range(0, len(events)):
    idText = (idi[x].event-id.get_text())
    print(idText)

CodePudding user response:

The problem is you are dealing with namespaced xml, and for that type of document, you should use css selectors instead:

events = soup.select('fullEventUpdate')
for event in events:
    print(event.select_one('event-id').text)

Output:

24425412
24342548

More generally, in dealing with xml documents, you are probably better off using something which supports xpath (like lxml or ElementTree).

CodePudding user response:

For XML parsing idiomatic approach is to use xpath selectors.

In python this can be easily achieved with parsel package which is similar to beautifulsoup but built on top of lxml for full xpath support:

body = ...
from parsel import Selector
selector = Selector(body)
for event in sel.xpath("//event-reference"):
    print(event.xpath('event-id/text()').get())

results in:

24425412
24342548
  • Related