How to grab a distinct attribute using BeautifulSoup from an XML file?

Time:08-19

Here is the data I am trying to extract:

<messages>

    <message type="General" code="ER">SECURITY ALERT message(s) found.</message>

    <message type="General">ORDER NUMBER: 7575757</message>

</messages>

I am trying to grab just the order number: 7575757

I have tried several methods of getting this attribute but with no success.

First attempt:

def parseTestID(testid):
    dict = {'ORDER NUMBER': testid.split(" ")[0].split(":")[0]}
    return dict

 


parsedData= []
    
for element in bs_data.find_all("messages"):
    for message in element.find_all("message"):
        dict = {'type': message['type'], 'ORDER NUMBER': parseTestID(message.string)['ORDER NUMBER']}
            # append dictionary to list
        parsedData.append(dict)

    # return list
    print(parsedData)

Output:

[{'type': 'General', 'ORDER NUMBER': 'SECURITY'}, {'type': 'General', 'ORDER NUMBER': 'ORDER'}]

Second attempt:

for element in bs_data.find_all("messages"):
    for message in element.find_all("message"):
        print(message.text)

Output:

    SECURITY ALERT message(s) found.
    ORDER NUMBER: 7575757

I feel that I am close, but I am not quite sure how to grab this specific attribute.

CodePudding user response:

You could get that number with the following:

from bs4 import BeautifulSoup
import re
html = '''
<messages>

    <message type="General" code="ER">SECURITY ALERT message(s) found.</message>

    <message type="General">ORDER NUMBER: 7575757</message>

</messages>
'''
soup = BeautifulSoup(html, 'html.parser')
desired_info = soup.find('message', string = re.compile('ORDER NUMBER:')).text.split(':')[1].strip()
print(desired_info)

This would return just the number:

7575757

If you want the full string, skip the split and use .text on its own. I strongly recommend reading through the BeautifulSoup documentation: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
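As a side note, the order number here is the text inside the message element, not an XML attribute such as type or code. The rough sketch below uses plain strings (no parsing) to show why the first attempt in the question produced 'SECURITY' and 'ORDER', and what splitting on the colon does instead. The names texts, first_attempt and values are made up for illustration.

```python
# The two message texts from the sample XML in the question
texts = ["SECURITY ALERT message(s) found.", "ORDER NUMBER: 7575757"]

# First attempt from the question: take the first space-separated word,
# then whatever comes before a colon in that word
first_attempt = [t.split(" ")[0].split(":")[0] for t in texts]
print(first_attempt)  # ['SECURITY', 'ORDER']

# Splitting on the colon and keeping the right-hand side gives the value;
# texts without a colon are skipped
values = [t.split(":")[1].strip() for t in texts if ":" in t]
print(values)  # ['7575757']
```

So the original split kept the label and discarded the value, which is the reverse of what was wanted.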

CodePudding user response:

Another solution, without re:

num = soup.select_one('message:-soup-contains("ORDER NUMBER")').text.split()[-1]
print(num)

Prints:

7575757
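One caveat that applies to both answers: find() and select_one() return None when nothing matches, so chaining .text onto the result raises AttributeError if the order line is missing. Below is a minimal sketch of a more defensive version, assuming the same sample data. The helper name get_order_number and its label parameter are hypothetical, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

xml = '''
<messages>
    <message type="General" code="ER">SECURITY ALERT message(s) found.</message>
    <message type="General">ORDER NUMBER: 7575757</message>
</messages>
'''

def get_order_number(markup, label="ORDER NUMBER"):
    """Return the text after 'label:' in the first matching <message>, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    for message in soup.find_all("message"):
        # partition() splits on the first colon only; sep is '' if there is none
        key, sep, value = message.get_text().partition(":")
        if sep and key.strip() == label:
            return value.strip()
    return None  # no message carried the label

order_number = get_order_number(xml)
print(order_number)
```

With the sample data this prints 7575757, and it returns None instead of raising when no message starts with the label.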