How to parse an HTML document with &lt; and &gt; using BeautifulSoup?

I have multiple HTML documents that I need to parse for one very specific value, but I am not used to HTML documents that use &lt; and &gt; in place of the tag delimiters. I know they stand for < and >, but I would like to know how to deal with this.

Snippet of the HTML doc:

&lt;score_result&gt;
            &lt;Models&gt;
                &lt;Model&gt;
                    &lt;Id&gt;CLASS&lt;/Id&gt;
                    &lt;Description&gt;Classifier Model 2.0&lt;/Description&gt;
                    &lt;Score&gt;613&lt;/Score&gt;
                    &lt;Messages&gt;
                        &lt;Message&gt;
                            &lt;MessageType&gt;RC&lt;/MessageType&gt;
                            &lt;Code&gt;017111&lt;/Code&gt;
                            &lt;Description&gt;# of bananas, S&amp;amp;Accounts Established&lt;/Description&gt;
                        &lt;/Message&gt;
                        &lt;Message&gt;
                            &lt;MessageType&gt;RC&lt;/MessageType&gt;
                            &lt;Code&gt;P11P&lt;/Code&gt;
                            &lt;Description&gt;Absence of Banana&lt;/Description&gt;
                        &lt;/Message&gt;
                        &lt;Message&gt;
                            &lt;MessageType&gt;RC&lt;/MessageType&gt;
                            &lt;Code&gt;0111&lt;/Code&gt;
                            &lt;Description&gt;Presence of a Banana&lt;/Description&gt;
                        &lt;/Message&gt;
                        &lt;Message&gt;
                            &lt;MessageType&gt;RC&lt;/MessageType&gt;
                            &lt;Code&gt;0111&lt;/Code&gt;
                            &lt;Description&gt;# of Inquiries&lt;/Description&gt;
                        &lt;/Message&gt;
                    &lt;/Messages&gt;
                &lt;/Model&gt;
            &lt;/Models&gt;
        &lt;/score_result&gt;

I am specifically just trying to grab the Score value, <Score>613</Score> from these HTML documents.

I first turn my dataframe into a list of (ID, data) tuples so I can iterate over each HTML document, create a BeautifulSoup object, and then try to find the tag with .find_all().

I get an empty result every time. I also considered using regex, but wanted to see what other people think.

My code:

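from bs4 import BeautifulSoup as BS

# Pair each ID with its (entity-escaped) HTML document
result = [(x, y) for x, y in zip(df['ID'], df['data'])]

Score_lst = []

for row in result:
    try:
        Bs_data = BS(row[1], 'html.parser')
        # This comes back empty for every document
        Score_lst.append(Bs_data.find_all('score'))
    except Exception:
        print('Na')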

Expected Output:

Score_lst 

[613,
...,
...,
....]

The ... will be the other values I will parse.

CodePudding user response:

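Here is one way to solve this: parse the document once so that BeautifulSoup decodes the entities into plain text, then parse that text a second time so the decoded tags are recognized as real tags.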

from bs4 import BeautifulSoup as bs

html = '''
&lt;score_result&gt;
            &lt;Models&gt;
                &lt;Model&gt;
                    &lt;Id&gt;CLASS&lt;/Id&gt;
                    &lt;Description&gt;Classifier Model 2.0&lt;/Description&gt;
                    &lt;Score&gt;613&lt;/Score&gt;
                    &lt;Messages&gt;
                        &lt;Message&gt;
                            &lt;MessageType&gt;RC&lt;/MessageType&gt;
                            &lt;Code&gt;017111&lt;/Code&gt;
                            &lt;Description&gt;# of bananas, S&amp;amp;Accounts Established&lt;/Description&gt;
                        &lt;/Message&gt;
                        &lt;Message&gt;
                            &lt;MessageType&gt;RC&lt;/MessageType&gt;
                            &lt;Code&gt;P11P&lt;/Code&gt;
                            &lt;Description&gt;Absence of Banana&lt;/Description&gt;
                        &lt;/Message&gt;
                        &lt;Message&gt;
                            &lt;MessageType&gt;RC&lt;/MessageType&gt;
                            &lt;Code&gt;0111&lt;/Code&gt;
                            &lt;Description&gt;Presence of a Banana&lt;/Description&gt;
                        &lt;/Message&gt;
                        &lt;Message&gt;
                            &lt;MessageType&gt;RC&lt;/MessageType&gt;
                            &lt;Code&gt;0111&lt;/Code&gt;
                            &lt;Description&gt;# of Inquiries&lt;/Description&gt;
                        &lt;/Message&gt;
                    &lt;/Messages&gt;
                &lt;/Model&gt;
            &lt;/Models&gt;
        &lt;/score_result&gt;

'''

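# Inner parse: decodes the entities, so .text contains real < and > characters.
# Outer parse: treats that decoded text as markup and builds the tag tree.
soup = bs(bs(html, 'html.parser').text, 'html.parser')
score = soup.select_one('Score')
print('And here is your score:', score.text)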

Result in terminal:

And here is your score: 613
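
If you would rather not parse twice, a possible variation is to decode the entities with the standard library's html.unescape and then parse once. The sketch below wraps that in a small helper and runs it over a list of documents, which is closer to the loop in the question. The helper name extract_score and the two sample strings are made up for illustration; they stand in for the real df['data'] values.

```python
import html
from bs4 import BeautifulSoup


def extract_score(escaped_doc):
    """Return the text of the first <Score> tag, or None if there is none."""
    # Turn &lt;Score&gt; back into <Score> so the parser sees real tags.
    markup = html.unescape(escaped_doc)
    soup = BeautifulSoup(markup, "html.parser")
    # html.parser lowercases tag names, so search for "score".
    tag = soup.find("score")
    return tag.get_text(strip=True) if tag is not None else None


# Hypothetical sample data standing in for df['data'].
docs = [
    "&lt;score_result&gt;&lt;Score&gt;613&lt;/Score&gt;&lt;/score_result&gt;",
    "&lt;score_result&gt;&lt;Id&gt;CLASS&lt;/Id&gt;&lt;/score_result&gt;",
]

score_lst = [extract_score(doc) for doc in docs]
print(score_lst)  # ['613', None]
```

With the dataframe from the question, something like df['data'].map(extract_score) should give one score per row, with None wherever a document has no Score tag. The values come back as strings, so convert them with int() if you need numbers.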