I have multiple HTML documents that I need to parse for one very specific value, but I am not used to HTML documents having &lt; and &gt; in place of the tag brackets. I know they represent < and >, but I would like to know if anyone knows how to deal with this issue.
Snippet of the HTML doc:
&lt;score_result&gt;
&lt;Models&gt;
&lt;Model&gt;
&lt;Id&gt;CLASS&lt;/Id&gt;
&lt;Description&gt;Classifier Model 2.0&lt;/Description&gt;
&lt;Score&gt;613&lt;/Score&gt;
&lt;Messages&gt;
&lt;Message&gt;
&lt;MessageType&gt;RC&lt;/MessageType&gt;
&lt;Code&gt;017111&lt;/Code&gt;
&lt;Description&gt;# of bananas, S&amp;Accounts Established&lt;/Description&gt;
&lt;/Message&gt;
&lt;Message&gt;
&lt;MessageType&gt;RC&lt;/MessageType&gt;
&lt;Code&gt;P11P&lt;/Code&gt;
&lt;Description&gt;Absence of Banana&lt;/Description&gt;
&lt;/Message&gt;
&lt;Message&gt;
&lt;MessageType&gt;RC&lt;/MessageType&gt;
&lt;Code&gt;0111&lt;/Code&gt;
&lt;Description&gt;Presence of a Banana&lt;/Description&gt;
&lt;/Message&gt;
&lt;Message&gt;
&lt;MessageType&gt;RC&lt;/MessageType&gt;
&lt;Code&gt;0111&lt;/Code&gt;
&lt;Description&gt;# of Inquiries&lt;/Description&gt;
&lt;/Message&gt;
&lt;/Messages&gt;
&lt;/Model&gt;
&lt;/Models&gt;
&lt;/score_result&gt;
I am specifically trying to grab just the Score value, <Score>613</Score>, from these HTML documents.
I first zip my dataframe columns into a list of tuples so I can iterate through each HTML document, create a BeautifulSoup object for it, then try to find the tag with .find_all().
I get an empty result every time. I also considered using regex, but wanted to see what other people think.
My code:
result = [(x,y) for x,y in zip(df['ID'], df['data'])]
Score_lst = []
for row in result:
    try:
        Bs_data = BS(row[1])
        Score_lst.append(Bs_data.find_all('score'))
    except:
        print('Na')
Expected Output:
Score_lst
[613,
...,
...,
....]
The ... will be the other values I will parse.
CodePudding user response:
Here is one way to solve this conundrum:
from bs4 import BeautifulSoup as bs
html = '''
&lt;score_result&gt;
&lt;Models&gt;
&lt;Model&gt;
&lt;Id&gt;CLASS&lt;/Id&gt;
&lt;Description&gt;Classifier Model 2.0&lt;/Description&gt;
&lt;Score&gt;613&lt;/Score&gt;
&lt;Messages&gt;
&lt;Message&gt;
&lt;MessageType&gt;RC&lt;/MessageType&gt;
&lt;Code&gt;017111&lt;/Code&gt;
&lt;Description&gt;# of bananas, S&amp;Accounts Established&lt;/Description&gt;
&lt;/Message&gt;
&lt;Message&gt;
&lt;MessageType&gt;RC&lt;/MessageType&gt;
&lt;Code&gt;P11P&lt;/Code&gt;
&lt;Description&gt;Absence of Banana&lt;/Description&gt;
&lt;/Message&gt;
&lt;Message&gt;
&lt;MessageType&gt;RC&lt;/MessageType&gt;
&lt;Code&gt;0111&lt;/Code&gt;
&lt;Description&gt;Presence of a Banana&lt;/Description&gt;
&lt;/Message&gt;
&lt;Message&gt;
&lt;MessageType&gt;RC&lt;/MessageType&gt;
&lt;Code&gt;0111&lt;/Code&gt;
&lt;Description&gt;# of Inquiries&lt;/Description&gt;
&lt;/Message&gt;
&lt;/Messages&gt;
&lt;/Model&gt;
&lt;/Models&gt;
&lt;/score_result&gt;
'''
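# Inner parse: the escaped markup is plain text, and .text returns it with the entities decoded.
# Outer parse: the decoded string now contains real tags that can be searched.
soup = bs(bs(html, 'html.parser').text, 'html.parser')
score = soup.select_one('Score')
print('And here is your score:', score.text)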
Result in terminal:
And here is your score: 613
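If you would rather not parse each document twice, the same idea can be sketched with the standard library only: decode the entities first, then search the resulting markup. This is a rough sketch under a few assumptions, not a drop-in replacement: extract_score and the docs list are hypothetical names standing in for your own function and the values of df['data'], and the regex fallback is only there for documents that are not well-formed XML once decoded (for example, a bare & left over from S&amp;Accounts).

```python
import html
import re
import xml.etree.ElementTree as ET


def extract_score(escaped_doc):
    """Return the first <Score> value as an int, or None if there is none."""
    # Turn &lt;Score&gt; back into <Score> so a parser can see real tags.
    markup = html.unescape(escaped_doc)
    try:
        # iter() walks the root and all descendants in document order.
        node = next(ET.fromstring(markup).iter("Score"), None)
        text = node.text if node is not None else None
    except ET.ParseError:
        # The decoded text is not well-formed XML (e.g. a bare "&"),
        # so fall back to a simple pattern match.
        match = re.search(r"<Score>\s*(\d+)\s*</Score>", markup)
        text = match.group(1) if match else None
    return int(text) if text and text.strip().isdigit() else None


# Hypothetical stand-ins for the values in df['data'].
docs = [
    "&lt;score_result&gt;&lt;Models&gt;&lt;Model&gt;&lt;Score&gt;613&lt;/Score&gt;"
    "&lt;/Model&gt;&lt;/Models&gt;&lt;/score_result&gt;",
    "&lt;score_result&gt;&lt;Description&gt;S&amp;Accounts&lt;/Description&gt;"
    "&lt;Score&gt;587&lt;/Score&gt;&lt;/score_result&gt;",
    "&lt;score_result&gt;&lt;/score_result&gt;",
]
score_lst = [extract_score(doc) for doc in docs]
print(score_lst)  # [613, 587, None]
```

With a dataframe, something like df['data'].map(extract_score) should give you one score per row, with None wherever a document has no Score element.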