Struggling to parse HTML with BS4

Time:10-29

I have this HTML snippet:

<div class="chatlog__message-primary">
   <div ><span  data-user-id="568352418596587458" title="discorduser#1234">Discord User</span> <span ><a href="#chatlog__message-container-854963254185698547">16-Jan-22 12:33 PM</a></span></div>
   <div class="chatlog__attachment"><a href="imageurl here"> <img alt="Image attachment"  loading="lazy" src="imageurl here" title="Image: image title.jpg (2.12 MB)"/> </a></div>
</div>

and when I run this Python code:

from bs4 import BeautifulSoup, Tag
html = open("test.html", encoding='utf-8', buffering=100000).read()
soup = BeautifulSoup(html, 'lxml')
allMessages = soup.find_all('div', class_="chatlog__message-primary")
discordId = soup.find('span', {'data-user-id':'568352418596587458'})
for message in allMessages:
    if discordId in message:
        print (message)

it does not print anything, but I can do

for message in allMessages:
    print(discordId)

and it prints the span with all its attributes, so the span is found, but I can't get the loop to filter on it. Alternatively, I can do

for div in soup.find_all('div', class_='chatlog__attachment'):
    print(div.a['href'])

but then I lose the ability to filter based on data-user-id.

CodePudding user response:

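You cannot use the "in" operator here to check whether the element is inside the message. On a Tag, "in" only tests the direct children, and the span is nested one level deeper, so the test is always false. Search inside each message with find() instead: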

for message in allMessages:
    if message.find('span', {'data-user-id':'568352418596587458'}):
        print (message)
        print (message.a['href'])
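
To make the difference concrete, here is a minimal sketch on trimmed-down, hypothetical markup (the class name is taken from the question's selector, and html.parser is used only so the snippet has no extra dependency). It compares the "in" test with a find() call on the same message:

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup: the span is a grandchild of the outer div.
html = '''
<div class="chatlog__message-primary">
  <div><span data-user-id="568352418596587458">Discord User</span></div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
message = soup.find('div', class_='chatlog__message-primary')
span = soup.find('span', {'data-user-id': '568352418596587458'})

# "in" compares against message.contents, i.e. the direct children only.
direct_child = span in message

# find() walks all descendants, so it reaches the nested span.
descendant = message.find('span', {'data-user-id': '568352418596587458'}) is not None

print(direct_child, descendant)
```

The first value is False and the second is True, which is why the original loop never printed anything.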

An alternative is to use a CSS selector with the :has() pseudo-class, which does the filtering in a single step:

for e in soup.select('div.chatlog__message-primary:has([data-user-id="568352418596587458"])'):
    print (e.a['href'])

Example

from bs4 import BeautifulSoup

# Class names are assumed from the selectors used in the question.
html = '''
<div class="chatlog__message-primary">
   <div ><span  data-user-id="568352418596587458" title="discorduser#1234">Discord User</span> <span ><a href="#chatlog__message-container-854963254185698547">16-Jan-22 12:33 PM</a></span></div>
   <div class="chatlog__attachment"><a href="imageurl here"> <img alt="Image attachment"  loading="lazy" src="imageurl here" title="Image: image title.jpg (2.12 MB)"/> </a></div>
</div>
'''

# Name the parser explicitly to avoid the "no parser specified" warning.
soup = BeautifulSoup(html, 'html.parser')

# e.a is the first link inside each matching message.
for e in soup.select('div.chatlog__message-primary:has([data-user-id="568352418596587458"])'):
    print(e.a['href'])

Output

#chatlog__message-container-854963254185698547
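
Note that e.a is the first link in the message, which is the timestamp anchor, as the output above shows. If the goal is the image URL from the attachment block, one possible sketch is to extend the selector down to the attachment link. The helper name attachment_urls is hypothetical, and the class names are assumed from the selectors in the question:

```python
from bs4 import BeautifulSoup

# Markup assumed from the question; class names come from the selectors used there.
html = '''
<div class="chatlog__message-primary">
   <div><span data-user-id="568352418596587458" title="discorduser#1234">Discord User</span> <span><a href="#chatlog__message-container-854963254185698547">16-Jan-22 12:33 PM</a></span></div>
   <div class="chatlog__attachment"><a href="imageurl here"> <img alt="Image attachment" loading="lazy" src="imageurl here" title="Image: image title.jpg (2.12 MB)"/> </a></div>
</div>
'''


def attachment_urls(soup, user_id):
    """Hypothetical helper: attachment links from messages written by user_id."""
    selector = (
        # Keep only messages that contain the user's span ...
        f'div.chatlog__message-primary:has([data-user-id="{user_id}"]) '
        # ... then descend to the link inside each attachment block.
        'div.chatlog__attachment > a[href]'
    )
    return [a['href'] for a in soup.select(selector)]


soup = BeautifulSoup(html, 'html.parser')
urls = attachment_urls(soup, '568352418596587458')
print(urls)
```

Messages from other users, or messages without an attachment, simply contribute nothing to the list.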