Unable to extract all the content after some "b" tag up to the next "b" tag

Time:09-28

I'm trying to scrape all the content available after any b tag, up to the next b tag, from a webpage. To show the whole picture, I'm attaching the relevant HTML elements.

The page has three b tags, whose text is Job Description, This is an internship, and Qualifications. When I pick any one of these b tags, I want to scrape everything between that b tag and the next b tag.

I've tried like this:

import requests
from bs4 import BeautifulSoup
from itertools import takewhile

link = 'https://filebin.varnish-software.com/tgmdhg8dp37tycmk/doc.html'

res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
desc = [i.get_text(strip=True) for i in takewhile(lambda tag: tag.name!='b', soup.select("div > b:contains('Job Description') ~ *"))]
print(desc)

Output I'm getting:

['This job entails researching developing testing and deploying mechanical solutions.', '', 'Typical activities include:', '', 'Designing and developing thermal or mechanical tooling systems.', '', 'The ideal candidate should exhibit the following behavioral traits:', '', 'Work in a technically diverse environment-Adapt to changing requirements.Verbal and written communication.Project management.', '', 'This is an internship.', 'Qualifications']

Output I wish to get (without the content of the last two b tags):

['This job entails researching developing testing and deploying mechanical solutions.', '', 'Typical activities include:', '', 'Designing and developing thermal or mechanical tooling systems.', '', 'The ideal candidate should exhibit the following behavioral traits:', '', 'Work in a technically diverse environment-Adapt to changing requirements.Verbal and written communication.Project management.']

EDIT:

Here is another link to test against.

CodePudding user response:

Try this:

import requests
from bs4 import BeautifulSoup

link = 'https://filebin.varnish-software.com/tgmdhg8dp37tycmk/doc.html'
soup = BeautifulSoup(requests.get(link).text, "lxml")
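# Collect every non-empty text node whose direct parent is not a <b> tag.
# string=True is the current name for the older text=True argument.
desc = [
    i.strip() for i in soup.find_all(string=True)
    if i.strip() and i.parent.name != "b"
]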
print("\n".join(desc))

Output:

This job entails researching developing testing and deploying mechanical solutions.
Typical activities include:
Designing and developing thermal or mechanical tooling systems.
The ideal candidate should exhibit the following behavioral traits:
Work in a technically diverse environment-Adapt to changing requirements.
Verbal and written communication.
Project management.

CodePudding user response:

This should work:

import requests
from bs4 import BeautifulSoup

link = 'https://filebin.varnish-software.com/tgmdhg8dp37tycmk/doc.html'

desc = []
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
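# :-soup-contains replaces the deprecated :contains (needs soupsieve 2.1 or later)
for item in soup.select_one("div > b:-soup-contains('Job Description')").find_all_next():
    # Stop as soon as the next <b> tag is reached
    if item.name == "b":
        break
    desc.append(item.get_text(strip=True))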

print(' '.join(desc[:-1]))
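
Both approaches above depend on how the page nests its elements, which is not shown here. As a rough sketch, assuming the b tags may sit at any depth (for example wrapped in a p tag), one option is to walk the document in order with next_elements and collect text nodes until the next b tag appears. The sample HTML and the helper name text_until_next_b are hypothetical and only stand in for the real page; html.parser is used so the snippet does not need lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical sample markup; the structure of the real page may differ.
SAMPLE_HTML = """
<div>
  <b>Job Description</b>
  <p>This job entails researching mechanical solutions.</p>
  <ul><li>Project management.</li></ul>
  <p><b>This is an internship.</b></p>
  <b>Qualifications</b>
  <p>Bachelor's degree.</p>
</div>
"""


def text_until_next_b(soup, heading):
    """Return the text nodes that follow the <b> tag whose text equals
    `heading`, stopping at the next <b> tag at any depth."""
    start = soup.find(
        lambda tag: tag.name == "b" and tag.get_text(strip=True) == heading
    )
    if start is None:
        return []

    collected = []
    for node in start.next_elements:
        # The next <b> tag, however deeply nested, ends the section.
        if isinstance(node, Tag) and node.name == "b":
            break
        if isinstance(node, NavigableString):
            # Skip the heading's own text, which lives inside the start tag.
            if any(parent is start for parent in node.parents):
                continue
            text = node.strip()
            if text:
                collected.append(text)
    return collected


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(text_until_next_b(soup, "Job Description"))
```

Because next_elements yields each text node exactly once, this avoids the repeated text that can appear when get_text is called on both a parent tag and its children. Comments in the markup are also text nodes and would need an extra check if the page contains any.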