Unable to extract all the content after some "b" tag up to the next "b" tag

I'm trying to scrape all the content available after any b tag until the next b tag from a webpage. To help you visualize the whole picture, I'm attaching the relevant HTML elements.

The three b tags contain Job Description, This is an internship, and Qualifications. So when I pick any b tag, I want to scrape everything between that b tag and the next one.

I've tried like this:

import requests
from bs4 import BeautifulSoup
from itertools import takewhile

link = 'https://filebin.varnish-software.com/tgmdhg8dp37tycmk/doc.html'

res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
desc = [i.get_text(strip=True) for i in takewhile(lambda tag: tag.name!='b', soup.select("div > b:contains('Job Description') ~ *"))]
print(desc)

Output I'm getting:

['This job entails researching developing testing and deploying mechanical solutions.', '', 'Typical activities include:', '', 'Designing and developing thermal or mechanical tooling systems.', '', 'The ideal candidate should exhibit the following behavioral traits:', '', 'Work in a technically diverse environment-Adapt to changing requirements.Verbal and written communication.Project management.', '', 'This is an internship.', 'Qualifications']

Output I wish to get (without the content of the last two b tags):

['This job entails researching developing testing and deploying mechanical solutions.', '', 'Typical activities include:', '', 'Designing and developing thermal or mechanical tooling systems.', '', 'The ideal candidate should exhibit the following behavioral traits:', '', 'Work in a technically diverse environment-Adapt to changing requirements.Verbal and written communication.Project management.']

EDIT:

Here is another link for your test.

CodePudding user response:

Try this:

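import requests
from bs4 import BeautifulSoup

link = 'https://filebin.varnish-software.com/tgmdhg8dp37tycmk/doc.html'
soup = BeautifulSoup(requests.get(link).text, "lxml")
# Keep every non-empty text node whose direct parent is not a <b> tag.
# string=True is the current spelling of the older text=True argument.
desc = [
    i.strip() for i in soup.find_all(string=True)
    if i.strip() and i.parent.name != "b"
]
print("\n".join(desc))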

Output:

This job entails researching developing testing and deploying mechanical solutions.
Typical activities include:
Designing and developing thermal or mechanical tooling systems.
The ideal candidate should exhibit the following behavioral traits:
Work in a technically diverse environment-Adapt to changing requirements.
Verbal and written communication.
Project management.
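
Note that this filter runs over every text node in the document, so if the page has text after later b tags, that text would be kept as well. A minimal sketch of a variant that groups each text node under the most recent b heading follows. The markup in SAMPLE_HTML and the helper name group_text_by_b are assumptions made for illustration, not taken from the linked page, and the built-in html.parser is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the linked page; the real page may differ.
SAMPLE_HTML = """
<div>
  <b>Job Description</b>
  <p>This job entails researching.</p>
  <ul><li>Project management.</li></ul>
  <p><b>This is an internship.</b></p>
  <b>Qualifications</b>
  <p>Enrolled in a degree program.</p>
</div>
"""


def group_text_by_b(html):
    """Map the text of each <b> tag to the text nodes that follow it."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    current = None  # text of the most recent <b> seen so far
    for text_node in soup.find_all(string=True):
        text = text_node.strip()
        if not text:
            continue  # skip whitespace-only nodes
        if text_node.parent.name == "b":
            # A new heading starts a new, initially empty section.
            current = text
            sections[current] = []
        elif current is not None:
            sections[current].append(text)
    return sections


sections = group_text_by_b(SAMPLE_HTML)
print(sections["Job Description"])
```

Text that appears before the first b tag is ignored in this sketch, and a b tag with nothing after it simply maps to an empty list.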

CodePudding user response:

This is supposed to work:

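import requests
from bs4 import BeautifulSoup

link = 'https://filebin.varnish-software.com/tgmdhg8dp37tycmk/doc.html'

desc = []
res = requests.get(link)
soup = BeautifulSoup(res.text, "lxml")
# Note: newer soupsieve releases prefer :-soup-contains() over :contains().
start = soup.select_one("div > b:contains('Job Description')")
# Walk every tag after the starting <b>, nested ones included, until the next <b>.
for item in start.find_all_next():
    if item.name == "b":
        break
    desc.append(item.get_text(strip=True))

# Drop the last item collected just before the loop hit the next <b>.
print(' '.join(desc[:-1]))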
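
To reuse this with any b tag, the same walk can be done over text nodes rather than tags. Walking tags can repeat text when elements are nested (a list and its items are both visited), whereas each text node is visited only once, so no final slice is needed. This is a minimal sketch, not a tested solution for the linked page: SAMPLE_HTML is hypothetical markup, text_until_next_b is a made-up helper name, and the built-in html.parser replaces lxml to keep the example dependency-light.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical markup standing in for the linked page; the real page may differ.
SAMPLE_HTML = """
<div>
  <b>Job Description</b>
  <p>This job entails researching.</p>
  <br>
  <ul><li>Verbal and written communication.</li><li>Project management.</li></ul>
  <p><b>This is an internship.</b></p>
  <b>Qualifications</b>
  <p>Enrolled in a degree program.</p>
</div>
"""


def text_until_next_b(soup, heading):
    """Collect stripped text nodes between the <b> named `heading` and the next <b>."""
    start = soup.find(
        lambda tag: tag.name == "b" and tag.get_text(strip=True) == heading
    )
    if start is None:
        return []  # heading not found

    chunks = []
    # next_elements yields tags and strings in document order,
    # starting with the contents of `start` itself.
    for node in start.next_elements:
        if isinstance(node, Tag):
            if node.name == "b":
                break  # reached the next <b>, at any nesting depth
            continue
        if isinstance(node, NavigableString):
            if any(parent is start for parent in node.parents):
                continue  # skip the heading's own text
            text = node.strip()
            if text:
                chunks.append(text)
    return chunks


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(text_until_next_b(soup, "Job Description"))
```

Calling text_until_next_b(soup, "Qualifications") on the same sample returns the text from that heading to the end of the document, since no later b tag exists to stop the walk.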