I'm trying to scrape all the content available after any b tag, up to the next b tag, from a webpage. To give you the whole picture, I'm attaching the relevant HTML elements.
The page has three b tags, whose contents are Job Description, This is an internship and Qualifications. So, when I pick any b tag, I want to scrape everything between that specific b tag and the next b tag.
I've tried like this:
import requests
from bs4 import BeautifulSoup
from itertools import takewhile
link = 'https://filebin.varnish-software.com/tgmdhg8dp37tycmk/doc.html'
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
desc = [i.get_text(strip=True) for i in takewhile(lambda tag: tag.name!='b', soup.select("div > b:contains('Job Description') ~ *"))]
print(desc)
Output I'm getting:
['This job entails researching developing testing and deploying mechanical solutions.', '', 'Typical activities include:', '', 'Designing and developing thermal or mechanical tooling systems.', '', 'The ideal candidate should exhibit the following behavioral traits:', '', 'Work in a technically diverse environment-Adapt to changing requirements.Verbal and written communication.Project management.', '', 'This is an internship.', 'Qualifications']
Output I wish to get (without the content of the last two b tags):
['This job entails researching developing testing and deploying mechanical solutions.', '', 'Typical activities include:', '', 'Designing and developing thermal or mechanical tooling systems.', '', 'The ideal candidate should exhibit the following behavioral traits:', '', 'Work in a technically diverse environment-Adapt to changing requirements.Verbal and written communication.Project management.']
EDIT:
Here is another link for your test.
CodePudding user response:
Try this:
import requests
from bs4 import BeautifulSoup
link = 'https://filebin.varnish-software.com/tgmdhg8dp37tycmk/doc.html'
soup = BeautifulSoup(requests.get(link).text, "lxml")
desc = [
    i.strip() for i in soup.find_all(text=True)
    if i.strip() and i.parent.name != "b"
]
print("\n".join(desc))
Output:
This job entails researching developing testing and deploying mechanical solutions.
Typical activities include:
Designing and developing thermal or mechanical tooling systems.
The ideal candidate should exhibit the following behavioral traits:
Work in a technically diverse environment-Adapt to changing requirements.
Verbal and written communication.
Project management.
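To make the behaviour of this filter easier to see, here is a small sketch that runs it on an inline snippet instead of the live page. The markup in HTML is made up for illustration and is only an assumption about how such a page might be laid out; string=True is the current spelling of text=True, and "html.parser" is used so that lxml is not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not the real page: two b headings,
# the second one wrapped in a p tag.
HTML = """
<div>
  <b>Job Description</b>
  <p>This job entails research.</p>
  <p><b>Qualifications</b></p>
  <p>A degree in engineering.</p>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Keep every non-empty text node whose direct parent is not a b tag.
non_bold = [
    s.strip()
    for s in soup.find_all(string=True)
    if s.strip() and s.parent.name != "b"
]
print(non_bold)
```

Note that the filter looks at the whole document and does not stop at the next b tag, so in this snippet the text following Qualifications is kept as well. That is fine when only the first section has body text, but not when each section has to be extracted separately.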
CodePudding user response:
This is supposed to work:
import requests
from bs4 import BeautifulSoup
link = 'https://filebin.varnish-software.com/tgmdhg8dp37tycmk/doc.html'
desc = []
res = requests.get(link)
soup = BeautifulSoup(res.text,"lxml")
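# Walk every tag that follows the "Job Description" heading and stop at the next b tag.
for item in soup.select_one("div > b:contains('Job Description')").find_all_next():
    if item.name == "b":
        break
    desc.append(item.get_text(strip=True))
# desc[:-1] leaves out the final collected element.
print(' '.join(desc[:-1]))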
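If the next b tag may be nested inside another element (for example a p tag), a variant that walks text nodes rather than tags might be easier to reason about. The following is only a sketch under assumptions: the HTML snippet is invented for illustration, and text_after_heading is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not the real page: the later b tags sit inside p tags.
HTML = """
<div>
  <b>Job Description</b>
  <p>This job entails research.</p>
  <p>Typical activities include:</p>
  <ul><li>Designing tooling systems.</li></ul>
  <p><b>This is an internship.</b></p>
  <p><b>Qualifications</b></p>
  <p>A degree in engineering.</p>
</div>
"""


def text_after_heading(soup, heading):
    """Collect text after the b tag named `heading`, up to the next b tag."""
    start = soup.find("b", string=heading)
    if start is None:
        return []
    collected = []
    for node in start.find_all_next(string=True):
        owner = node.find_parent("b")
        if owner is start:
            continue  # the heading's own text
        if owner is not None:
            break  # reached text that belongs to the next b tag
        text = node.strip()
        if text:
            collected.append(text)
    return collected


soup = BeautifulSoup(HTML, "html.parser")
print(text_after_heading(soup, "Job Description"))
print(text_after_heading(soup, "Qualifications"))
```

Because the loop checks each text node's enclosing b tag, it stops at the next heading however deeply that heading is nested, and no trailing element has to be sliced off afterwards.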