I'm new to web scraping and am trying to scrape Indeed for practice. I've run into a problem: I want to scrape only the job title, but my code picks up other spans too, including the "new" label. Below is my code:
from bs4 import BeautifulSoup as bs
import requests

def extract(page):
    url = f'https://ph.indeed.com/jobs?q=python developer&l=Manila&start={page}'
    r = requests.get(url)
    soup = bs(r.content, 'lxml')
    return soup

def transform(soup):
    results = soup.find_all('div', class_='slider_container')
    for item in results:
        job_title = item.find('span').text
        print(job_title)

c = extract(0)
transform(c)
When I run the code, the result is:
new
new
Python Developer
Python Developer
new
Jr. Python Developer
Python Developer
Python Developer
Python Developer
new
new
Junior Web Developer (Web Scraping)
new
Junior Web Developer Fullstack
Back End Developer (Work-from-Home)
The expected output should contain only the job titles, without 'new':
Python Developer
Python Developer
Jr. Python Developer
Python Developer
Python Developer
Python Developer
Junior Web Developer (Web Scraping)
Junior Web Developer Fullstack
Back End Developer (Work-from-Home)
CodePudding user response:
The problem is that not every <span> contains a job title, so you'll have to check whether a title attribute exists on the <span> tag.
Instead of:
job_title=item.find('span').text
use:
job_title = item.find(lambda tag: tag.name == "span" and "title" in tag.attrs).text
Or using a CSS selector:
job_title = item.select_one("span[title]").text
Output:
Back-End Developer | Python/Django
Python Developer
Python Developer
Jr. Web Developer
PYTHON DEVELOPER
Front-End Developer - Consultant - Digital Customer - Philip...
...
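To make the attribute check concrete, here is a minimal offline sketch. The SAMPLE_HTML markup and the extract_titles helper are hypothetical, written only to imitate the structure described in the question; the real Indeed page may differ. It also guards against cards with no titled span, since select_one returns None in that case and calling .text on None would raise an AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the cards described in the question;
# the real page structure is an assumption and may differ.
SAMPLE_HTML = """
<div class="slider_container">
  <span class="label">new</span>
  <span title="Python Developer">Python Developer</span>
</div>
<div class="slider_container">
  <span title="Jr. Python Developer">Jr. Python Developer</span>
</div>
<div class="slider_container">
  <span class="label">new</span>
</div>
"""

def extract_titles(html):
    """Return the text of the first span[title] in each card.

    Cards without a titled span are skipped instead of raising.
    """
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for item in soup.find_all("div", class_="slider_container"):
        span = item.select_one("span[title]")  # None when nothing matches
        if span is not None:
            titles.append(span.get_text(strip=True))
    return titles

titles = extract_titles(SAMPLE_HTML)
print(titles)  # ['Python Developer', 'Jr. Python Developer']
```

The third sample card has only a "new" span, so it contributes nothing to the result.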
CodePudding user response:
You can use an if condition to exclude the word 'new'.
Try this one:
from bs4 import BeautifulSoup as bs
import requests

def extract(page):
    url = f'https://ph.indeed.com/jobs?q=python developer&l=Manila&start={page}'
    r = requests.get(url)
    soup = bs(r.content, 'lxml')
    return soup

def transform(soup):
    results = soup.find_all('div', class_='slider_container')
    for item in results:
        job_title = item.find('span').text
        if job_title != 'new':  # <<<----- Edited line here!
            print(job_title)

c = extract(0)
transform(c)
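One caveat with this filter: find('span') returns only the first <span> in each card, so when that first span is the "new" label, the card is skipped entirely and its job title is never printed. A possible variant, sketched below, walks all spans in the card and keeps the first one whose text is not the label. The SAMPLE_HTML markup and the first_non_badge_text helper are hypothetical and only imitate the structure from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the cards described in the question.
SAMPLE_HTML = """
<div class="slider_container">
  <span>new</span>
  <span>Python Developer</span>
</div>
<div class="slider_container">
  <span>Jr. Python Developer</span>
</div>
"""

def first_non_badge_text(item, badge="new"):
    """Return the text of the first span that is not the badge, or None."""
    for span in item.find_all("span"):
        text = span.get_text(strip=True)
        if text and text != badge:
            return text
    return None

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
cards = soup.find_all("div", class_="slider_container")

# Filter on the first span only: the first card is lost.
naive = [c.find("span").text for c in cards if c.find("span").text != "new"]

# Check every span in the card: both titles survive.
robust = [t for t in (first_non_badge_text(c) for c in cards) if t is not None]

print(naive)   # ['Jr. Python Developer']
print(robust)  # ['Python Developer', 'Jr. Python Developer']
```

This still relies on the label text being exactly 'new', so selecting on an attribute of the title span is likely the more robust choice when one is available.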