Scrape span title

Time:10-16

I'm new to web scraping and I'm trying to scrape Indeed for practice. I've run into a problem: I want to scrape only the job title, but my code also picks up other spans, including the "new" label. Below is my code:

from bs4 import BeautifulSoup as bs
import requests

def extract(page):

  url = f'https://ph.indeed.com/jobs?q=python developer&l=Manila&start={page}'
  r = requests.get(url)
  soup = bs(r.content,'lxml')
  return soup

def transform(soup):
  results = soup.find_all('div',class_='slider_container')
  for item in results:
    job_title=item.find('span').text
    print(job_title)
c = extract(0)
transform(c)

When I run the code the result is:

new
new
Python Developer
Python Developer
new
Jr. Python Developer
Python Developer
Python Developer
Python Developer
new
new
Junior Web Developer (Web Scraping)
new
Junior Web Developer Fullstack
Back End Developer (Work-from-Home)

The expected output is the job titles only, without 'new':

Python Developer
Python Developer
Jr. Python Developer
Python Developer
Python Developer
Python Developer
Junior Web Developer (Web Scraping)
Junior Web Developer Fullstack
Back End Developer (Work-from-Home)

CodePudding user response:

The problem is that not every <span> contains a job title, so you need to check whether the <span> tag has a title attribute.

Instead of:

job_title=item.find('span').text

use:

job_title = item.find(lambda tag: tag.name == "span" and "title" in tag.attrs).text

Or using a CSS selector:

job_title = item.select_one("span[title]").text

Output:

Back-End Developer | Python/Django
Python Developer
Python Developer
Jr. Web Developer
PYTHON DEVELOPER
Front-End Developer - Consultant - Digital Customer - Philip...
...
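
To see why the attribute check works, here is a minimal sketch that runs offline against a made-up HTML snippet. The markup, the SAMPLE_HTML constant and the job_titles helper are assumptions for illustration, loosely modelled on the question; they are not Indeed's actual page structure. It also guards against select_one returning None when a card has no titled span:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one card with a "new" label before the title,
# one card with only the title span.
SAMPLE_HTML = """
<div class="slider_container">
  <span class="label">new</span>
  <span title="Python Developer">Python Developer</span>
</div>
<div class="slider_container">
  <span title="Jr. Python Developer">Jr. Python Developer</span>
</div>
"""


def job_titles(html):
    """Return the text of the first <span title=...> in each card."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for item in soup.find_all("div", class_="slider_container"):
        span = item.select_one("span[title]")  # only spans with a title attribute
        if span is not None:  # skip cards without a titled span
            titles.append(span.get_text(strip=True))
    return titles


print(job_titles(SAMPLE_HTML))
```

Both cards yield their title here, including the one that starts with the "new" label.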

CodePudding user response:

You can use an if condition to skip any result whose first span is 'new'.

Try this one:

from bs4 import BeautifulSoup as bs
import requests

def extract(page):

  url = f'https://ph.indeed.com/jobs?q=python developer&l=Manila&start={page}'
  r = requests.get(url)
  soup = bs(r.content,'lxml')
  return soup

def transform(soup):
  results = soup.find_all('div',class_='slider_container')
  for item in results:
    job_title=item.find('span').text
    if job_title != 'new':  # <<<----- Edited line here!
      print(job_title)
c = extract(0)
transform(c)
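
One thing to keep in mind with the text filter: item.find('span') returns only the first span in each result, so when that span is the 'new' label, the whole posting is skipped and its title is never printed. The sketch below illustrates this on made-up markup. SAMPLE_HTML and first_span_titles are hypothetical names for this example, not part of the code above or of Indeed's real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first card starts with a "new" label,
# the second card starts with the title itself.
SAMPLE_HTML = """
<div class="slider_container">
  <span class="label">new</span>
  <span title="Python Developer">Python Developer</span>
</div>
<div class="slider_container">
  <span title="Jr. Python Developer">Jr. Python Developer</span>
</div>
"""


def first_span_titles(html, skip="new"):
    """Collect the first span's text per card, skipping cards where it equals `skip`."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for item in soup.find_all("div", class_="slider_container"):
        span = item.find("span")  # first span only
        if span is None:
            continue
        text = span.get_text(strip=True)
        if text != skip:  # a card whose first span is "new" is dropped entirely
            titles.append(text)
    return titles


print(first_span_titles(SAMPLE_HTML))
```

Only the second card's title comes back. That matches the expected output in the question, but if you need the title of every posting, selecting the span by its title attribute is the safer choice.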