How to exclude unwanted tags when using BeautifulSoup in Python

Time:09-26

I'm practicing web scraping in Python on indeed.com with BeautifulSoup.

When extracting the job location from the div with class "companyLocation", I want only the location string that comes right after the opening 'div class="companyLocation"' tag (in the HTML below, "United States").

But in some cases there are extra 'a aria-label' or 'span' elements that contain unwanted strings such as " 1 location".

I couldn't figure out how to get rid of these, so I'm asking for your advice.

<div class="companyLocation">United States
<span><a aria-label="Same Python Developer job in 1 other location" class="more_loc" href="/addlLoc/redirect?tk=1fgg7b6pa306m001&amp;jk=d724dab9a2d2af2c&amp;dest=/jobs?q=python&limit=50&grpKey=kAO5nvwVmAPOkxWgAwHyBwN0Y2w%3D" rel="nofollow">
 1 location</a></span>

<span class="remote-bullet">•</span><span>Remote</span></div>, United States 1 location•Remote

Here's my Python code for reference. The problem arises in the 'if a.string is None:' case.

The HTML above was printed by this line of the code: print(f"{a}, {a.text}")

import requests
from bs4 import BeautifulSoup

url = "https://www.indeed.com/jobs?q=python&limit=50"

extracts_url = requests.get(url)
extracts_soup = BeautifulSoup(extracts_url.text, 'html.parser')
soup_jobs = extracts_soup.find_all("div", {"class": "job_seen_beacon"})

for soup_job in soup_jobs:
    for a in soup_job.select("div.companyLocation"):
        if a.string is not None:
            pass

        #problem(below)
        if a.string is None:
            print(f"{a}, {a.text}")

CodePudding user response:

You've mixed up the if statements. Try the following instead:

import requests
from bs4 import BeautifulSoup

url = "https://www.indeed.com/jobs?q=python&limit=50"

extracts_url = requests.get(url)
extracts_soup = BeautifulSoup(extracts_url.text, 'html.parser')
soup_jobs = extracts_soup.find_all("div", {"class": "job_seen_beacon"})

for soup_job in soup_jobs:
    for a in soup_job.select("div.companyLocation"):
        if a.string is not None:
            print(f"{a}, {a.text}")

Output:

<div class="companyLocation">United States</div>, United States
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation">Boulder, CO</div>, Boulder, CO
<div class="companyLocation">Houston, TX</div>, Houston, TX
<div class="companyLocation">Allen, TX</div>, Allen, TX
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation">New York, NY</div>, New York, NY
<div class="companyLocation">New York, NY</div>, New York, NY
<div class="companyLocation">New York State</div>, New York State
<div class="companyLocation">Austin, TX</div>, Austin, TX
<div class="companyLocation">Research Triangle Park, NC</div>, Research Triangle Park, NC
<div class="companyLocation">New York, NY</div>, New York, NY
<div class="companyLocation">Cary, NC</div>, Cary, NC
<div class="companyLocation">Raleigh, NC</div>, Raleigh, NC
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation">Houston, TX</div>, Houston, TX
<div class="companyLocation">Bellevue, WA</div>, Bellevue, WA
<div class="companyLocation">Houston, TX</div>, Houston, TX

Now it works just fine.
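
For what it's worth, here is a small sketch of why a.string is None for the div in the question. It assumes three simplified divs modelled on the HTML and output in this thread (the href="#" is a placeholder, not the real link). As far as I know, .string returns the text only when a tag has a single child (or a single chain of single children); with several children it returns None, so the check above skips those divs rather than cleaning them.

```python
from bs4 import BeautifulSoup

# Simplified stand-ins for the divs in this thread; href="#" is a placeholder.
html = """
<div class="companyLocation">Boulder, CO</div>
<div class="companyLocation"><span>Remote</span></div>
<div class="companyLocation">United States<span><a class="more_loc" href="#">1 location</a></span><span>Remote</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

# .string is the text for single-child tags and None when there are several children.
strings = [div.string for div in soup.select("div.companyLocation")]

for div, s in zip(soup.select("div.companyLocation"), strings):
    print(repr(s), "<-", div.get_text(" ", strip=True))
```

So the third div, which is the case asked about, never gets printed by the 'is not None' branch.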

CodePudding user response:

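Does this work? It prints the first non-empty text node that is a direct child of the div:

    #problem(below)
    if a.string is None:
        data = ''
        for child in a.children:
            # text nodes have no tag name; skip whitespace-only ones
            if not child.name and child.strip():
                data = child.strip()
                break
        print(data)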
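
A self-contained sketch of the same idea, so it can be tried without hitting indeed.com. The helper name direct_location is made up for illustration, the sample HTML is copied from the question with the long href replaced by "#", and the fallback to get_text() for divs that only contain a span is my own assumption about what you would want there.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def direct_location(tag):
    """Return the first non-blank text node that is a direct child of tag.

    Falls back to the tag's full text when it has no direct text
    (for example <div><span>Remote</span></div>).
    """
    for child in tag.children:
        # NavigableString covers plain text nodes; nested tags are skipped.
        if isinstance(child, NavigableString) and child.strip():
            return child.strip()
    return tag.get_text(strip=True)


# Sample from the question; href shortened to "#" as a placeholder.
sample = """<div class="companyLocation">United States
<span><a aria-label="Same Python Developer job in 1 other location" class="more_loc" href="#" rel="nofollow">
 1 location</a></span>

<span class="remote-bullet">•</span><span>Remote</span></div>"""

remote_only = '<div class="companyLocation"><span>Remote</span></div>'

location = direct_location(BeautifulSoup(sample, "html.parser").div)
remote = direct_location(BeautifulSoup(remote_only, "html.parser").div)

print(location)
print(remote)
```

In the original loop this would replace both if branches: call direct_location(a) for every a from soup_job.select("div.companyLocation").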