I'm practicing web scraping in Python on indeed.com with BeautifulSoup.
When extracting the job location from [div class companyLocation], I want only the location string that comes right after 'div class="companyLocation"' (in the HTML below, "United States").
In some cases, though, there are extra 'a aria-label' or 'span' elements containing unwanted strings such as "1 location".
I couldn't figure out how to get rid of these, so I'm asking for your advice.
<div class="companyLocation">United States
<span><a aria-label="Same Python Developer job in 1 other location" class="more_loc" href="/addlLoc/redirect?tk=1fgg7b6pa306m001&jk=d724dab9a2d2af2c&dest=/jobs?q=python&limit=50&grpKey=kAO5nvwVmAPOkxWgAwHyBwN0Y2w%3D" rel="nofollow">
1 location</a></span>
<span class="remote-bullet">•</span><span>Remote</span></div>, United States 1 location•Remote
Here is my Python code for reference. The problem arises in the 'if a.string is None:' case.
The div and span HTML shown above was printed with this code: print(f"{a}, {a.text}")
import requests
from bs4 import BeautifulSoup

url = "https://www.indeed.com/jobs?q=python&limit=50"
extracts_url = requests.get(url)
extracts_soup = BeautifulSoup(extracts_url.text, 'html.parser')
soup_jobs = extracts_soup.find_all("div", {"class": "job_seen_beacon"})
for soup_job in soup_jobs:
    for a in soup_job.select("div.companyLocation"):
        if a.string is not None:
            pass
        #problem(below)
        if a.string is None:
            print(f"{a}, {a.text}")
CodePudding user response:
You've mixed up the if statements. Try the following instead:
import requests
from bs4 import BeautifulSoup

url = "https://www.indeed.com/jobs?q=python&limit=50"
extracts_url = requests.get(url)
extracts_soup = BeautifulSoup(extracts_url.text, 'html.parser')
soup_jobs = extracts_soup.find_all("div", {"class": "job_seen_beacon"})
for soup_job in soup_jobs:
    for a in soup_job.select("div.companyLocation"):
        if a.string is not None:
            print(f"{a}, {a.text}")
Output:
<div class="companyLocation">United States</div>, United States
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation">Boulder, CO</div>, Boulder, CO
<div class="companyLocation">Houston, TX</div>, Houston, TX
<div class="companyLocation">Allen, TX</div>, Allen, TX
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation">New York, NY</div>, New York, NY
<div class="companyLocation">New York, NY</div>, New York, NY
<div class="companyLocation">New York State</div>, New York State
<div class="companyLocation">Austin, TX</div>, Austin, TX
<div class="companyLocation">Research Triangle Park, NC</div>, Research Triangle Park, NC
<div class="companyLocation">New York, NY</div>, New York, NY
<div class="companyLocation">Cary, NC</div>, Cary, NC
<div class="companyLocation">Raleigh, NC</div>, Raleigh, NC
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation"><span>Remote</span></div>, Remote
<div class="companyLocation">Houston, TX</div>, Houston, TX
<div class="companyLocation">Bellevue, WA</div>, Bellevue, WA
<div class="companyLocation">Houston, TX</div>, Houston, TX
Now it works just fine.
CodePudding user response:
Does this work?
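#problem(below)
if a.string is None:
    data = ''
    # look only at text nodes that are direct children of the div
    for child in a.children:
        if not child.name and child.strip():
            data = child.strip()
            break  # the first non-blank text node is the location
    print(data)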
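To try the idea without hitting the site, here is a minimal sketch that runs on a copy of the HTML from the question (the href is shortened to "#"). The helper name direct_location is made up for illustration; it just collects the text nodes sitting directly inside the div and returns the first non-blank one, so text nested in 'span' or 'a' is ignored. This is one possible way to do it, not necessarily the only one.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the snippet in the question (href shortened).
html = """<div class="companyLocation">United States
<span><a aria-label="Same Python Developer job in 1 other location" class="more_loc" href="#" rel="nofollow">
1 location</a></span>
<span class="remote-bullet">•</span><span>Remote</span></div>"""


def direct_location(div):
    """Return the first non-blank text node that is a direct child of div.

    Hypothetical helper: text inside nested tags (span, a) is skipped
    because recursive=False restricts the search to direct children.
    """
    for text in div.find_all(string=True, recursive=False):
        if text.strip():
            return text.strip()
    return None  # e.g. <div class="companyLocation"><span>Remote</span></div>


soup = BeautifulSoup(html, "html.parser")
div = soup.select_one("div.companyLocation")
location = direct_location(div)
print(location)
```

For a div that only contains '<span>Remote</span>' the helper returns None, so you may want to fall back to a.text in that case.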