I am practicing with BeautifulSoup by scraping data from Indeed. I am a relatively novice Python coder and new to BeautifulSoup, but I've been able to figure out most of what I'm trying to do, except for grabbing the href of each job posting in the Indeed search results. Most of the information is nested within this div, as shown in the attached image:
The href that I need is right above it, in the a tag (for the first posting; it is in a similar location for the rest of the postings). The job links all seem to have a similar format (Indeed URL + /pagead/ + unique identifiers). So far I have been able to grab the first of these hrefs by doing:
link1 = soup.find('a',{'class':'tapItem'}).get('href')
indeed_link='https://indeed.com'
job_full_link = indeed_link + link1
which returns:
https://indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0BYwoYS5IKUNHtA0a2VJhnZaPA0uEqIlEtc2XBlIiwK2z_X_68BR8FDAa4lu8N0xeCPwzwEnA8fXiK4iQSEmPwTPepfI6vD2vAIjZkkxpjBBMQUv338KUlip1EOk09_cn2LwmJdZfFHW0-AI7SZQhu1kIQsWTuRTOsU1vuAYvarCELllpMjt_GHp_65BONysimbVWU32exjeilFXm_q51osn1zTWwhznG16bEYsjNkVT231ngYVuvoC3RBW5qn2IB0yR0T3ppMCF4nVaIMUg2yvjXVLsbdbNYgj_ckFk4jrStGLrXIoTrozdnqm3fxToPHdshPAVD7771cWJDflltxdMjmVEdP2f74y2Gc1IAJBaNtq-GweslVoetCVqneDAWtDx4fDODfUv44tpOPE3rZycEp6SLUjAjcYpUW9qG5AJjaUOIU6MwVxZe6Xi1nECNwvoZrEpYXkCBvC3KbMg4DdMhoni660wPq8oW4DXKuz0ffj50lr_cNu&p=0&fvj=1&vjs=3
For starters, I'm not sure that is the best way to do it. There are other hrefs within that 'tapItem' class, so I suspect my code only appears to work because that happens to be the first href. I'm trying to write a loop that grabs all of the job links and appends them to a list, which is where I'm stuck; I'm not sure how to set that up. Any ideas/pointers?
This is my first post on StackOverflow so let me know if I need to add more context! Thanks in advance.
CodePudding user response:
Note
find() / select_one(): returns only the first occurrence of your selection
find_all() / select(): returns a result set of all occurrences it could find with your selection
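To make the difference concrete, here is a minimal sketch run against a small hand-written HTML snippet. The markup and the href values are made up for illustration and are not Indeed's real page structure; only the 'tapItem' class name is taken from the question.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; not Indeed's real HTML.
html = """
<a class="tapItem" href="/pagead/clk?ad=1">Job 1</a>
<a class="tapItem" href="/pagead/clk?ad=2">Job 2</a>
<a class="other" href="/about">About</a>
"""

soup = BeautifulSoup(html, "html.parser")

# find() / select_one() stop at the first matching element
first = soup.find("a", {"class": "tapItem"})
first_css = soup.select_one("a.tapItem")

# find_all() / select() collect every matching element
every = soup.find_all("a", {"class": "tapItem"})
every_css = soup.select("a.tapItem")

first_href = first.get("href")
all_hrefs = [a.get("href") for a in every]
css_hrefs = [a.get("href") for a in every_css]

print(first_href)
print(all_hrefs)
```

Both styles select the same elements here, so which one to use is mostly a matter of taste.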
How to fix?
Use find_all() / select() to generate a result set that you can iterate over later
Example
import requests
from bs4 import BeautifulSoup
html = requests.get('https://de.indeed.com/Jobs?q=Data Engeneering&from=sug&vjk=7fb07edbe78d1d3a').text
soup = BeautifulSoup(html, 'lxml')
indeed_link = 'https://indeed.com'
# Build one absolute URL per matching <a class="tapItem"> element
links = [indeed_link + a['href'] for a in soup.select('a.tapItem')]
for link in links:
    print(link)  # do something with each link
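If you would rather not glue the strings together by hand, one possible variation is to let urljoin from the standard library build the absolute URLs. This is only a sketch: the helper name extract_job_links and the sample markup are hypothetical, and the selector assumes the job anchors still carry the 'tapItem' class.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://indeed.com"


def extract_job_links(html, base_url=BASE_URL):
    """Return absolute URLs for every <a class="tapItem"> that has an href.

    Hypothetical helper for illustration; the class name comes from the
    question and may not match the live site.
    """
    soup = BeautifulSoup(html, "html.parser")
    # The [href] part of the selector skips anchors without an href attribute
    return [urljoin(base_url, a["href"]) for a in soup.select("a.tapItem[href]")]


# Made-up markup for illustration only; not Indeed's real HTML.
sample_html = """
<a class="tapItem" href="/pagead/clk?ad=1">Job 1</a>
<a class="tapItem" href="/pagead/clk?ad=2">Job 2</a>
<a class="tapItem">No link</a>
"""

job_links = extract_job_links(sample_html)
for link in job_links:
    print(link)
```

Filtering on [href] in the selector avoids a KeyError when an anchor has no href, and urljoin takes care of the slash between the base URL and the path.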