I tried the following code to scrape the links in the links list, but I get None as output:
links = ['https://www.glassdoor.com/partner/jobListing.htm?pos=101&ao=1136043&s=58&guid=0000018209b985b3814e13e6abec3f6b&src=GD_JOB_AD&t=SR&vt=w&ea=1&cs=1_968a1afd&cb=1658020530267&jobListingId=1007823714104&jrtk=3-0-1g84rj1gokf0t801-1g84rj1hdghre800-21320f35b2a9f6f4-', 'https://www.glassdoor.com/partner/jobListing.htm?pos=102&ao=1136043&s=58&guid=0000018209b985b3814e13e6abec3f6b&src=GD_JOB_AD&t=SR&vt=w&ea=1&cs=1_11e4da95&cb=1658020530267&jobListingId=1007830003866&jrtk=3-0-1g84rj1gokf0t801-1g84rj1hdghre800-6ad629ee4ebc1885-', 'https://www.glassdoor.com/partner/jobListing.htm?pos=103&ao=1136043&s=58&guid=0000018209b985b3814e13e6abec3f6b&src=GD_JOB_AD&t=SR&vt=w&cs=1_0ae3fe0c&cb=1658020530267&jobListingId=1008006371431&jrtk=3-0-1g84rj1gokf0t801-1g84rj1hdghre800-f24a3ad703626f08-']
for link in links:
    page = requests.get(link)
    soup = BeautifulSoup(page.text, 'html.parser')
    div = soup.find(id="JobDescriptionContainer")
    print(div)
The page html is something like this:
<div id="Job view">
  <div>
    <div>
      <div>
        <span>
          <div>
            <div>
              <header>
              <div>
                <div>
                  <div id="JobDescriptionContainer">
                    <div>
                      <div>
                        <p>... text</p>
                        <p>... text</p>
                        <p>... text</p>
                        <h3>Responsibilities</h3>
                        <ul>
                          <li>....</li>
                          <li>....</li>
                          <li>....</li>
                        </ul>
                        <h3>Qualifications</h3>
                        <ul>
                          <li>....</li>
                          <li>....</li>
                          <li>....</li>
                        </ul>
I want to get all the info from each link to create a data frame with each link's information (the real links list contains 900 links). The text I want to extract from each link is the text below the div with id 'JobDescriptionContainer'. I would also like to put the text below 'Responsibilities' and 'Qualifications' in separate data frame columns. Can someone give me a hand with this?
CodePudding user response:
You must add a User-Agent to the headers.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626 Safari/537.36'}
links = ['https://www.glassdoor.com/partner/jobListing.htm?pos=101&ao=1136043&s=58&guid=0000018209b985b3814e13e6abec3f6b&src=GD_JOB_AD&t=SR&vt=w&ea=1&cs=1_968a1afd&cb=1658020530267&jobListingId=1007823714104&jrtk=3-0-1g84rj1gokf0t801-1g84rj1hdghre800-21320f35b2a9f6f4-', 'https://www.glassdoor.com/partner/jobListing.htm?pos=102&ao=1136043&s=58&guid=0000018209b985b3814e13e6abec3f6b&src=GD_JOB_AD&t=SR&vt=w&ea=1&cs=1_11e4da95&cb=1658020530267&jobListingId=1007830003866&jrtk=3-0-1g84rj1gokf0t801-1g84rj1hdghre800-6ad629ee4ebc1885-', 'https://www.glassdoor.com/partner/jobListing.htm?pos=103&ao=1136043&s=58&guid=0000018209b985b3814e13e6abec3f6b&src=GD_JOB_AD&t=SR&vt=w&cs=1_0ae3fe0c&cb=1658020530267&jobListingId=1008006371431&jrtk=3-0-1g84rj1gokf0t801-1g84rj1hdghre800-f24a3ad703626f08-']
for link in links:
    page = requests.get(link, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    div = soup.find(id="JobDescriptionContainer")
    print(div)
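To also build the data frame asked about, here is a minimal sketch. It assumes the description container holds <h3>Responsibilities</h3> and <h3>Qualifications</h3> headings each followed by a <ul>, as in the HTML shown in the question; real listings may vary, so the section_text helper (a name invented here for illustration) returns an empty string when a heading is missing.

import requests
import pandas as pd
from bs4 import BeautifulSoup

def section_text(container, heading):
    # Find an <h3> whose text matches the heading and join the <li> items
    # of the <ul> that follows it; return '' if the section is not present.
    h3 = container.find('h3', string=lambda s: s and heading.lower() in s.lower())
    if h3 is None:
        return ''
    ul = h3.find_next('ul')
    if ul is None:
        return ''
    return '\n'.join(li.get_text(strip=True) for li in ul.find_all('li'))

rows = []
for link in links:
    page = requests.get(link, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    div = soup.find(id="JobDescriptionContainer")
    if div is None:  # skip listings that did not load or have a different layout
        continue
    rows.append({
        'link': link,
        'description': div.get_text(separator='\n', strip=True),
        'responsibilities': section_text(div, 'Responsibilities'),
        'qualifications': section_text(div, 'Qualifications'),
    })

df = pd.DataFrame(rows)
print(df.head())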
CodePudding user response:
You need to add a header to make the page think that your program is a browser.
import requests
from bs4 import BeautifulSoup
user_agent_list = [
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]
user_agent = user_agent_list[0]
#Set the headers
headers = {'User-Agent': user_agent}
links = ['https://www.glassdoor.com/partner/jobListing.htm?pos=101&ao=1136043&s=58&guid=0000018209b985b3814e13e6abec3f6b&src=GD_JOB_AD&t=SR&vt=w&ea=1&cs=1_968a1afd&cb=1658020530267&jobListingId=1007823714104&jrtk=3-0-1g84rj1gokf0t801-1g84rj1hdghre800-21320f35b2a9f6f4-',
'https://www.glassdoor.com/partner/jobListing.htm?pos=102&ao=1136043&s=58&guid=0000018209b985b3814e13e6abec3f6b&src=GD_JOB_AD&t=SR&vt=w&ea=1&cs=1_11e4da95&cb=1658020530267&jobListingId=1007830003866&jrtk=3-0-1g84rj1gokf0t801-1g84rj1hdghre800-6ad629ee4ebc1885-',
'https://www.glassdoor.com/partner/jobListing.htm?pos=103&ao=1136043&s=58&guid=0000018209b985b3814e13e6abec3f6b&src=GD_JOB_AD&t=SR&vt=w&cs=1_0ae3fe0c&cb=1658020530267&jobListingId=1008006371431&jrtk=3-0-1g84rj1gokf0t801-1g84rj1hdghre800-f24a3ad703626f08-']
for link in links:
    page = requests.get(link, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    div = soup.find(id="JobDescriptionContainer")
    print(div)
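Since this answer already defines a user_agent_list but only uses its first entry, an optional variation (not required for the fix) is to pick a random agent per request, which can help when fetching many links:

import random
import requests
from bs4 import BeautifulSoup

for link in links:
    # Rotate the User-Agent on every request; this only illustrates how the
    # user_agent_list above could be used beyond its first entry.
    headers = {'User-Agent': random.choice(user_agent_list)}
    page = requests.get(link, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    div = soup.find(id="JobDescriptionContainer")
    print(div)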