Empty output when scraping for an id with Beautiful Soup

Time:07-18

I tried the following code to scrape the pages in the links list, but the output is None:

import requests
from bs4 import BeautifulSoup

links = ['https://www.glassdoor.com/partner/jobListing.htm?pos=101&ao=1136043&s=58&guid=0000018209b985b3814e13e6abec3f6b&src=GD_JOB_AD&t=SR&vt=w&ea=1&cs=1_968a1afd&cb=1658020530267&jobListingId=1007823714104&jrtk=3-0-1g84rj1gokf0t801-1g84rj1hdghre800-21320f35b2a9f6f4-', 'https://www.glassdoor.com/partner/jobListing.htm?pos=102&ao=1136043&s=58&guid=0000018209b985b3814e13e6abec3f6b&src=GD_JOB_AD&t=SR&vt=w&ea=1&cs=1_11e4da95&cb=1658020530267&jobListingId=1007830003866&jrtk=3-0-1g84rj1gokf0t801-1g84rj1hdghre800-6ad629ee4ebc1885-', 'https://www.glassdoor.com/partner/jobListing.htm?pos=103&ao=1136043&s=58&guid=0000018209b985b3814e13e6abec3f6b&src=GD_JOB_AD&t=SR&vt=w&cs=1_0ae3fe0c&cb=1658020530267&jobListingId=1008006371431&jrtk=3-0-1g84rj1gokf0t801-1g84rj1hdghre800-f24a3ad703626f08-']
 
for link in links:
    page = requests.get(link)
    soup = BeautifulSoup(page.text, 'html.parser')
    div = soup.find(id="JobDescriptionContainer")
    print(div)

The page HTML looks something like this:

<div id="Job view">
<div>
    <div>
        <div>
            <span>
            <div>
            <div>
                <header>
                <div>
                    <div>
                        <div id="JobDescriptionContainer">
                            <div>
                                <div>   
                                    <p>...text...</p>
                                    <p>...text...</p>
                                    <p>...text...</p>
                                    <h3>Responsibilities</h3>
                                    <ul>
                                        <li>....</li>
                                        <li>....</li>
                                        <li>....</li>
                                    </ul>
                                    <h3>Qualifications</h3>
                                    <ul>
                                        <li>....</li>
                                        <li>....</li>
                                        <li>....</li>
                                    </ul>

I want to collect the information from each link into a data frame. The text I need is everything inside the div whose id is 'JobDescriptionContainer' (the real list contains 900 links). I would also like to put the text under 'Responsibilities' and 'Qualifications' into separate data frame columns. Can someone give me a hand with this?

CodePudding user response:

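You must add a User-Agent to the request headers. Without one, requests identifies itself as python-requests, and the site presumably returns a page that does not contain the element, which is why find() returns None.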

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626 Safari/537.36'}

links = ['https://www.glassdoor.com/partner/jobListing.htm?pos=101&ao=1136043&s=58&guid=0000018209b985b3814e13e6abec3f6b&src=GD_JOB_AD&t=SR&vt=w&ea=1&cs=1_968a1afd&cb=1658020530267&jobListingId=1007823714104&jrtk=3-0-1g84rj1gokf0t801-1g84rj1hdghre800-21320f35b2a9f6f4-', 'https://www.glassdoor.com/partner/jobListing.htm?pos=102&ao=1136043&s=58&guid=0000018209b985b3814e13e6abec3f6b&src=GD_JOB_AD&t=SR&vt=w&ea=1&cs=1_11e4da95&cb=1658020530267&jobListingId=1007830003866&jrtk=3-0-1g84rj1gokf0t801-1g84rj1hdghre800-6ad629ee4ebc1885-', 'https://www.glassdoor.com/partner/jobListing.htm?pos=103&ao=1136043&s=58&guid=0000018209b985b3814e13e6abec3f6b&src=GD_JOB_AD&t=SR&vt=w&cs=1_0ae3fe0c&cb=1658020530267&jobListingId=1008006371431&jrtk=3-0-1g84rj1gokf0t801-1g84rj1hdghre800-f24a3ad703626f08-']

for link in links:
    page = requests.get(link, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    div = soup.find(id="JobDescriptionContainer")
    print(div)
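
Once div is no longer None, the text under 'Responsibilities' and 'Qualifications' that the question asks about could be separated along the following lines. This is only a sketch: it assumes the real page marks those sections with h3 headings followed by li items, as in the asker's outline, and the function name extract_sections and the sample HTML are made up for illustration. Nested p-inside-li markup would be counted twice by this simple walk.

```python
from bs4 import BeautifulSoup


def extract_sections(html):
    """Split the job description into free text plus one entry per <h3> heading."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id="JobDescriptionContainer")
    if container is None:
        # Blocked request or a different page layout: nothing to extract.
        return None

    sections = {"description": []}
    current = "description"
    # Walk headings, paragraphs and list items in document order.
    for tag in container.find_all(["h3", "p", "li"]):
        text = tag.get_text(" ", strip=True)
        if not text:
            continue
        if tag.name == "h3":
            # A heading starts a new section (assumed layout, see above).
            current = text.lower()
            sections.setdefault(current, [])
        else:
            sections[current].append(text)
    return {name: "\n".join(parts) for name, parts in sections.items()}


# Hypothetical sample that mimics the structure described in the question.
sample = """
<div id="JobDescriptionContainer"><div><div>
  <p>We are hiring.</p>
  <h3>Responsibilities</h3>
  <ul><li>Build models</li><li>Ship code</li></ul>
  <h3>Qualifications</h3>
  <ul><li>Python</li></ul>
</div></div></div>
"""
row = extract_sections(sample)
print(row)
```

Collecting one such dict per link in a list and passing that list to pandas.DataFrame would give one column per heading, with missing headings left empty for pages that lack them.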

CodePudding user response:

You need to add a header so that the site treats your program as a browser:

    import requests
    from bs4 import BeautifulSoup

    # Browser-like User-Agent strings; any one of them can be used.
    user_agent_list = [
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    ]
    user_agent = user_agent_list[0]
    # Set the headers
    headers = {'User-Agent': user_agent}

    links = ['https://www.glassdoor.com/partner/jobListing.htm?pos=101&ao=1136043&s=58&guid=0000018209b985b3814e13e6abec3f6b&src=GD_JOB_AD&t=SR&vt=w&ea=1&cs=1_968a1afd&cb=1658020530267&jobListingId=1007823714104&jrtk=3-0-1g84rj1gokf0t801-1g84rj1hdghre800-21320f35b2a9f6f4-',
     'https://www.glassdoor.com/partner/jobListing.htm?pos=102&ao=1136043&s=58&guid=0000018209b985b3814e13e6abec3f6b&src=GD_JOB_AD&t=SR&vt=w&ea=1&cs=1_11e4da95&cb=1658020530267&jobListingId=1007830003866&jrtk=3-0-1g84rj1gokf0t801-1g84rj1hdghre800-6ad629ee4ebc1885-',
     'https://www.glassdoor.com/partner/jobListing.htm?pos=103&ao=1136043&s=58&guid=0000018209b985b3814e13e6abec3f6b&src=GD_JOB_AD&t=SR&vt=w&cs=1_0ae3fe0c&cb=1658020530267&jobListingId=1008006371431&jrtk=3-0-1g84rj1gokf0t801-1g84rj1hdghre800-f24a3ad703626f08-']

    for link in links:
        page = requests.get(link, headers=headers)
        soup = BeautifulSoup(page.content, 'html.parser')
        div = soup.find(id="JobDescriptionContainer")
        print(div)
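
With around 900 links, as the question mentions, it may help to reuse one connection, check the status code before parsing, and pause between requests. The following is a rough sketch rather than a tested recipe for this site: the helper names fetch_description and scrape, the 10-second timeout, and the 1-second delay are all assumptions chosen for illustration.

```python
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) "
                  "Gecko/20100101 Firefox/77.0"
}


def fetch_description(session, url, timeout=10):
    """Return the JobDescriptionContainer element, or None if it is unavailable."""
    response = session.get(url, timeout=timeout)
    if response.status_code != 200:
        # A non-200 status usually means the request was refused or redirected.
        print(f"{url[:60]}: HTTP {response.status_code}")
        return None
    soup = BeautifulSoup(response.content, "html.parser")
    return soup.find(id="JobDescriptionContainer")


def scrape(links, delay=1.0):
    """Fetch every link with one shared session, pausing between requests."""
    results = []
    with requests.Session() as session:
        # Headers set on the session are sent with every request it makes.
        session.headers.update(HEADERS)
        for link in links:
            results.append(fetch_description(session, link))
            time.sleep(delay)  # assumed pause; adjust to what the site tolerates
    return results
```

Calling scrape(links) returns one entry per link, with None marking the pages that could not be fetched or that lack the element, so failures can be retried or inspected separately.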