Scraping Glassdoor returns duplicate entries


So I am trying to scrape job posts from Glassdoor using Requests, Beautiful Soup, and Selenium. The code runs end to end, except that even after scraping data from 30 pages, most entries turn out to be duplicates (almost 80% of them!). It's not a headless scraper, so I can see it navigating to each new page. What could be the reason for so many duplicate entries? Could it be some sort of anti-scraping measure on Glassdoor's side, or is something off in my code?

The result turns out to be 870 entries, of which a whopping 690 are duplicates!
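(I count the duplicates with a quick pandas check along these lines, using the same column names my dataframe ends up with:)

    dupe_mask = glassdoor_dataset.duplicated(
        subset=['Company Name', 'Position Title', 'Location'])
    print(f'{dupe_mask.sum()} of {len(glassdoor_dataset)} rows are duplicates')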

My code:

    
import time

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    url = 'https://www.glassdoor.com/'  # starting point: Glassdoor home page
    companies_list, positions_list, locations_list = [], [], []
    salaries_list, ratings_list = [], []

    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(10)
    
    # Getting to the page where we want to start scraping
    jobs_search_title = driver.find_element(By.ID, 'KeywordSearch')
    jobs_search_title.send_keys('Data Analyst')
    jobs_search_location = driver.find_element(By.ID, 'LocationSearch')
    
    time.sleep(1)
    
    jobs_search_location.clear()
    jobs_search_location.send_keys('United States')
    click_search = driver.find_element(By.ID, 'HeroSearchButton')
    click_search.click()
    
    for page_num in range(1,10):
        time.sleep(10)
        
        res = requests.get(driver.current_url)
        soup = BeautifulSoup(res.text,'html.parser')
        
        time.sleep(2)

        companies = soup.select('.css-l2wjgv.e1n63ojh0.jobLink')
        for company in companies:
            companies_list.append(company.text)
    
        positions = soup.select('.jobLink.css-1rd3saf.eigr9kq2')
        for position in positions:
            positions_list.append(position.text)
    
        locations = soup.select('.css-l2fjlt.pr-xxsm.css-iii9i8.e1rrn5ka0')
        for location in locations:
            locations_list.append(location.text)
    
        job_post = soup.select('.eigr9kq3')
        for job in job_post:
            salary_info = job.select('.e1wijj242')
            if len(salary_info) > 0:
                for salary in salary_info:
                    salaries_list.append(salary.text)
            else:
                salaries_list.append('Salary Not Found')
    
        ratings = soup.select('.e1rrn5ka3')
        for index, rating in enumerate(ratings):
            if len(rating.text) > 0:
                ratings_list.append(rating.text)
            else:
                ratings_list.append('Rating Not Found')
        
        
        next_page = driver.find_elements(By.CLASS_NAME, 'e13qs2073')[1]
        next_page.click()
        time.sleep(5)
        try:
            close_jobalert_popup = driver.find_element(By.CLASS_NAME, 'modal_closeIcon')
        except:
            pass
        else:
            time.sleep(1)
            close_jobalert_popup.click()        
        continue
    
    #driver.close()
    print(f'{len(companies_list)} jobs found for you!')
    
    global glassdoor_dataset
    
    glassdoor_dataset = pd.DataFrame(
    {'Company Name': companies_list,
     'Company Rating': ratings_list,
     'Position Title': positions_list,
     'Location' : locations_list,
     'Est. Salary' : salaries_list
    })
    
    glassdoor_dataset.to_csv(r'glassdoor_jobs_scraped.csv')

CodePudding user response:

You're going way too fast. You need to put in some waits.

I see you are using fixed time.sleep() pauses. Try putting explicit waits instead.

Something like this (substitute your own conditions - you can also wait for an element to become invisible and then visible again, to make sure you are actually on the next page; if that still doesn't help, increase your time.sleep()):

    from selenium.webdriver.support import expected_conditions
    from selenium.webdriver.support.ui import WebDriverWait

    WebDriverWait(driver, 40).until(expected_conditions.visibility_of_element_located(
        (By.XPATH, '//*[@id="wrapper"]/section/div/div/div[2]/button[2]')))
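For example, a rough sketch of the "make sure you actually moved on" idea, waiting for an element from the old page to go stale after clicking next (next_page is the button from your code):

    from selenium.webdriver.support import expected_conditions
    from selenium.webdriver.support.ui import WebDriverWait

    # remember an element from the current page, then click "next"
    old_listing = driver.find_elements(By.CLASS_NAME, 'e13qs2073')[0]
    next_page.click()
    # the old element goes stale once the DOM is replaced,
    # i.e. the next page has really loaded
    WebDriverWait(driver, 40).until(expected_conditions.staleness_of(old_listing))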

CodePudding user response:

I don't think the repetition is due to a code issue - I think Glassdoor just starts cycling results after a while. [If interested, see this gist for some stats - basically, from about the 7th page onward, most of the first-page results seem to be shown on every subsequent page. I did a small manual test, tracking only 5 listings by id, and even in an un-automated browser they started repeating after a while.]
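A rough sketch of the kind of per-page check I mean (it reuses the soup, page_num, and datId_list names from the loop below), counting how many listing ids were already seen on earlier pages:

    # run once per page, before filtering out known listings
    page_ids = {li.get('data-id') for li in soup.select('li[data-id]')}
    repeats = page_ids & set(datId_list)
    print(f'page {page_num}: {len(repeats)}/{len(page_ids)} listings already seen')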

My suggestion would be to filter them out before looping to the next page - each listing is wrapped in an li with a data-id attribute, which seems to be a unique identifier. If we collect those alongside the other columns' lists, we can keep only listings we haven't collected yet; just edit the for page_num loop to:

    datId_list, scrapedUrls = [], []  # initialise once, before the loop
    for page_num in range(1, 10):
        time.sleep(10)

        scrapedUrls.append(driver.current_url)

        res = requests.get(driver.current_url)
        soup = BeautifulSoup(res.text, 'html.parser')
        # soup = BeautifulSoup(driver.page_source, 'html.parser')  # no noticeable improvement

        time.sleep(2)

        filteredListings = [
            di for di in soup.select('li[data-id]') if
            di.get('data-id') not in datId_list
        ]
        datId_list += [di.get('data-id') for di in filteredListings]

        companies_list += [
            t.select_one('.css-l2wjgv.e1n63ojh0.jobLink').get_text(strip=True)
            if t.select_one('.css-l2wjgv.e1n63ojh0.jobLink')
            else None for t in filteredListings
        ]

        positions_list += [
            t.select_one('.jobLink.css-1rd3saf.eigr9kq2').get_text(strip=True)
            if t.select_one('.jobLink.css-1rd3saf.eigr9kq2')
            else None for t in filteredListings
        ]

        locations_list += [
            t.select_one(
                '.css-l2fjlt.pr-xxsm.css-iii9i8.e1rrn5ka0').get_text(strip=True)
            if t.select_one('.css-l2fjlt.pr-xxsm.css-iii9i8.e1rrn5ka0')
            else None for t in filteredListings
        ]

        job_post = [
            t.select('.eigr9kq3 .e1wijj242') for t in filteredListings
        ]
        salaries_list += [
            'Salary Not Found' if not j else
            (j[0].text if len(j) == 1 else [s.text for s in j])
            for j in job_post
        ]

        ratings_list += [
            t.select_one('.e1rrn5ka3').get_text(strip=True)
            if t.select_one('.e1rrn5ka3')
            else 'Rating Not Found' for t in filteredListings
        ]

and, if you add datId_list to the dataframe, it can serve as a meaningful index:

    dfDict = {'Data-Id': datId_list,
              'Company Name': companies_list,
              'Company Rating': ratings_list,
              'Position Title': positions_list,
              'Location': locations_list,
              'Est. Salary': salaries_list
              }
    for k in dfDict:
        print(k, len(dfDict[k]))
    glassdoor_dataset = pd.DataFrame(dfDict)
    glassdoor_dataset = glassdoor_dataset.set_index('Data-Id', drop=True)

    glassdoor_dataset.to_csv(r'glassdoor_jobs_scraped.csv')
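As an extra safety net (in case duplicate ids still slip through, e.g. across separate runs), you could also de-duplicate on the index before the to_csv call:

    # optional final step: keep only the first row per Data-Id
    glassdoor_dataset = glassdoor_dataset[~glassdoor_dataset.index.duplicated(keep='first')]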