BeautifulSoup returns None for any element I try

Time:04-06

I'm building a fully automated get-a-job application. Funnily enough, the automation portion is fairly simple; the scraping, not so much.

In short, requests + BeautifulSoup has worked for the majority of domains I am scraping, but nothing works when I try the same process on Workable pages:

import requests
from bs4 import BeautifulSoup as bs

session = requests.Session()
url = 'https://apply.workable.com/breederdao-1/j/602097ACC9/'
req = session.get(url)
soup = bs(req.text, 'html.parser')

title = soup.find('h1', {'data-ui': 'job-title'})
print(title)

>>> None

details = soup.find('span', {'data-ui': 'job-location'})
print(details)

>>> None

Both elements are under body. However, when I try to fetch the page's title, I do get what I expect:

title_0 = soup.find('title')
print(title_0)

>>> <title>Data Analyst (Fully Remote) - BreederDAO</title>

I tried using HTMLSession / AsyncHTMLSession (with await) as well, but so long as the element is inside of body, every find() still returns None.

Can anyone educate me on this? My current hypothesis is that the website has some kind of anti-scraping mechanism, but I have zero idea where to even start looking. This element does look extra sus though:

<html...
  <head>...</head>
  <body>
    .
    .
    .
    <noscript>
      <iframe height="0" width="0" src="https://www.googletagmanager.com/ns.html?id=GTM-WKS7WTT&amp;gtm_auth=SGnzIn3pcB7S4fevFXOKPQ&amp;gtm_preview=env-2&amp;gtm_cookies_win=x" style="display: none; visibility: hidden;">
        #document
          <!DOCTYPE html>
          <html lang="en">
            <head>
              <meta charset="utf-8">
              <title>ns</title>
            </head>
            <body>
              " "
            </body>
          </html>
      </iframe>
    </noscript>
    .
    .
    .
  </body>
</html>

CodePudding user response:

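The data you see in the browser is loaded from a separate URL via JavaScript, so it is not in the HTML that requests receives, and find() has nothing to match. You can fetch that URL directly with the requests module. For example: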

import json
import requests


# 602097ACC9 is from your URL
url = "https://apply.workable.com/api/v2/accounts/breederdao-1/jobs/602097ACC9"
data = requests.get(url).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

print(data["title"])
print(", ".join(data["location"].values()))

Prints:

Data Analyst (Fully Remote)
Philippines, PH, Makati, Metro Manila
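
If you need this for more than one posting, the API URL can be derived from the job page URL. The following is a minimal sketch, not a documented Workable API: the helper name workable_api_url is made up for illustration, and both the /<account>/j/<job_id>/ path layout and the endpoint pattern are assumptions generalized from the single URL pair above.

```python
from urllib.parse import urlparse

# Endpoint pattern copied from the example above; assumed to hold for other jobs.
API_TEMPLATE = "https://apply.workable.com/api/v2/accounts/{account}/jobs/{job_id}"


def workable_api_url(job_url: str) -> str:
    """Map a public job page URL to the JSON endpoint used above (hypothetical helper)."""
    # Split the path into its non-empty segments, e.g. ['breederdao-1', 'j', '602097ACC9'].
    parts = [p for p in urlparse(job_url).path.split("/") if p]
    # Assumed path layout: /<account>/j/<job_id>/
    if len(parts) != 3 or parts[1] != "j":
        raise ValueError(f"unexpected job URL layout: {job_url!r}")
    account, _, job_id = parts
    return API_TEMPLATE.format(account=account, job_id=job_id)


print(workable_api_url("https://apply.workable.com/breederdao-1/j/602097ACC9/"))
```

The returned string can be passed to requests.get(...).json() exactly as in the snippet above. The ValueError is there so that a page with a different URL layout fails loudly instead of silently requesting the wrong endpoint.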