Why am I not seeing any results in my output from extracting indeed data using python

Time:10-15

I am trying to run this code in IDLE 3.10.6, and I am not seeing any of the data that should be extracted from Indeed. All of this data should appear in the output when I run it, but it doesn't. Below is the code:

#Indeed data
import requests
from bs4 import BeautifulSoup
import pandas as pd

def extract(page):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko"}
    url = "https://www.indeed.com/jobs?q=Data&l=United States&sc=0kf:jt(internship);&vjk=a2f49853f01db3cc={page}"
    r = requests.get(url,headers)
    soup = BeautifulSoup(r.content, "html.parser")
    return soup

def transform(soup):
    divs = soup.find_all("div", class_ = "jobsearch-SerpJobCard")
    for item in divs:
        title = item.find ("a").text.strip()
        company = item.find("span", class_="company").text.strip()
        try:
            salary = item.find("span", class_ = "salarytext").text.strip()
        finally:
            salary =  ""
        summary = item.find("div",{"class":"summary"}).text.strip().replace("\n","")

        job = {
            "title":title,
            "company":company,
            'salary':salary,
            "summary":summary
        }
        joblist.append(job)

joblist = []

for i in range(0,40,10):
    print(f'Getting page, {i}')
    c = extract(10)
    transform(c)

df = pd.DataFrame(joblist)
print(df.head())
df.to_csv('jobs.csv')

Here is the output I get

Getting page, 0
Getting page, 10
Getting page, 20
Getting page, 30
Empty DataFrame
Columns: []
Index: []

Why is this happening, and what should I do to get the extracted data from Indeed? I am trying to get the job title, company, salary, and summary information. Any help would be greatly appreciated.

CodePudding user response:

The URL string includes {page}, but it is not an f-string, so {page} is not interpolated, and the URL you are actually fetching is:

https://www.indeed.com/jobs?q=Data&l=United States&sc=0kf:jt(internship);&vjk=a2f49853f01db3cc={page}

That returns an error page.

So you should add an f before the opening quote when you set url.

Also, you are calling extract(10) each time, instead of extract(i).
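
Both fixes can be checked without touching the network. The following is a minimal sketch; the names build_url, literal and urls are made up for illustration and do not appear in the question's code.

```python
def build_url(page):
    # The f prefix makes Python substitute the value of page for {page}.
    return f"https://www.indeed.com/jobs?q=Data&l=United States&sc=0kf:jt(internship);&vjk=a2f49853f01db3cc={page}"

# Without the f prefix, the braces stay in the string as literal text.
literal = "https://www.indeed.com/jobs?q=Data&l=United States&sc=0kf:jt(internship);&vjk=a2f49853f01db3cc={page}"

# Pass the loop variable i, not the constant 10, so each URL differs.
urls = [build_url(i) for i in range(0, 40, 10)]
for u in urls:
    print(u)
```

Printing the URLs before requesting them is a quick way to confirm that the value changes on every iteration.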

CodePudding user response:

One correct way to build the URL is with str.format:

 url = "https://www.indeed.com/jobs?q=Data&l=United States&sc=0kf:jt(internship);&vjk=a2f49853f01db3cc={page}".format(page=page)

 r = requests.get(url,headers)

Here, r.status_code is 403, which means the request is forbidden: the site is blocking the request, so it will not be fulfilled. Use the Indeed job search API instead.
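
Checking the status code before parsing makes this kind of failure visible instead of silently producing an empty DataFrame. The sketch below is one way to do it, not the only one; fetch_html and its get parameter are hypothetical names added for illustration (get only exists so the helper can be exercised with a stub). Note that the second positional argument of requests.get is params, so the headers dict has to be passed by keyword to be sent as headers.

```python
def fetch_html(url, headers, get=None):
    """Return the page HTML, or None if the server does not answer with 200."""
    if get is None:
        # Imported lazily so the helper can be tried with a stub instead of requests.
        import requests
        get = requests.get
    # headers=... sends the dict as HTTP headers; requests.get(url, headers)
    # would send it as query parameters instead.
    r = get(url, headers=headers)
    if r.status_code != 200:
        print(f"Request failed with status {r.status_code}")
        return None
    return r.text
```

With this in place, extract could return early when fetch_html gives None, so a blocked request is reported rather than parsed as if it were a results page.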
