Web scraping Indeed with Python returning specific results

I was following along with this video, which demonstrates web scraping indeed.com to get data into a CSV file: https://www.youtube.com/watch?v=eN_3d4JrL_w . When I followed it step by step, I got results for a senior accountant in Charlotte, NC. Yet when I tried to run main again, this time for data analyst and remote, it kept giving me the same results. I even tried to set up the URL manually to get the results I needed, but it still gave me the same results. So I copied the bulk of the main code into a new kernel, and it threw an error saying that records was not defined, though it was. I'm new to web scraping and still learning. I'm sure it's something obvious that I just can't see. I would appreciate your help.

My Code:

    # Importing the needed libraries
    import csv
    import requests
    from datetime import datetime
    from bs4 import BeautifulSoup

    #getting the url
    def get_url (position,location):

        templete = 'https://www.indeed.com/jobs?q={}&l={}'
        url = templete.format(position,location)
        return url
    #getting the record
    def get_record(card):
       atag1=card.h2.a.span
       job_title = atag1.get('title')
       atag2= card.h2.a
       job_url='https://indeed.com' + atag2.get('href')
       company =card.find('span','companyName').text.strip()
       Location = card.find('div','companyLocation').text.strip()
       summary =card.find('div','job-snippet').text.strip()
       posted_date = card.find('span','date').text.strip()
       Today = datetime.today().strftime('%Y-%m-%d')

    try: 
        salary =  card.find('div','metadata estimated-salary-container').text.strip()
    except AttributeError:
        salary = ''
        
    record = (job_title, job_url , Location ,company ,posted_date ,Today,summary, salary)
    return record

    #writing the main function
    def main(position,location):
        records = []
        url = get_url(position,location)

        while True:
              response=requests.get(url)
              soup = BeautifulSoup(response.text,'html.parser')
              cards=soup.find_all('div','job_seen_beacon')
              for card in cards:
                  record=get_record(card)
                  records.append(record)
              try:
    
                  url='https://indeed.com' + soup.find('a',{'aria-label':'Next'}).get('href')
              except AttributeError:
                  break
        
     #save the results

      with open('dataanalyst.csv','w',newline='',encoding= 'utf-8') as f:
            writer= csv.writer(f)

            writer.writerow(['JobTitle','Location','PostDate','ExtractionDate', 
            'Summary','Salary','JobUrl'])

            writer.writerows(records)

CodePudding user response：

You somehow managed to take a relatively simple task (scraping jobs from Indeed) and turn it into something unnecessarily complex, with function upon function. Understandably, you got lost in that complexity and messed up the spacing/indentation (and probably the imports). Maybe that's OK as a learning exercise, or maybe not, since getting lost in something complex can discourage you from learning further. The idea is to learn to walk before learning to run. Below is the corrected code (I kept all your functions, just fixed the indentation, and changed the filename to something relevant to the job search):

import requests
from bs4 import BeautifulSoup
import csv 
from datetime import datetime

#getting the url
def get_url (position,location):

    templete = 'https://www.indeed.com/jobs?q={}&l={}'
    url = templete.format(position,location)
    return url
#getting the record
def get_record(card):
    atag1=card.h2.a.span
    job_title = atag1.get('title')
    atag2= card.h2.a
    job_url='https://indeed.com' + atag2.get('href')
    company =card.find('span','companyName').text.strip()
    Location = card.find('div','companyLocation').text.strip()
    summary =card.find('div','job-snippet').text.strip()
    posted_date = card.find('span','date').text.strip()
    Today = datetime.today().strftime('%Y-%m-%d')

    try: 
        salary =  card.find('div','metadata estimated-salary-container').text.strip()
    except AttributeError:
        salary = ''

    record = (job_title, job_url , Location ,company ,posted_date ,Today,summary, salary)
    return record

#writing the main function
def main(position,location):
    records = []
    url = get_url(position,location)

    while True:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        cards = soup.find_all('div', 'job_seen_beacon')
        for card in cards:
            record = get_record(card)
            records.append(record)
        try:
            # follow the "Next" link; find() returns None on the last page
            url = 'https://indeed.com' + soup.find('a', {'aria-label': 'Next'}).get('href')
        except AttributeError:
            break

    # save the results

    with open(f'{position}-{location}.csv','w',newline='',encoding= 'utf-8') as f:
        writer= csv.writer(f)

        writer.writerow(['JobTitle','Location','PostDate','ExtractionDate', 
        'Summary','Salary','JobUrl'])

        writer.writerows(records) 
main('business manager', 'Geneva')
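
The "records not defined" error from the question is consistent with the `with open(...)` block ending up outside `main`: `records` is created inside the function, so it only exists there. Here is a minimal sketch of that scoping rule. The functions `main_broken`, `main_fixed` and `read_records_outside` are hypothetical stand-ins, and the data is a dummy tuple rather than anything scraped:

```python
def main_broken():
    # `records` is a local variable: it is gone once the function returns
    records = [("Data Analyst", "Remote")]


def main_fixed():
    records = [("Data Analyst", "Remote")]
    # Still indented under main_fixed, so `records` is visible here
    return len(records)


def read_records_outside():
    """Mimic code that was dedented out of main but still refers to `records`."""
    main_broken()
    try:
        return records  # not defined in this scope, nor at module level
    except NameError as exc:
        return f"NameError: {exc}"


if __name__ == "__main__":
    print(main_fixed())
    print(read_records_outside())
```

In the corrected script the CSV-writing block is indented one level under `def main`, which is what keeps `records` in scope when `writer.writerows(records)` runs.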

When you run the script, it creates a CSV file named after your job title and location search ('business manager-Geneva.csv' in this case), which looks like this:

                                      JobTitle                                           Location                     PostDate            ExtractionDate           Summary     Salary                                             JobUrl
Manager Global Business Coordination  https://indeed.com/rc/clk?jk=6b7b72bad220482b&...  Hybrid remote in Aurora, IL  ALDI                PostedPosted 2 days ago  2022-07-22  Identifying opportunities for business improve...  NaN
Operations Business Manager           https://indeed.com/rc/clk?jk=abccb02951be5d36&...  Elgin, IL 60123              John B. Sanfilippo  PostedPosted 2 days ago  2022-07-22  Keeps a close eye on business and economic tre...  Estimated  62.2
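
Note that in the output above the values do not line up with their headers. `get_record` returns eight fields (title, URL, location, company, posted date, extraction date, summary, salary), while the header row lists seven names in a different order. One way to keep the two in step is to define the column names once and check every row against them. The sketch below assumes a hypothetical `FIELDS` tuple and `write_records` helper, neither of which is in the original code, and uses made-up sample data rather than real scraped results:

```python
import csv
import io

# Hypothetical single source of truth for the column order (an assumption,
# chosen to match the order of the tuple returned by get_record)
FIELDS = ("JobTitle", "JobUrl", "Location", "Company",
          "PostDate", "ExtractionDate", "Summary", "Salary")


def write_records(f, records):
    """Write a header row and the records, refusing rows of the wrong length."""
    writer = csv.writer(f)
    writer.writerow(FIELDS)
    for record in records:
        if len(record) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(record)}")
        writer.writerow(record)


if __name__ == "__main__":
    # Made-up sample row, in the same order as FIELDS
    sample = [("Example Title", "https://example.com/job", "Example City",
               "Example Co", "Posted 2 days ago", "2022-07-22",
               "Example summary", "")]
    buffer = io.StringIO()
    write_records(buffer, sample)
    print(buffer.getvalue())
```

In `main`, the same helper could be called inside the `with open(...)` block in place of the separate `writerow` and `writerows` calls.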