Hi, I'm doing a Python course, and for one of our assignments today we're supposed to extract the job listings from https://remoteok.com/remote-python-jobs
Here is a screenshot of the HTML in question (image: python jobs f12)
And here is what I've written so far:
import requests
from bs4 import BeautifulSoup

def extract(term):
    url = f"https://remoteok.com/remote-{term}-jobs"
    request = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    if request.status_code == 200:
        soup = BeautifulSoup(request.text, 'html.parser')
        table = soup.find_all('table', id="jobsboard")
        print(len(table))
        for tbody in table:
            tbody.find_all('tbody')
            print(len(tbody))
    else:
        print("can't request website")

extract("python")
print(len(table)) gives me 1, and print(len(tbody)) gives me 131.
So it's pretty clear that I've made a mistake somewhere, but I'm having trouble identifying the cause.
One suspicion I have is that when I request the HTML text and parse it with BeautifulSoup, I am not getting the full web page. But otherwise, I'm really not sure what I'm doing wrong here.
CodePudding user response:
tbody does appear on the web page, but BeautifulSoup does not pull it into the table variable.
I have encountered this before. The solution is to get your tags directly from Selenium.
But there is only one jobsboard and one tbody on the web page, so you could just skip tbody and look for a more useful tag.
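As a rough illustration of skipping tbody, here is a minimal sketch that runs offline. SAMPLE_HTML is a made-up stand-in for the real page, the "ad" row is invented for contrast, and the tr.job class is assumed to match the site's markup. It also shows two things about the question's code: len() on a tag counts its direct children, and the result of find_all() has to be assigned before it can be used.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page, so the sketch runs without a network call.
SAMPLE_HTML = """
<table id="jobsboard">
  <tr class="job"><td>Job A</td></tr>
  <tr class="job"><td>Job B</td></tr>
  <tr class="ad"><td>Sponsored</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", id="jobsboard")

# len() on a Tag counts its direct children (whitespace text nodes included),
# not the number of matches of any search.
children = len(table)

# find_all() returns a new list; it does nothing useful unless the result is kept.
tbodies = table.find_all("tbody")

# Selecting the job rows directly avoids depending on <tbody> at all.
rows = soup.select("table#jobsboard tr.job")

print(children, len(tbodies), len(rows))
for row in rows:
    print(row.get_text(strip=True))
```

On the real page the same selector would be applied to the soup built from request.text; whether every listing carries the job class is something to confirm in the browser inspector.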
I use Google Chrome. It has the free extension ChroPath, which makes it super easy to identify selectors. I just right click on text in a browser and select Inspect, sometimes twice, and the correct HTML tag is highlighted.
PyCharm allows you to view the contents of each variable with ease.
This code will allow you to view the web page HTML source code in a text file:
outputFile = r"C:\Users\user\Documents\HP Laptop\Documents\Documents\Jobs\DIT\IDMB\OutputZ.txt"

def update_output_file(pageSource: str):
    # The with block closes the file automatically, so no explicit close() is needed.
    with open(outputFile, 'w', encoding='utf-8') as f:
        f.write(pageSource)
CodePudding user response:
Requests does not manipulate or render a website the way a browser does; it only provides the static HTML. The website's content is generated dynamically by JavaScript that converts some JSON data into the page structure.
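One way to check what the static HTML actually contains is to look for embedded JSON in it before reaching for a browser. The following is only a sketch using the standard library: SAMPLE_HTML is a made-up stand-in for request.text, JsonLdCollector is a hypothetical helper name, and the assumption is that the JSON sits in script tags of type application/ld+json.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the parsed content of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a JSON-LD script tag opens.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Parse the buffered text once the script tag closes.
        if tag == "script" and self._in_json_ld:
            self.blocks.append(json.loads("".join(self._buffer)))
            self._in_json_ld = False


# Made-up stand-in for request.text, so the sketch runs offline.
SAMPLE_HTML = """
<table id="jobsboard">
  <tr class="job">
    <td><script type="application/ld+json">
      {"@type": "JobPosting", "title": "Example Python Developer"}
    </script></td>
  </tr>
</table>
"""

collector = JsonLdCollector()
collector.feed(SAMPLE_HTML)
collector.close()

print(len(collector.blocks), "embedded JSON-LD block(s) found")
for block in collector.blocks:
    print(block.get("title"))
```

If the count is zero for the real response, the data is probably not in the static HTML at all and a browser-driven tool would be needed instead.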
Use this to extract your data (it needs import json and the soup object from the question):
[json.loads(e.text.strip()) for e in soup.select('table tr.job [type="application/ld+json"]')]
Result:
[{'@context': 'http://schema.org', '@type': 'JobPosting', 'datePosted': '2022-09-04T05:21:13 00:00', 'description': 'About the Team\n\nThe Design Infrastructure team designs, builds, and ships the Design System foundations and UI components used in all of DoorDash’s products, on all platforms. Specifically, the iOS team works closely with designers and product engineering teams across the company to help shape the Design System, and owns the shared UI library for iOS – developed for both SwiftUI and UIKit.\nAbout the Role\n\nWe are looking for a lead iOS engineer who has a strong passion for UI components and working very closely with design. As part of the role you will be leading the iOS initiative for our Design System, which will include working closely with designers and iOS engineers on product teams to align, develop, maintain, and evolve the library of foundations and UI components; which is adopted in all our products.\n\nYou will report into the Lead Design Technologist for Mobile on our Design Infrastructure team in our Product Design organization. 
This role is 100% flexible, and can b\n Apply now and work remotely at DoorDash', 'baseSalary': {'@type': 'MonetaryAmount', 'currency': 'USD', 'value': {'@type': 'QuantitativeValue', 'minValue': 70000, 'maxValue': 120000, 'unitText': 'YEAR'}}, 'employmentType': 'FULL_TIME', 'directApply': 'http://schema.org/False', 'industry': 'Startups', 'jobLocationType': 'TELECOMMUTE', 'jobLocation': [{'address': {'@type': 'PostalAddress', 'addressCountry': 'United States', 'addressRegion': 'Anywhere', 'streetAddress': 'Anywhere', 'postalCode': 'Anywhere', 'addressLocality': 'Anywhere'}}], 'applicantLocationRequirements': [{'@type': 'Country', 'name': 'United States'}], 'title': 'Lead Design Technologist iOS', 'image': 'https://remoteok.com/assets/img/jobs/f2f1ab68227768717536a0ab7e2578ab1662268873.png', 'occupationalCategory': 'Lead Design Technologist iOS', 'workHours': 'Flexible', 'validThrough': '2022-12-03T05:21:13 00:00', 'hiringOrganization': {'@type': 'Organization', 'name': 'DoorDash', 'url': 'https://remoteok.com/doordash', 'sameAs': 'https://remoteok.com/doordash', 'logo': {'@type': 'ImageObject', 'url': 'https://remoteok.com/assets/img/jobs/f2f1ab68227768717536a0ab7e2578ab1662268873.png'}}}, {'@context': 'http://schema.org', '@type': 'JobPosting', 'datePosted': '2022-09-03T00:00:09 00:00', 'description': "We’re seeking a senior core, distributed systems engineers to build dev tools. At [Iterative](https://iterative.ai) we build [DVC](https://dvc.org) (9000 ⭐on GitHub) and [CML](https://cml.dev) (2000 ⭐ on GitHub) and a few other projects that are not released yet. It's a great opportunity if you love open source, dev tools, systems programming, and remote work. Join our well-funded remote-first team to build developer tools to see how your code is used by thousands of developers every day!\n\nABOUT YOU\n\n- Excellent communication skills and a positive mindset